# Theoretical Foundations of scvi-tools
This document explains the mathematical and statistical principles underlying scvi-tools.
## Core Concepts
### Variational Inference
**What is it?**
Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x) - the probability of latent variables z given observed data x.
**Why use it?**
- Exact inference is computationally intractable for complex models
- Scales to large datasets (millions of cells)
- Provides uncertainty quantification
- Enables Bayesian reasoning about cell states
**How does it work?**
1. Define a simpler approximate distribution q(z|x) with learnable parameters
2. Minimize the KL divergence between q(z|x) and true posterior p(z|x)
3. Equivalent to maximizing the Evidence Lower Bound (ELBO)
**ELBO Objective**:
```
ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
        reconstruction    regularization
```
- **Reconstruction term**: Model should generate data similar to observed
- **Regularization term**: Latent representation should match prior
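As a concrete illustration, the sketch below computes the two ELBO terms in PyTorch for a diagonal-Gaussian q(z|x) and a standard-normal prior. The `decoder` argument and the Poisson reconstruction term are placeholders for illustration; scvi-tools itself uses the count likelihoods described later in this document.
```python
import torch

def elbo_terms(x, mu_z, log_var_z, decoder):
    """Return (reconstruction, KL) for one mini-batch.

    x               : (cells, genes) observed counts
    mu_z, log_var_z : (cells, latent) encoder outputs defining q(z|x)
    decoder         : maps z to per-gene rates (Poisson used here for brevity)
    """
    # Reparameterized sample z ~ q(z|x)
    std = torch.exp(0.5 * log_var_z)
    z = mu_z + std * torch.randn_like(std)

    # Reconstruction term: single-sample Monte Carlo estimate of E_q[log p(x|z)]
    recon = torch.distributions.Poisson(decoder(z)).log_prob(x).sum(-1)

    # Regularization term: KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = 0.5 * (mu_z.pow(2) + log_var_z.exp() - 1.0 - log_var_z).sum(-1)

    # ELBO estimate = recon - kl; training minimizes -(recon - kl)
    return recon.mean(), kl.mean()
```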
### Variational Autoencoders (VAEs)
**Architecture**:
```
x (observed data)
        ↓
[Encoder Neural Network]
        ↓
z (latent representation)
        ↓
[Decoder Neural Network]
        ↓
x̂ (reconstructed data)
```
**Encoder**: Maps cells (x) to latent space (z)
- Learns q(z|x), the approximate posterior
- Parameterized by neural network with learnable weights
- Outputs mean and variance of latent distribution
**Decoder**: Maps latent space (z) back to gene space
- Learns p(x|z), the likelihood
- Generates gene expression from latent representation
- Models count distributions (Negative Binomial, Zero-Inflated NB)
**Reparameterization Trick**:
- Allows backpropagation through stochastic sampling
- Sample z = μ + σ ⊙ ε, where ε ~ N(0, I)
- Enables end-to-end training with gradient descent
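A minimal demonstration of why the trick matters: because z is a deterministic function of (μ, σ, ε), gradients of any loss built on z flow back to the encoder outputs. All names below are illustrative.
```python
import torch

mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)

eps = torch.randn(4)                       # noise sampled outside the graph
z = mu + torch.exp(0.5 * log_var) * eps    # deterministic in (mu, log_var) given eps

z.sum().backward()
print(mu.grad)       # gradients reach the encoder outputs: tensor([1., 1., 1., 1.])
print(log_var.grad)  # equals 0.5 * exp(0.5 * log_var) * eps
```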
### Amortized Inference
**Concept**: Share encoder parameters across all cells.
**Traditional inference**: Learn separate latent variables for each cell
- n_cells × n_latent parameters
- Doesn't scale to large datasets
**Amortized inference**: Learn single encoder for all cells
- Fixed number of parameters regardless of cell count
- Enables fast inference on new cells
- Transfers learned patterns across dataset
**Benefits**:
- Scalable to millions of cells
- Fast inference on query data
- Leverages shared structure across cells
- Enables few-shot learning
## Statistical Modeling
### Count Data Distributions
Single-cell data are counts (integer-valued), requiring appropriate distributions.
#### Negative Binomial (NB)
```
x ~ NB(μ, θ)
```
- **μ (mean)**: Expected expression level
- **θ (dispersion)**: Controls variance
- **Variance**: Var(x) = μ + μ²/θ
**When to use**: Gene expression without zero-inflation
- More flexible than Poisson (allows overdispersion)
- Models technical and biological variation
#### Zero-Inflated Negative Binomial (ZINB)
```
x ~ π·δ₀ + (1-π)·NB(μ, θ)
```
- **π (dropout rate)**: Probability of technical zero
- **δ₀**: Point mass at zero
- **NB(μ, θ)**: Expression when not dropped out
**When to use**: Sparse scRNA-seq data
- Models technical dropout separately from biological zeros
- Better fit for highly sparse data (e.g., 10x data)
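For concreteness, here is a minimal PyTorch sketch of the NB and ZINB log-likelihoods in the (μ, θ) parameterization above. It is illustrative only; scvi-tools ships its own implementations (see `scvi.distributions`).
```python
import torch

def nb_log_prob(x, mu, theta, eps=1e-8):
    """log NB(x; mu, theta) with mean mu and inverse dispersion theta,
    so that Var(x) = mu + mu**2 / theta as in the text above."""
    log_theta_mu = torch.log(theta + mu + eps)
    return (
        theta * (torch.log(theta + eps) - log_theta_mu)
        + x * (torch.log(mu + eps) - log_theta_mu)
        + torch.lgamma(x + theta)
        - torch.lgamma(theta)
        - torch.lgamma(x + 1.0)
    )

def zinb_log_prob(x, mu, theta, pi, eps=1e-8):
    """log ZINB(x; mu, theta, pi): mixture of a point mass at zero (prob. pi)
    and NB(mu, theta) (prob. 1 - pi)."""
    nb = nb_log_prob(x, mu, theta, eps)
    zero_case = torch.log(pi + (1.0 - pi) * torch.exp(nb) + eps)
    nonzero_case = torch.log(1.0 - pi + eps) + nb
    return torch.where(x < eps, zero_case, nonzero_case)
```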
#### Poisson
```
x ~ Poisson(μ)
```
- Simplest count distribution
- Mean equals variance: Var(x) = μ
**When to use**: Less common for scRNA-seq; sometimes used for ATAC-seq fragment counts
- More restrictive than NB
- Faster computation
### Batch Correction Framework
**Problem**: Technical variation confounds biological signal
- Different sequencing runs, protocols, labs
- Must remove technical effects while preserving biology
**scvi-tools approach**:
1. Encode batch as categorical variable s
2. Include s in generative model
3. Latent space z is batch-invariant
4. Decoder conditions on s for batch-specific effects
**Mathematical formulation**:
```
Encoder: q(z|x, s) - batch-aware encoding
Latent: z - batch-corrected representation
Decoder: p(x|z, s) - batch-specific decoding
```
**Key insight**: Batch info flows through decoder, not latent space
- z captures biological variation
- s explains technical variation
- Separable biology and batch effects
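A hedged sketch of how this looks with the scvi-tools API; `adata`, the `"batch"` column, and the `"batch0"` category are placeholders for your own data.
```python
import scvi

# Register the batch covariate s so the model can condition on it
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")

model = scvi.model.SCVI(adata, n_latent=10)
model.train()

# z: batch-corrected latent representation (output of the encoder)
latent = model.get_latent_representation()

# Decoded expression with every cell "projected" into one batch,
# i.e. p(x|z, s) evaluated at a fixed s
expr_in_one_batch = model.get_normalized_expression(transform_batch="batch0")
```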
### Deep Generative Modeling
**Generative model**: Learns p(x), the data distribution
**Process**:
1. Sample latent variable: z ~ p(z) = N(0, I)
2. Generate expression: x ~ p(x|z)
3. Joint distribution: p(x, z) = p(x|z)p(z)
**Benefits**:
- Generate synthetic cells
- Impute missing values
- Quantify uncertainty
- Perform counterfactual predictions
**Inference network**: Inverts generative process
- Given x, infer z
- q(z|x) approximates true posterior p(z|x)
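The three-step generative process can be sketched directly in PyTorch; the decoder below is an untrained stand-in and the Poisson likelihood is used only for brevity (scVI uses NB/ZINB).
```python
import torch

n_cells, n_latent, n_genes = 100, 10, 2000

# 1. Sample latent variables from the prior p(z) = N(0, I)
z = torch.randn(n_cells, n_latent)

# 2. Map z to per-gene rates with a (hypothetical, untrained) decoder
decoder = torch.nn.Sequential(
    torch.nn.Linear(n_latent, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_genes), torch.nn.Softplus(),
)
rate = decoder(z)

# 3. Sample synthetic counts x ~ p(x|z)
x = torch.distributions.Poisson(rate).sample()
```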
## Model Architecture Details
### scVI Architecture
**Input**: Gene expression counts x ∈ ℕ^G (G genes)
**Encoder**:
```
h = ReLU(W₁·x + b₁)
μ_z = W₂·h + b₂
log σ²_z = W₃·h + b₃
z ~ N(μ_z, σ²_z)
```
**Latent space**: z ∈ ℝ^d (typically d = 10-30)
**Decoder**:
```
h = ReLU(W₄·z + b₄)
μ = softmax(W₅·h + b₅) · library_size
θ = exp(W₆·h + b₆)
π = sigmoid(W₇·h + b₇) # for ZINB
x ~ ZINB(μ, θ, π)
```
**Loss function (ELBO)**:
```
L = E_q[log p(x|z)] - KL(q(z|x) || N(0,I))
```
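The module below is a stripped-down, illustrative translation of these equations (single hidden layer, NB likelihood, no batch covariates), reusing the `nb_log_prob` helper sketched earlier; it is not the actual scvi-tools implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySCVI(nn.Module):
    """Minimal VAE mirroring the equations above (NB likelihood only)."""

    def __init__(self, n_genes, n_latent=10, n_hidden=128):
        super().__init__()
        self.enc_hidden = nn.Linear(n_genes, n_hidden)
        self.enc_mu = nn.Linear(n_hidden, n_latent)
        self.enc_logvar = nn.Linear(n_hidden, n_latent)
        self.dec_hidden = nn.Linear(n_latent, n_hidden)
        self.dec_scale = nn.Linear(n_hidden, n_genes)        # softmax -> gene proportions
        self.log_theta = nn.Parameter(torch.zeros(n_genes))  # gene-wise dispersion

    def forward(self, x):
        library = x.sum(dim=-1, keepdim=True)  # observed library size

        # Encoder: q(z|x) = N(mu_z, diag(sigma_z^2)); log1p is a common
        # numerical-stability choice (the equations above use x directly)
        h = F.relu(self.enc_hidden(torch.log1p(x)))
        mu_z, logvar_z = self.enc_mu(h), self.enc_logvar(h)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)

        # Decoder: p(x|z) = NB(mu, theta), mu = softmax(...) * library_size
        h = F.relu(self.dec_hidden(z))
        mu = F.softmax(self.dec_scale(h), dim=-1) * library
        theta = torch.exp(self.log_theta)

        # Negative ELBO = -E_q[log p(x|z)] + KL(q(z|x) || N(0, I))
        recon = nb_log_prob(x, mu, theta).sum(-1)  # helper from the sketch above
        kl = 0.5 * (mu_z.pow(2) + logvar_z.exp() - 1.0 - logvar_z).sum(-1)
        return (kl - recon).mean()
```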
### Handling Covariates
**Categorical covariates** (batch, donor, etc.):
- One-hot encoded: s ∈ {0,1}^K
- Concatenate with latent: [z, s]
- Or use conditional layers
**Continuous covariates** (library size, percent_mito):
- Standardize to zero mean, unit variance
- Include in encoder and/or decoder
**Covariate injection strategies**:
- **Concatenation**: [z, s] fed to decoder
- **Deep injection**: s added at multiple layers
- **Conditional batch norm**: Batch-specific normalization
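With the scvi-tools API, covariates are declared at setup time; the column names below are placeholders.
```python
import scvi

# Column names in adata.obs are placeholders
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="batch",
    categorical_covariate_keys=["donor"],
    continuous_covariate_keys=["percent_mito"],
)
model = scvi.model.SCVI(adata)
model.train()
```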
## Advanced Theoretical Concepts
### Transfer Learning (scArches)
**Concept**: Use pretrained model as initialization for new data
**Process**:
1. Train reference model on large dataset
2. Freeze encoder parameters
3. Fine-tune decoder on query data
4. Or fine-tune all with lower learning rate
**Why it works**:
- Encoder learns general cellular representations
- Decoder adapts to query-specific characteristics
- Prevents catastrophic forgetting
**Applications**:
- Query-to-reference mapping
- Few-shot learning for rare cell types
- Rapid analysis of new datasets
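A sketch of the reference/query workflow with the scvi-tools API; paths, column names, and training arguments are illustrative, and exact argument names may differ across versions.
```python
import scvi

# 1. Train and save the reference model
scvi.model.SCVI.setup_anndata(adata_ref, batch_key="batch")
ref_model = scvi.model.SCVI(adata_ref)
ref_model.train()
ref_model.save("reference_model")

# 2-4. Load the query data into the reference architecture and fine-tune;
# weight_decay=0 is a common choice to avoid drifting far from the reference weights
query_model = scvi.model.SCVI.load_query_data(adata_query, "reference_model")
query_model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Map query cells into the reference latent space
query_latent = query_model.get_latent_representation()
```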
### Multi-Resolution Modeling (MrVI)
**Idea**: Separate shared and sample-specific variation
**Latent space decomposition**:
```
z = z_shared + z_sample
```
- **z_shared**: Common across samples
- **z_sample**: Sample-specific effects
**Hierarchical structure**:
```
Sample level: ρ_s ~ N(0, I)
Cell level: z_i ~ N(ρ_{s(i)}, σ²)
```
**Benefits**:
- Disentangle biological sources of variation
- Compare samples at different resolutions
- Identify sample-specific cell states
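The hierarchical prior above can be simulated in a few lines; this illustrates only the sampling structure, not MrVI's actual architecture.
```python
import torch

n_samples, n_cells, n_latent, sigma = 5, 1000, 10, 0.5

# Sample-level latent variables: rho_s ~ N(0, I)
rho = torch.randn(n_samples, n_latent)

# Assign each cell i to a sample s(i), then draw z_i ~ N(rho_{s(i)}, sigma^2 I)
s = torch.randint(0, n_samples, (n_cells,))
z = rho[s] + sigma * torch.randn(n_cells, n_latent)
```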
### Counterfactual Prediction
**Goal**: Predict outcome under different conditions
**Example**: "What would this cell look like if from different batch?"
**Method**:
1. Encode cell to latent: z = Encoder(x, s_original)
2. Decode with new condition: x_new = Decoder(z, s_new)
3. x_new is counterfactual prediction
**Applications**:
- Batch effect assessment
- Predicting treatment response
- In silico perturbation studies
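In scvi-tools this pattern is exposed through batch transformation of the decoded expression; the batch names below are placeholders and `model` is a trained scVI model.
```python
# "What would these cells look like if they had come from batch_B?"
observed = model.get_normalized_expression(transform_batch="batch_A")
counterfactual = model.get_normalized_expression(transform_batch="batch_B")

# Per-gene shift attributable to the batch covariate alone
batch_shift = counterfactual.mean(axis=0) - observed.mean(axis=0)
```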
### Posterior Predictive Distribution
**Definition**: Distribution of new data given observed data
```
p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz
```
**Estimation**: Sample z from q(z|x), generate x_new from p(x_new|z)
**Uses**:
- Uncertainty quantification
- Robust predictions
- Outlier detection
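A hedged usage sketch, assuming a trained scVI model exposes `posterior_predictive_sample` (argument names and return shapes may differ across scvi-tools versions):
```python
# Posterior predictive counts for the first 100 cells, 25 draws each
samples = model.posterior_predictive_sample(indices=list(range(100)), n_samples=25)

# Comparing the replicate count matrices in `samples` to the observed counts
# gives per-cell, per-gene uncertainty and flags poorly fit (outlier) cells
```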
## Differential Expression Framework
### Bayesian Approach
**Traditional methods**: Compare point estimates
- Wilcoxon, t-test, etc.
- Ignore uncertainty
- Require pseudocounts
**scvi-tools approach**: Compare distributions
- Sample from posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
- Compute log fold-change: LFC = log(μ_B) - log(μ_A)
- Posterior distribution of LFC quantifies uncertainty
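A sketch of the sampling idea using decoded expression draws from a trained model; `idx_A`/`idx_B` are placeholder cell indices, return shapes may vary across scvi-tools versions, and the built-in `differential_expression` method (shown later) does this more carefully.
```python
import numpy as np

# Posterior samples of decoded expression for the two populations
mu_A = model.get_normalized_expression(indices=idx_A, n_samples=25, return_mean=False)
mu_B = model.get_normalized_expression(indices=idx_B, n_samples=25, return_mean=False)
mu_A, mu_B = np.asarray(mu_A), np.asarray(mu_B)

# Average over cells within each group for every posterior draw, giving a
# Monte Carlo posterior over the per-gene log fold-change
lfc_posterior = np.log2(mu_B.mean(axis=1) + 1e-6) - np.log2(mu_A.mean(axis=1) + 1e-6)
lfc_mean, lfc_std = lfc_posterior.mean(axis=0), lfc_posterior.std(axis=0)
```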
### Bayes Factor
**Definition**: Ratio of posterior odds to prior odds
```
     P(H₁|data) / P(H₀|data)
BF = ───────────────────────
          P(H₁) / P(H₀)
```
**Interpretation**:
- BF > 3: Moderate evidence for H₁
- BF > 10: Strong evidence
- BF > 100: Decisive evidence
**In scvi-tools**: Used to rank genes by evidence for DE
### False Discovery Proportion (FDP)
**Goal**: Control expected false discovery rate
**Procedure**:
1. For each gene, compute posterior probability of DE
2. Rank genes by evidence (Bayes factor)
3. Select top k genes such that E[FDP] ≤ α
**Advantage over p-values**:
- Fully Bayesian
- Natural for posterior inference
- No arbitrary thresholds
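In practice, all three pieces (posterior LFCs, Bayes factors, FDR-controlled gene sets) come out of a single call; the column names below reflect typical scvi-tools output but may differ across versions, and the group labels are placeholders.
```python
# Bayesian DE between two (placeholder) groups defined in adata.obs["cell_type"]
de = model.differential_expression(
    groupby="cell_type", group1="B cells", group2="T cells"
)

# 'bayes_factor' ranks genes by evidence for DE, 'lfc_mean'/'lfc_median' summarize
# the posterior LFC, and 'is_de_fdr_0.05' applies the E[FDP] <= 0.05 control
top_genes = de.sort_values("bayes_factor", ascending=False).head(20)
```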
## Implementation Details
### Optimization
**Optimizer**: Adam (adaptive learning rates)
- Default lr = 0.001
- Momentum parameters: β₁=0.9, β₂=0.999
**Training loop**:
1. Sample mini-batch of cells
2. Compute ELBO loss
3. Backpropagate gradients
4. Update parameters with Adam
5. Repeat until convergence
**Convergence criteria**:
- ELBO plateaus on validation set
- Early stopping prevents overfitting
- Typically 200-500 epochs
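An illustrative training call with these knobs exposed; argument names follow the scvi-tools `train()` API, but defaults may differ across versions.
```python
model.train(
    max_epochs=400,
    early_stopping=True,        # stop once the validation ELBO plateaus
    plan_kwargs={"lr": 1e-3},   # Adam learning rate (Adam is the default optimizer)
)
```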
### Regularization
**KL annealing**: Gradually increase KL weight
- Prevents posterior collapse
- Starts at 0, increases to 1 over epochs
**Dropout**: Random neuron dropping during training
- Default: 0.1 dropout rate
- Prevents overfitting
- Improves generalization
**Weight decay**: L2 regularization on weights
- Prevents large weights
- Improves stability
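The annealing schedule itself is just a ramp; scvi-tools exposes it through training-plan options (e.g., a KL warmup length), and the sketch below shows only the idea.
```python
def kl_weight(epoch, n_warmup_epochs=400):
    """Linear KL annealing: weight ramps from 0 to 1 over the warmup period."""
    return min(1.0, epoch / n_warmup_epochs)

# Per-epoch objective: loss = -reconstruction + kl_weight(epoch) * kl
```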
### Scalability
**Mini-batch training**:
- Process subset of cells per iteration
- Batch size: 64-256 cells
- Enables scaling to millions of cells
**Stochastic optimization**:
- Estimates ELBO on mini-batches
- Unbiased gradient estimates
- Converges to a local optimum of the ELBO in practice
**GPU acceleration**:
- Neural networks naturally parallelize
- Order of magnitude speedup
- Essential for large datasets
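Mini-batch size and hardware placement are set at train time; the argument names below follow recent scvi-tools / Lightning conventions and may vary by version.
```python
model.train(
    batch_size=128,       # cells per mini-batch
    accelerator="gpu",    # passed through to the Lightning trainer
    devices=1,
)
```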
## Connections to Other Methods
### vs. PCA
- **PCA**: Linear, deterministic
- **scVI**: Nonlinear, probabilistic
- **Advantage**: scVI captures complex structure, handles counts
### vs. t-SNE/UMAP
- **t-SNE/UMAP**: Visualization-focused
- **scVI**: Full generative model
- **Advantage**: scVI enables downstream tasks (DE, imputation)
### vs. Seurat Integration
- **Seurat**: Anchor-based alignment
- **scVI**: Probabilistic modeling
- **Advantage**: scVI provides uncertainty estimates and scales naturally to many batches
### vs. Harmony
- **Harmony**: PCA + batch correction
- **scVI**: VAE-based
- **Advantage**: scVI handles counts natively, more flexible
## Mathematical Notation
**Common symbols**:
- x: Observed gene expression (counts)
- z: Latent representation
- θ: Model parameters
- q(z|x): Approximate posterior (encoder)
- p(x|z): Likelihood (decoder)
- p(z): Prior on latent variables
- μ, σ²: Mean and variance
- π: Dropout probability (ZINB)
- θ (in NB): Dispersion parameter
- s: Batch/covariate indicator
## Further Reading
**Key Papers**:
1. Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
2. Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics"
3. Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"
**Concepts to explore**:
- Variational inference in machine learning
- Bayesian deep learning
- Information theory (KL divergence, mutual information)
- Generative models (GANs, normalizing flows, diffusion models)
- Probabilistic programming (e.g., Pyro, built on PyTorch)
**Mathematical background**:
- Probability theory and statistics
- Linear algebra and calculus
- Optimization theory
- Information theory