
Theoretical Foundations of scvi-tools

This document explains the mathematical and statistical principles underlying scvi-tools.

Core Concepts

Variational Inference

What is it? Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x) - the probability of latent variables z given observed data x.

Why use it?

  • Exact inference is computationally intractable for complex models
  • Scales to large datasets (millions of cells)
  • Provides uncertainty quantification
  • Enables Bayesian reasoning about cell states

How does it work?

  1. Define a simpler approximate distribution q(z|x) with learnable parameters
  2. Minimize the KL divergence between q(z|x) and true posterior p(z|x)
  3. Equivalent to maximizing the Evidence Lower Bound (ELBO)

ELBO Objective:

ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
       ↑                    ↑
  Reconstruction          Regularization
  • Reconstruction term: Model should generate data similar to observed
  • Regularization term: Latent representation should match prior
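
A minimal PyTorch sketch of these two terms for a diagonal-Gaussian q(z|x) and a standard normal prior (names and shapes are illustrative, not the scvi-tools internals):

import torch

def elbo(recon_log_prob, mu_z, logvar_z):
    # Reconstruction term: E_q[log p(x|z)], estimated with one Monte Carlo sample;
    # recon_log_prob holds log p(x|z) per gene for that sample (cells x genes)
    reconstruction = recon_log_prob.sum(dim=-1)
    # Regularization term: KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar_z - mu_z.pow(2) - logvar_z.exp(), dim=-1)
    return (reconstruction - kl).mean()  # average over cells in the mini-batch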

Variational Autoencoders (VAEs)

Architecture:

x (observed data)
    ↓
[Encoder Neural Network]
    ↓
z (latent representation)
    ↓
[Decoder Neural Network]
    ↓
x̂ (reconstructed data)

Encoder: Maps cells (x) to latent space (z)

  • Learns q(z|x), the approximate posterior
  • Parameterized by neural network with learnable weights
  • Outputs mean and variance of latent distribution

Decoder: Maps latent space (z) back to gene space

  • Learns p(x|z), the likelihood
  • Generates gene expression from latent representation
  • Models count distributions (Negative Binomial, Zero-Inflated NB)

Reparameterization Trick:

  • Allows backpropagation through stochastic sampling
  • Sample z = μ + σ ⊙ ε, where ε ~ N(0, I)
  • Enables end-to-end training with gradient descent
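
A minimal PyTorch sketch of the trick (illustrative, not the scvi-tools source):

import torch

def sample_latent(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); the randomness is external to the
    # network, so gradients flow through mu and logvar during backpropagation
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps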

Amortized Inference

Concept: Share encoder parameters across all cells.

Traditional inference: Learn separate latent variables for each cell

  • n_cells × n_latent parameters
  • Doesn't scale to large datasets

Amortized inference: Learn single encoder for all cells

  • Fixed number of parameters regardless of cell count
  • Enables fast inference on new cells
  • Transfers learned patterns across the dataset

Benefits:

  • Scalable to millions of cells
  • Fast inference on query data
  • Leverages shared structure across cells
  • Enables few-shot learning

Statistical Modeling

Count Data Distributions

Single-cell data are counts (integer-valued), requiring appropriate distributions.

Negative Binomial (NB)

x ~ NB(μ, θ)
  • μ (mean): Expected expression level
  • θ (dispersion): Controls variance
  • Variance: Var(x) = μ + μ²/θ

When to use: Gene expression without zero-inflation

  • More flexible than Poisson (allows overdispersion)
  • Models technical and biological variation
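
As a sanity check on the mean/dispersion parameterization above, a short sketch using torch.distributions (the conversion to PyTorch's total_count/logits parameterization is spelled out; this is illustrative, not how scvi-tools implements its likelihoods):

import torch
from torch.distributions import NegativeBinomial

mu, theta = torch.tensor(5.0), torch.tensor(2.0)
# PyTorch's NB uses total_count (= theta) and logits = log(mu) - log(theta)
nb = NegativeBinomial(total_count=theta, logits=mu.log() - theta.log())
print(nb.mean)      # 5.0  (= mu)
print(nb.variance)  # 17.5 (= mu + mu^2 / theta)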

Zero-Inflated Negative Binomial (ZINB)

x ~ π·δ₀ + (1-π)·NB(μ, θ)
  • π (dropout rate): Probability of technical zero
  • δ₀: Point mass at zero
  • NB(μ, θ): Expression when not dropped out

When to use: Sparse scRNA-seq data

  • Models technical dropout separately from biological zeros
  • Better fit for highly sparse data (e.g., 10x data)

Poisson

x ~ Poisson(μ)
  • Simplest count distribution
  • Mean equals variance: Var(x) = μ

When to use: Less common; ATAC-seq fragment counts

  • More restrictive than NB
  • Faster computation

Batch Correction Framework

Problem: Technical variation confounds biological signal

  • Different sequencing runs, protocols, labs
  • Must remove technical effects while preserving biology

scvi-tools approach:

  1. Encode batch as categorical variable s
  2. Include s in generative model
  3. Latent space z is batch-invariant
  4. Decoder conditions on s for batch-specific effects

Mathematical formulation:

Encoder: q(z|x, s)  - batch-aware encoding
Latent: z           - batch-corrected representation
Decoder: p(x|z, s)  - batch-specific decoding

Key insight: Batch info flows through decoder, not latent space

  • z captures biological variation
  • s explains technical variation
  • Separable biology and batch effects
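
A usage sketch with the scvi-tools API, assuming adata is an AnnData object holding raw counts with a "batch" column in adata.obs (column name is illustrative):

import scvi

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")  # register s
model = scvi.model.SCVI(adata)
model.train()
z = model.get_latent_representation()  # batch-corrected latent space from q(z|x, s)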

Deep Generative Modeling

Generative model: Learns p(x), the data distribution

Process:

  1. Sample latent variable: z ~ p(z) = N(0, I)
  2. Generate expression: x ~ p(x|z)
  3. Joint distribution: p(x, z) = p(x|z)p(z)

Benefits:

  • Generate synthetic cells
  • Impute missing values
  • Quantify uncertainty
  • Perform counterfactual predictions

Inference network: Inverts generative process

  • Given x, infer z
  • q(z|x) approximates true posterior p(z|x)

Model Architecture Details

scVI Architecture

Input: Gene expression counts x ∈ ℕ^G (G genes)

Encoder:

h = ReLU(W₁·x + b₁)
μ_z = W₂·h + b₂
log σ²_z = W₃·h + b₃
z ~ N(μ_z, σ²_z)

Latent space: z ∈ ℝ^d (typically d = 10-30)

Decoder:

h = ReLU(W₄·z + b₄)
μ = softmax(W₅·h + b₅) · library_size
θ = exp(W₆·h + b₆)
π = sigmoid(W₇·h + b₇)  # for ZINB
x ~ ZINB(μ, θ, π)

Loss function (ELBO):

L = E_q[log p(x|z)] - KL(q(z|x) || N(0,I))
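
A simplified PyTorch sketch of the encoder equations above (the real scvi-tools encoder additionally log-transforms the counts and adds batch conditioning, normalization, and dropout):

import torch.nn as nn

class Encoder(nn.Module):
    # Maps counts x to the mean and log-variance of q(z|x)
    def __init__(self, n_genes, n_hidden=128, n_latent=10):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
        self.mean = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)

    def forward(self, x):
        h = self.hidden(x)                    # h = ReLU(W1·x + b1)
        return self.mean(h), self.logvar(h)   # mu_z, log sigma^2_z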

Handling Covariates

Categorical covariates (batch, donor, etc.):

  • One-hot encoded: s ∈ {0,1}^K
  • Concatenate with latent: [z, s]
  • Or use conditional layers

Continuous covariates (library size, percent_mito):

  • Standardize to zero mean, unit variance
  • Include in encoder and/or decoder

Covariate injection strategies:

  • Concatenation: [z, s] fed to decoder
  • Deep injection: s added at multiple layers
  • Conditional batch norm: Batch-specific normalization
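
These options map directly onto setup_anndata arguments in scvi-tools (the column names are illustrative):

import scvi

scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="batch",
    categorical_covariate_keys=["donor"],
    continuous_covariate_keys=["percent_mito"],
)
model = scvi.model.SCVI(adata)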

Advanced Theoretical Concepts

Transfer Learning (scArches)

Concept: Use pretrained model as initialization for new data

Process:

  1. Train reference model on large dataset
  2. Freeze encoder parameters
  3. Fine-tune decoder on query data
  4. Or fine-tune all parameters with a lower learning rate

Why it works:

  • Encoder learns general cellular representations
  • Decoder adapts to query-specific characteristics
  • Prevents catastrophic forgetting

Applications:

  • Query-to-reference mapping
  • Few-shot learning for rare cell types
  • Rapid analysis of new datasets
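
A usage sketch of this workflow in recent scvi-tools versions, assuming a reference SCVI model was saved to the (illustrative) path "reference_model/":

import scvi

scvi.model.SCVI.prepare_query_anndata(adata_query, "reference_model/")
query_model = scvi.model.SCVI.load_query_data(adata_query, "reference_model/")
query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})  # surgery-style fine-tuning
z_query = query_model.get_latent_representation()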

Multi-Resolution Modeling (MrVI)

Idea: Separate shared and sample-specific variation

Latent space decomposition:

z = z_shared + z_sample
  • z_shared: Common across samples
  • z_sample: Sample-specific effects

Hierarchical structure:

Sample level: ρ_s ~ N(0, I)
Cell level: z_i ~ N(ρ_{s(i)}, σ²)

Benefits:

  • Disentangle biological sources of variation
  • Compare samples at different resolutions
  • Identify sample-specific cell states

Counterfactual Prediction

Goal: Predict outcome under different conditions

Example: "What would this cell look like if from different batch?"

Method:

  1. Encode cell to latent: z = Encoder(x, s_original)
  2. Decode with new condition: x_new = Decoder(z, s_new)
  3. x_new is counterfactual prediction
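
In scvi-tools, this counterfactual decoding is exposed through the transform_batch argument of expression queries; a usage sketch reusing a trained model (the batch name is illustrative):

# decode every cell as if it had been observed in batch "batch_B"
counterfactual = model.get_normalized_expression(adata, transform_batch="batch_B")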

Applications:

  • Batch effect assessment
  • Predicting treatment response
  • In silico perturbation studies

Posterior Predictive Distribution

Definition: Distribution of new data given observed data

p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz

Estimation: Sample z from q(z|x), generate x_new from p(x_new|z)
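
scvi-tools exposes this through posterior predictive sampling; a usage sketch (the number of samples is illustrative):

# draw 25 posterior predictive count matrices for the cells in adata
samples = model.posterior_predictive_sample(adata, n_samples=25)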

Uses:

  • Uncertainty quantification
  • Robust predictions
  • Outlier detection

Differential Expression Framework

Bayesian Approach

Traditional methods: Compare point estimates

  • Wilcoxon, t-test, etc.
  • Ignore uncertainty
  • Require pseudocounts

scvi-tools approach: Compare distributions

  • Sample from posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
  • Compute log fold-change: LFC = log(μ_B) - log(μ_A)
  • Posterior distribution of LFC quantifies uncertainty

Bayes Factor

Definition: Ratio of posterior odds to prior odds

BF = P(H₁|data) / P(H₀|data)
     ─────────────────────────
     P(H₁) / P(H₀)

Interpretation:

  • BF > 3: Moderate evidence for H₁
  • BF > 10: Strong evidence
  • BF > 100: Decisive evidence

In scvi-tools: Used to rank genes by evidence for DE

False Discovery Proportion (FDP)

Goal: Control expected false discovery rate

Procedure:

  1. For each gene, compute posterior probability of DE
  2. Rank genes by evidence (Bayes factor)
  3. Select top k genes such that E[FDP] ≤ α

Advantage over p-values:

  • Fully Bayesian
  • Natural for posterior inference
  • No arbitrary thresholds
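
The framework above (posterior LFC samples, Bayes factors, FDR control) is accessed through a single call; a usage sketch with illustrative group labels:

de_results = model.differential_expression(
    groupby="cell_type",
    group1="B cells",
    group2="T cells",
    fdr_target=0.05,
)
# returns a per-gene table including Bayes factors and posterior LFC summaries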

Implementation Details

Optimization

Optimizer: Adam (adaptive learning rates)

  • Default lr = 0.001
  • Momentum parameters: β₁=0.9, β₂=0.999

Training loop:

  1. Sample mini-batch of cells
  2. Compute ELBO loss
  3. Backpropagate gradients
  4. Update parameters with Adam
  5. Repeat until convergence

Convergence criteria:

  • ELBO plateaus on validation set
  • Early stopping prevents overfitting
  • Typically 200-500 epochs
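
A corresponding usage sketch (argument values are illustrative; early stopping usually halts well before max_epochs):

model.train(
    max_epochs=400,            # upper bound on training length
    early_stopping=True,       # monitor validation ELBO and stop at the plateau
    plan_kwargs={"lr": 1e-3},  # Adam learning rate
)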

Regularization

KL annealing: Gradually increase KL weight

  • Prevents posterior collapse
  • Starts at 0, increases to 1 over epochs
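
A minimal sketch of a linear warm-up schedule (the warm-up length is illustrative; scvi-tools controls this through training-plan options such as n_epochs_kl_warmup):

def kl_weight(epoch, n_warmup_epochs=400):
    # linearly increase the KL weight from 0 to 1 during warm-up, then hold at 1
    return min(1.0, epoch / n_warmup_epochs)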

Dropout: Random neuron dropping during training

  • Default: 0.1 dropout rate
  • Prevents overfitting
  • Improves generalization

Weight decay: L2 regularization on weights

  • Prevents large weights
  • Improves stability

Scalability

Mini-batch training:

  • Process subset of cells per iteration
  • Batch size: 64-256 cells
  • Enables scaling to millions of cells

Stochastic optimization:

  • Estimates ELBO on mini-batches
  • Unbiased gradient estimates
  • Converges to optimal solution

GPU acceleration:

  • Neural networks naturally parallelize
  • Order of magnitude speedup
  • Essential for large datasets
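
Both the mini-batch size and the hardware are chosen at training time; a usage sketch (in recent scvi-tools versions device selection is passed through to the underlying Lightning trainer):

model.train(batch_size=128, accelerator="gpu", devices=1)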

Connections to Other Methods

vs. PCA

  • PCA: Linear, deterministic
  • scVI: Nonlinear, probabilistic
  • Advantage: scVI captures complex structure, handles counts

vs. t-SNE/UMAP

  • t-SNE/UMAP: Visualization-focused
  • scVI: Full generative model
  • Advantage: scVI enables downstream tasks (DE, imputation)

vs. Seurat Integration

  • Seurat: Anchor-based alignment
  • scVI: Probabilistic modeling
  • Advantage: scVI provides uncertainty estimates and scales to many batches

vs. Harmony

  • Harmony: PCA + batch correction
  • scVI: VAE-based
  • Advantage: scVI handles counts natively, more flexible

Mathematical Notation

Common symbols:

  • x: Observed gene expression (counts)
  • z: Latent representation
  • θ: Model parameters
  • q(z|x): Approximate posterior (encoder)
  • p(x|z): Likelihood (decoder)
  • p(z): Prior on latent variables
  • μ, σ²: Mean and variance
  • π: Dropout probability (ZINB)
  • θ (in NB): Dispersion parameter
  • s: Batch/covariate indicator

Further Reading

Key Papers:

  1. Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
  2. Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics"
  3. Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"

Concepts to explore:

  • Variational inference in machine learning
  • Bayesian deep learning
  • Information theory (KL divergence, mutual information)
  • Generative models (GANs, normalizing flows, diffusion models)
  • Probabilistic programming (Pyro, built on PyTorch)

Mathematical background:

  • Probability theory and statistics
  • Linear algebra and calculus
  • Optimization theory
  • Information theory