10 KiB
Multimodal and Multi-omics Integration Models
This document covers models for joint analysis of multiple data modalities in scvi-tools.
totalVI (Total Variational Inference)
Purpose: Joint analysis of CITE-seq data (simultaneous RNA and protein measurements from same cells).
Key Features:
- Jointly models gene expression and protein abundance
- Learns shared low-dimensional representations
- Enables protein imputation from RNA data
- Performs differential expression for both modalities
- Handles batch effects in both RNA and protein layers
When to Use:
- Analyzing CITE-seq or REAP-seq data
- Joint RNA + surface protein measurements
- Imputing missing proteins
- Integrating protein and RNA information
- Multi-batch CITE-seq integration
Data Requirements:
- AnnData with gene expression in
.Xor a layer - Protein measurements in
.obsm["protein_expression"] - Same cells measured for both modalities
Basic Usage:
import scvi
# Setup data - specify both RNA and protein layers
scvi.model.TOTALVI.setup_anndata(
adata,
layer="counts", # RNA counts
protein_expression_obsm_key="protein_expression", # Protein counts
batch_key="batch"
)
# Train model
model = scvi.model.TOTALVI(adata)
model.train()
# Get joint latent representation
latent = model.get_latent_representation()
# Get normalized values for both modalities
rna_normalized = model.get_normalized_expression()
protein_normalized = model.get_normalized_expression(
transform_batch="batch1",
protein_expression=True
)
# Differential expression (works for both RNA and protein)
rna_de = model.differential_expression(groupby="cell_type")
protein_de = model.differential_expression(
groupby="cell_type",
protein_expression=True
)
Key Parameters:
n_latent: Latent space dimensionality (default: 20)n_layers_encoder: Number of encoder layers (default: 1)n_layers_decoder: Number of decoder layers (default: 1)protein_dispersion: Protein dispersion handling ("protein" or "protein-batch")empirical_protein_background_prior: Use empirical background for proteins
Advanced Features:
Protein Imputation:
# Impute missing proteins for RNA-only cells
# (useful for mapping RNA-seq to CITE-seq reference)
protein_foreground = model.get_protein_foreground_probability()
imputed_proteins = model.get_normalized_expression(
protein_expression=True,
n_samples=25
)
Denoising:
# Get denoised counts for both modalities
denoised_rna = model.get_normalized_expression(n_samples=25)
denoised_protein = model.get_normalized_expression(
protein_expression=True,
n_samples=25
)
Best Practices:
- Use empirical protein background prior for datasets with ambient protein
- Consider protein-specific dispersion for heterogeneous protein data
- Use joint latent space for clustering (better than RNA alone)
- Validate protein imputation with known markers
- Check protein QC metrics before training
MultiVI (Multi-modal Variational Inference)
Purpose: Integration of paired and unpaired multi-omic data (e.g., RNA + ATAC, paired and unpaired cells).
Key Features:
- Handles paired data (same cells) and unpaired data (different cells)
- Integrates multiple modalities: RNA, ATAC, proteins, etc.
- Missing modality imputation
- Learns shared representations across modalities
- Flexible integration strategy
When to Use:
- 10x Multiome data (paired RNA + ATAC)
- Integrating separate RNA-seq and ATAC-seq experiments
- Some cells with both modalities, some with only one
- Cross-modality imputation tasks
Data Requirements:
- AnnData with multiple modalities
- Modality indicators (which measurements each cell has)
- Can handle:
- All cells with both modalities (fully paired)
- Mix of paired and unpaired cells
- Completely unpaired datasets
Basic Usage:
# Prepare data with modality information
# adata.X should contain all features (genes + peaks)
# adata.var["modality"] indicates "Gene" or "Peak"
# adata.obs["modality"] indicates which modality each cell has
scvi.model.MULTIVI.setup_anndata(
adata,
batch_key="batch",
modality_key="modality" # Column indicating cell modality
)
model = scvi.model.MULTIVI(adata)
model.train()
# Get joint latent representation
latent = model.get_latent_representation()
# Impute missing modalities
# E.g., predict ATAC for RNA-only cells
imputed_accessibility = model.get_accessibility_estimates(
indices=rna_only_indices
)
# Get normalized expression/accessibility
rna_normalized = model.get_normalized_expression()
atac_normalized = model.get_accessibility_estimates()
Key Parameters:
n_genes: Number of gene featuresn_regions: Number of accessibility regionsn_latent: Latent dimensionality (default: 20)
Integration Scenarios:
Scenario 1: Fully Paired (10x Multiome):
# All cells have both RNA and ATAC
# Single modality key: "paired"
adata.obs["modality"] = "paired"
Scenario 2: Partially Paired:
# Some cells have both, some RNA-only, some ATAC-only
adata.obs["modality"] = ["RNA+ATAC", "RNA", "ATAC", ...]
Scenario 3: Completely Unpaired:
# Separate RNA and ATAC experiments
adata.obs["modality"] = ["RNA"] * n_rna + ["ATAC"] * n_atac
Advanced Use Cases:
Cross-Modality Prediction:
# Predict peaks from gene expression
accessibility_from_rna = model.get_accessibility_estimates(
indices=rna_only_cells
)
# Predict genes from accessibility
expression_from_atac = model.get_normalized_expression(
indices=atac_only_cells
)
Modality-Specific Analysis:
# Separate analysis per modality
rna_subset = adata[adata.obs["modality"].str.contains("RNA")]
atac_subset = adata[adata.obs["modality"].str.contains("ATAC")]
MrVI (Multi-resolution Variational Inference)
Purpose: Multi-sample analysis accounting for sample-specific and shared variation.
Key Features:
- Simultaneously analyzes multiple samples/conditions
- Decomposes variation into:
- Shared variation (common across samples)
- Sample-specific variation
- Enables sample-level comparisons
- Identifies sample-specific cell states
When to Use:
- Comparing multiple biological samples or conditions
- Identifying sample-specific vs. shared cell states
- Disease vs. healthy sample comparisons
- Understanding inter-sample heterogeneity
- Multi-donor studies
Basic Usage:
scvi.model.MRVI.setup_anndata(
adata,
layer="counts",
batch_key="batch",
sample_key="sample" # Critical: defines biological samples
)
model = scvi.model.MRVI(adata, n_latent=10, n_latent_sample=5)
model.train()
# Get representations
shared_latent = model.get_latent_representation() # Shared across samples
sample_specific = model.get_sample_specific_representation()
# Sample distance matrix
sample_distances = model.get_sample_distances()
Key Parameters:
n_latent: Dimensionality of shared latent spacen_latent_sample: Dimensionality of sample-specific spacesample_key: Column defining biological samples
Analysis Workflow:
# 1. Identify shared cell types across samples
sc.pp.neighbors(adata, use_rep="X_MrVI_shared")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="shared_clusters")
# 2. Analyze sample-specific variation
sample_repr = model.get_sample_specific_representation()
# 3. Compare samples
distances = model.get_sample_distances()
# 4. Find sample-enriched genes
de_results = model.differential_expression(
groupby="sample",
group1="Disease",
group2="Healthy"
)
Use Cases:
- Multi-donor studies: Separate donor effects from cell type variation
- Disease studies: Identify disease-specific vs. shared biology
- Time series: Separate temporal from stable variation
- Batch + biology: Disentangle technical and biological variation
totalVI vs. MultiVI vs. MrVI: When to Use Which?
totalVI
Use for: CITE-seq (RNA + protein, same cells)
- Paired measurements
- Single modality type per feature
- Focus: protein imputation, joint analysis
MultiVI
Use for: Multiple modalities (RNA + ATAC, etc.)
- Paired, unpaired, or mixed
- Different feature types
- Focus: cross-modality integration and imputation
MrVI
Use for: Multi-sample RNA-seq
- Single modality (RNA)
- Multiple biological samples
- Focus: sample-level variation decomposition
Integration Best Practices
For CITE-seq (totalVI)
- Quality control proteins: Remove low-quality antibodies
- Background subtraction: Use empirical background prior
- Joint clustering: Use joint latent space, not RNA alone
- Validation: Check known markers in both modalities
For Multiome/Multi-modal (MultiVI)
- Feature filtering: Filter genes and peaks independently
- Balance modalities: Ensure reasonable representation of each
- Modality weights: Consider if one modality dominates
- Imputation validation: Validate imputed values carefully
For Multi-sample (MrVI)
- Sample definition: Carefully define biological samples
- Sample size: Need sufficient cells per sample
- Covariate handling: Properly account for batch vs. sample
- Interpretation: Distinguish technical from biological variation
Complete Example: CITE-seq Analysis with totalVI
import scvi
import scanpy as sc
# 1. Load CITE-seq data
adata = sc.read_h5ad("cite_seq.h5ad")
# 2. QC and filtering
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.highly_variable_genes(adata, n_top_genes=4000)
# Protein QC
protein_counts = adata.obsm["protein_expression"]
# Remove low-quality proteins
# 3. Setup totalVI
scvi.model.TOTALVI.setup_anndata(
adata,
layer="counts",
protein_expression_obsm_key="protein_expression",
batch_key="batch"
)
# 4. Train
model = scvi.model.TOTALVI(adata, n_latent=20)
model.train(max_epochs=400)
# 5. Extract joint representation
latent = model.get_latent_representation()
adata.obsm["X_totalVI"] = latent
# 6. Clustering on joint space
sc.pp.neighbors(adata, use_rep="X_totalVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
# 7. Differential expression for both modalities
rna_de = model.differential_expression(
groupby="leiden",
group1="0",
group2="1"
)
protein_de = model.differential_expression(
groupby="leiden",
group1="0",
group2="1",
protein_expression=True
)
# 8. Save model
model.save("totalvi_model")