gh-k-dense-ai-claude-scient…/skills/scvi-tools/references/models-specialized.md
2025-11-30 08:30:10 +08:00

Specialized Modality Models

This document covers models for specialized single-cell data modalities in scvi-tools.

MethylVI / MethylANVI (Methylation Analysis)

Purpose: Analysis of single-cell bisulfite sequencing (scBS-seq) data for DNA methylation.

Key Features:

  • Models methylation patterns at single-cell resolution
  • Handles sparsity in methylation data
  • Batch correction for methylation experiments
  • Label transfer (MethylANVI) for cell type annotation

When to Use:

  • Analyzing scBS-seq or similar methylation data
  • Studying DNA methylation patterns across cell types
  • Integrating methylation data across batches
  • Cell type annotation based on methylation profiles

Data Requirements:

  • Methylation count matrices (methylated vs. total reads per CpG site)
  • Format: Cells × CpG sites with methylation ratios or counts
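
The difference between counts and ratios matters for sparse data: a site with zero coverage is unobserved, not unmethylated. A minimal sketch of that arithmetic in plain Python follows; the function name and the example counts are hypothetical and not part of scvi-tools.

```python
def methylation_fraction(mc, cov):
    """Per-site methylation level: methylated reads / total reads.

    Returns None for sites with zero coverage, where the level is undefined.
    """
    if len(mc) != len(cov):
        raise ValueError("mc and cov must have the same length")
    fractions = []
    for m, c in zip(mc, cov):
        if m > c:
            raise ValueError("methylated reads cannot exceed coverage")
        fractions.append(m / c if c > 0 else None)
    return fractions


# One hypothetical cell measured at four CpG sites
mc_counts = [3, 0, 0, 5]   # methylated reads
cov_counts = [4, 2, 0, 5]  # total reads (coverage)
fractions = methylation_fraction(mc_counts, cov_counts)
```

The second and third sites both have zero methylated reads, but only the second is an observed unmethylated site; the third has no reads at all. Keeping both count matrices preserves that distinction, which a ratio matrix alone cannot.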

MethylVI (Unsupervised)

Basic Usage:

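import scvi

# MethylVI ships in scvi.external and is set up on a MuData object with one
# modality per methylation context. The modality name ("mCG") and layer names
# ("mc", "cov") are placeholders; check the signature in your installed version.
scvi.external.METHYLVI.setup_mudata(
    mdata,
    mc_layer="mc",    # methylated read counts
    cov_layer="cov",  # total read counts (coverage)
    methylation_contexts=["mCG"],
    batch_key="batch",
    modalities={"batch_key": "mCG"},  # read "batch" from the mCG modality's .obs
)

model = scvi.external.METHYLVI(mdata)
model.train()

# Get latent representation
latent = model.get_latent_representation()

# Get normalized methylation values
normalized_meth = model.get_normalized_methylation()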

MethylANVI (Semi-supervised with cell types)

Basic Usage:

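# Setup with cell type labels (same MuData layout as MethylVI;
# modality and layer names are placeholders)
scvi.external.METHYLANVI.setup_mudata(
    mdata,
    mc_layer="mc",
    cov_layer="cov",
    labels_key="cell_type",
    unlabeled_category="Unknown",
    methylation_contexts=["mCG"],
    batch_key="batch",
    modalities={"batch_key": "mCG", "labels_key": "mCG"},
)

model = scvi.external.METHYLANVI(mdata)
model.train()

# Predict cell types
predictions = model.predict()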

Key Parameters:

  • n_latent: Latent dimensionality
  • region_factors: Model region-specific effects

Use Cases:

  • Epigenetic heterogeneity analysis
  • Cell type identification via methylation
  • Integration with gene expression data (separate analysis)
  • Differential methylation analysis

CytoVI (Flow and Mass Cytometry)

Purpose: Batch correction and integration of flow cytometry and mass cytometry (CyTOF) data.

Key Features:

  • Handles antibody-based protein measurements
  • Corrects batch effects in cytometry data
  • Enables integration across experiments
  • Designed for high-dimensional protein panels

When to Use:

  • Analyzing flow cytometry or CyTOF data
  • Integrating cytometry experiments across batches
  • Batch correction for protein panels
  • Cross-study cytometry integration

Data Requirements:

  • Protein expression matrix (cells × proteins)
  • Flow cytometry or CyTOF measurements
  • Batch/experiment annotations

Basic Usage:

# CytoVI ships in scvi.external. The layer name is a placeholder for wherever
# the transformed marker intensities are stored; check the signature in your
# installed version.
scvi.external.CYTOVI.setup_anndata(
    adata,
    layer="scaled",
    batch_key="batch"
)

model = scvi.external.CYTOVI(adata)
model.train()

# Get batch-corrected representation
latent = model.get_latent_representation()

# Get normalized protein values
normalized = model.get_normalized_expression()

Key Parameters:

  • n_latent: Latent space dimensionality
  • n_layers: Network depth

Typical Workflow:

import scanpy as sc
import scvi

# 1. Load cytometry data
adata = sc.read_h5ad("cytof_data.h5ad")

# 2. Train CytoVI (layer name is a placeholder for the transformed intensities)
scvi.external.CYTOVI.setup_anndata(
    adata,
    layer="scaled",
    batch_key="experiment"
)
model = scvi.external.CYTOVI(adata)
model.train()

# 3. Get batch-corrected values
latent = model.get_latent_representation()
adata.obsm["X_CytoVI"] = latent

# 4. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_CytoVI")
sc.tl.umap(adata)
sc.tl.leiden(adata)

# 5. Visualize batch correction (color by the batch column used in setup)
sc.pl.umap(adata, color=["experiment", "leiden"])

SysVI (Systems-level Integration)

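Purpose: Integration of datasets separated by substantial batch effects (for example different species, technologies, or in vitro versus in vivo systems), with emphasis on preserving biological variation.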

Key Features:

  • Specialized batch integration approach
  • Preserves biological signals while removing technical effects
  • Designed for large-scale integration studies

When to Use:

  • Large-scale multi-batch integration
  • Need to preserve subtle biological variation
  • Systems-level analysis across many studies

Basic Usage:

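# SysVI ships in scvi.external and expects normalized, log-transformed
# expression (assumed here to be in adata.X) rather than raw counts
scvi.external.SysVI.setup_anndata(
    adata,
    batch_key="batch"
)

model = scvi.external.SysVI(adata)
model.train()

latent = model.get_latent_representation()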

Decipher (Trajectory Inference)

Purpose: Trajectory inference and pseudotime analysis for single-cell data.

Key Features:

  • Learns cellular trajectories and differentiation paths
  • Pseudotime estimation
  • Accounts for uncertainty in trajectory structure
  • Compatible with scVI embeddings

When to Use:

  • Studying cellular differentiation
  • Time-course or developmental datasets
  • Understanding cell state transitions
  • Identifying branching points in development

Basic Usage:

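# Decipher ships in scvi.external and is set up and trained on the
# expression data directly
scvi.external.Decipher.setup_anndata(adata)
decipher_model = scvi.external.Decipher(adata)
decipher_model.train()

# Latent representation in which trajectories are fitted
adata.obsm["X_decipher"] = decipher_model.get_latent_representation()

# Pseudotime is then computed along a trajectory fitted in this space and
# stored in adata.obs["pseudotime"]; that step depends on a user-defined
# trajectory and is not shown here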

Visualization:

import scanpy as sc

# Plot pseudotime on UMAP
sc.pl.umap(adata, color="pseudotime", cmap="viridis")

# Gene expression along pseudotime
sc.pl.scatter(adata, x="pseudotime", y="gene_of_interest")

peRegLM (Peak Regulatory Linear Model)

Purpose: Linking chromatin accessibility to gene expression for regulatory analysis.

Key Features:

  • Links ATAC-seq peaks to gene expression
  • Identifies regulatory relationships
  • Works with paired multiome data

When to Use:

  • Multiome data (RNA + ATAC from same cells)
  • Understanding gene regulation
  • Linking peaks to target genes
  • Regulatory network construction

Basic Usage:

# Requires paired RNA + ATAC data
scvi.model.PEREGLM.setup_anndata(
    multiome_adata,
    rna_layer="counts",
    atac_layer="atac_counts"
)

model = scvi.model.PEREGLM(multiome_adata)
model.train()

# Get peak-gene links
peak_gene_links = model.get_regulatory_links()

Model-Specific Best Practices

MethylVI/MethylANVI

  1. Sparsity: Methylation data is inherently sparse; model accounts for this
  2. CpG selection: Filter CpGs with very low coverage
  3. Biological interpretation: Consider genomic context (promoters, enhancers)
  4. Integration: For multi-omics, analyze separately then integrate results
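
The CpG selection step can be sketched as a simple coverage filter. This is an illustration in plain Python, not an scvi-tools function; the function name, the threshold, and the toy matrix are assumptions.

```python
def sites_with_enough_coverage(cov_matrix, min_cells=10):
    """Return indices of sites with at least one read in >= min_cells cells.

    cov_matrix is a list of rows, one row of per-site coverage per cell.
    """
    n_sites = len(cov_matrix[0]) if cov_matrix else 0
    keep = []
    for j in range(n_sites):
        n_covered = sum(1 for row in cov_matrix if row[j] > 0)
        if n_covered >= min_cells:
            keep.append(j)
    return keep


# Three hypothetical cells, three sites
cov = [
    [2, 0, 1],
    [1, 0, 0],
    [4, 1, 0],
]
kept = sites_with_enough_coverage(cov, min_cells=2)
```

Only the first site is covered in at least two cells here, so it is the only one retained. A suitable threshold depends on the dataset size and sequencing depth.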

CytoVI

  1. Protein QC: Remove low-quality or uninformative proteins
  2. Compensation: Ensure proper spectral compensation before analysis
  3. Batch design: Include biological and technical replicates
  4. Controls: Use control samples to validate batch correction
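
One way to put a number on the validation step is to check how often a cell's nearest neighbours in the latent space come from a different batch. The sketch below is a simple heuristic written for illustration, not a CytoVI function and not a substitute for established integration metrics; the function name and toy data are hypothetical.

```python
import math


def neighbor_batch_mixing(embedding, batches, k=5):
    """Mean fraction of each cell's k nearest neighbours from another batch."""
    n = len(embedding)
    if n != len(batches):
        raise ValueError("embedding and batches must have the same length")
    if not 0 < k < n:
        raise ValueError("k must be between 1 and n - 1")
    total = 0.0
    for i in range(n):
        # Brute-force neighbour search; fine for a small control sample
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(embedding[i], embedding[j]),
        )[:k]
        total += sum(batches[j] != batches[i] for j in neighbors) / k
    return total / n


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = neighbor_batch_mixing(points, ["a", "a", "b", "b"], k=1)
mixed = neighbor_batch_mixing(points, ["a", "b", "a", "b"], k=1)
```

For control samples expected to be biologically identical across batches, a score near zero suggests residual batch structure, while higher values indicate mixing. The score should be read alongside cell type separation, since over-correction also raises it.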

SysVI

  1. Sample size: Designed for large-scale integration
  2. Batch definition: Carefully define batch structure
  3. Biological validation: Verify biological signals preserved

Decipher

  1. Start point: Define trajectory start cells if known
  2. Branching: Specify expected number of branches
  3. Validation: Use known markers to validate pseudotime
  4. Integration: Works well with scVI embeddings
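
Marker-based validation can be sketched as a rank correlation between pseudotime and the expression of a gene with a known temporal trend. The code below is a self-contained illustration with hypothetical values; in practice `scipy.stats.spearmanr` does the same job.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for four cells
pseudotime = [0.1, 0.4, 0.5, 0.9]
late_marker = [0.0, 1.2, 3.1, 8.0]   # expected to rise along the trajectory
early_marker = [5.0, 2.0, 1.5, 0.1]  # expected to fall along the trajectory

rho_late = spearman(pseudotime, late_marker)
rho_early = spearman(pseudotime, early_marker)
```

A strongly positive correlation for late markers and a strongly negative one for early markers is consistent with a correctly oriented pseudotime; the opposite pattern suggests the trajectory is reversed.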

Integration with Other Models

Many specialized models work well in combination:

Methylation + Expression:

# Analyze separately, then integrate
methylvi_model = scvi.external.METHYLVI(meth_mdata)  # MethylVI ships in scvi.external and takes a MuData object
scvi_model = scvi.model.SCVI(rna_adata)

# Integrate results at analysis level
# E.g., correlate methylation and expression patterns
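
Because the methylation and expression cells are not paired, one simple form of analysis-level integration is to compare cell-type averages. The sketch below assumes both datasets carry comparable cell type labels; all names and values are hypothetical and it does not use scvi-tools.

```python
from collections import defaultdict


def group_means(values, groups):
    """Mean of values within each group label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-cell values for one gene from two separate experiments
meth_levels = [0.9, 0.8, 0.2, 0.1, 0.5, 0.5]  # normalized methylation
meth_types = ["T", "T", "B", "B", "NK", "NK"]
expr_levels = [1.0, 2.0, 8.0, 9.0, 5.0, 5.0]  # normalized expression
expr_types = ["T", "T", "B", "B", "NK", "NK"]

meth_mean = group_means(meth_levels, meth_types)
expr_mean = group_means(expr_levels, expr_types)

# Correlate across the cell types present in both datasets
shared = sorted(set(meth_mean) & set(expr_mean))
r = pearson([meth_mean[t] for t in shared], [expr_mean[t] for t in shared])
```

With only a handful of cell types the correlation is a coarse summary, so it is better suited to ranking candidate genes than to formal testing.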

Cytometry + CITE-seq:

# CytoVI for flow/CyTOF
cyto_model = scvi.external.CYTOVI(cyto_adata)  # CytoVI ships in scvi.external

# totalVI for CITE-seq
cite_model = scvi.model.TOTALVI(cite_adata)

# Compare protein measurements across platforms

ATAC + RNA (Multiome):

# MultiVI for joint analysis
multivi_model = scvi.model.MULTIVI(multiome_adata)

# peRegLM for regulatory links
pereglm_model = scvi.model.PEREGLM(multiome_adata)

Choosing Specialized Models

Decision Tree

  1. What data modality?

    • Methylation → MethylVI/MethylANVI
    • Flow/CyTOF → CytoVI
    • Trajectory → Decipher
    • Multi-batch integration → SysVI
    • Regulatory links → peRegLM
  2. Do you have labels?

    • Yes → MethylANVI (methylation)
    • No → MethylVI (methylation)
  3. What's your main goal?

    • Batch correction → CytoVI, SysVI
    • Trajectory/pseudotime → Decipher
    • Peak-gene links → peRegLM
    • Methylation patterns → MethylVI/ANVI
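
The decision tree can also be written as a small lookup, which is convenient in pipeline code that picks a model from configuration. This is only a restatement of the list above; the function name and the accepted keys are hypothetical.

```python
def choose_specialized_model(modality_or_goal, has_labels=False):
    """Map a data modality or analysis goal to a model name."""
    key = modality_or_goal.strip().lower()
    if key == "methylation":
        # Labels decide between the semi-supervised and unsupervised variant
        return "MethylANVI" if has_labels else "MethylVI"
    table = {
        "flow/cytof": "CytoVI",
        "trajectory": "Decipher",
        "multi-batch integration": "SysVI",
        "regulatory links": "peRegLM",
    }
    if key not in table:
        raise ValueError(f"no specialized model listed for {modality_or_goal!r}")
    return table[key]


choice = choose_specialized_model("Methylation", has_labels=True)
```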

Example: Complete Methylation Analysis

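import mudata
import numpy as np
import scanpy as sc
import scvi

# 1. Load methylation data (MuData with one modality per methylation context;
#    file, modality, and layer names below are placeholders)
mdata = mudata.read_h5mu("methylation_data.h5mu")

# 2. QC: keep CpG sites/regions covered in at least 10 cells
mcg = mdata["mCG"]
n_covered = np.asarray((mcg.layers["cov"] > 0).sum(axis=0)).ravel()
mdata.mod["mCG"] = mcg[:, n_covered >= 10].copy()
mdata.update()

# 3. Setup MethylVI (ships in scvi.external; check the signature in your version)
scvi.external.METHYLVI.setup_mudata(
    mdata,
    mc_layer="mc",    # methylated read counts
    cov_layer="cov",  # total read counts (coverage)
    methylation_contexts=["mCG"],
    batch_key="batch",
    modalities={"batch_key": "mCG"},
)

# 4. Train model
model = scvi.external.METHYLVI(mdata, n_latent=15)
model.train(max_epochs=400)

# 5. Get latent representation
latent = model.get_latent_representation()
mdata.obsm["X_MethylVI"] = latent

# 6. Clustering
sc.pp.neighbors(mdata, use_rep="X_MethylVI")
sc.tl.umap(mdata)
sc.tl.leiden(mdata)

# 7. Differential methylation
dm_results = model.differential_methylation(
    groupby="leiden",
    group1="0",
    group2="1"
)

# 8. Save
model.save("methylvi_model")
mdata.write("methylation_analyzed.h5mu")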

External Tools Integration

Some additional tools are available through scvi.external or as methods on the core models:

SOLO (doublet detection):

from scvi.external import SOLO

solo = SOLO.from_scvi_model(scvi_model)
solo.train()
doublets = solo.predict()

scArches (reference mapping):

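# scArches is built into scvi-tools models (no separate import):
# a trained reference model is extended with query data
scvi.model.SCVI.prepare_query_anndata(query_adata, scvi_model)
query_model = scvi.model.SCVI.load_query_data(query_adata, scvi_model)
query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})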

These external tools extend scvi-tools functionality for specific use cases.

Summary Table

| Model | Data Type | Primary Use | Supervised? |
|---|---|---|---|
| MethylVI | Methylation | Unsupervised analysis | No |
| MethylANVI | Methylation | Cell type annotation | Semi |
| CytoVI | Cytometry | Batch correction | No |
| SysVI | scRNA-seq | Large-scale integration | No |
| Decipher | scRNA-seq | Trajectory inference | No |
| peRegLM | Multiome | Peak-gene links | No |
| SOLO | scRNA-seq | Doublet detection | Semi |