# Specialized Modality Models
This document covers models for specialized single-cell data modalities in scvi-tools.
## MethylVI / MethylANVI (Methylation Analysis)
**Purpose**: Analysis of single-cell bisulfite sequencing (scBS-seq) data for DNA methylation.
**Key Features**:
- Models methylation patterns at single-cell resolution
- Handles sparsity in methylation data
- Batch correction for methylation experiments
- Label transfer (MethylANVI) for cell type annotation
**When to Use**:
- Analyzing scBS-seq or similar methylation data
- Studying DNA methylation patterns across cell types
- Integrating methylation data across batches
- Cell type annotation based on methylation profiles
**Data Requirements**:
- Methylation count matrices: methylated reads and total reads (coverage) per feature
- Format: Cells × features (CpG sites or genomic regions), stored as raw counts rather than ratios so that coverage stays available to the model
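A minimal sketch of why both count matrices matter, using hypothetical toy values. A ratio of 1/1 and a ratio of 4/8 rest on very different amounts of evidence, and a site with zero coverage has no ratio at all. The function name and data are illustrative only and are not part of scvi-tools.

```python
# Hypothetical toy data: rows are cells, columns are CpG sites or regions.
methylated = [
    [3, 0, 1],
    [0, 0, 4],
]
coverage = [
    [4, 0, 1],
    [2, 0, 8],
]


def methylation_ratio(mc, cov):
    """Return methylated/total per entry, or None where no reads cover the site."""
    return [
        [m / c if c > 0 else None for m, c in zip(mc_row, cov_row)]
        for mc_row, cov_row in zip(mc, cov)
    ]


ratios = methylation_ratio(methylated, coverage)
for row in ratios:
    print(row)
```

The `None` entries mark missing observations, which is different from an observed ratio of zero; collapsing to ratios alone would also hide that the second cell's last site is supported by eight reads while the first cell's is supported by one.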
### MethylVI (Unsupervised)
**Basic Usage**:
```python
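from scvi.external import METHYLVI

# MethylVI is set up on a MuData object with one modality per methylation
# context. The modality names ("mCG", "mCH") and layer names ("mc", "cov")
# are assumptions about how `mdata` was built.
METHYLVI.setup_mudata(
    mdata,
    mc_layer="mc",    # methylated read counts
    cov_layer="cov",  # total read counts (coverage)
    methylation_contexts=["mCG", "mCH"],
    batch_key="batch",
    modalities={"batch_key": "mCG"},  # modality whose .obs holds the batch column
)

model = METHYLVI(mdata)
model.train()

# Get latent representation
latent = model.get_latent_representation()

# Get normalized methylation values
normalized_meth = model.get_normalized_methylation()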
```
### MethylANVI (Semi-supervised with cell types)
**Basic Usage**:
```python
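from scvi.external import METHYLANVI

# Setup with cell type labels. Assumes `mdata` is a MuData object with
# "mCG"/"mCH" modalities, each holding "mc" and "cov" count layers.
METHYLANVI.setup_mudata(
    mdata,
    mc_layer="mc",
    cov_layer="cov",
    labels_key="cell_type",
    unlabeled_category="Unknown",
    methylation_contexts=["mCG", "mCH"],
    batch_key="batch",
    modalities={"batch_key": "mCG", "labels_key": "mCG"},
)

model = METHYLANVI(mdata)
model.train()

# Predict cell types
predictions = model.predict()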
```
**Key Parameters**:
- `n_latent`: Latent dimensionality
- `dispersion`: Whether overdispersion is learned per region (`"region"`) or per region and cell (`"region-cell"`)
**Use Cases**:
- Epigenetic heterogeneity analysis
- Cell type identification via methylation
- Integration with gene expression data (separate analysis)
- Differential methylation analysis
## CytoVI (Flow and Mass Cytometry)
**Purpose**: Batch correction and integration of flow cytometry and mass cytometry (CyTOF) data.
**Key Features**:
- Handles antibody-based protein measurements
- Corrects batch effects in cytometry data
- Enables integration across experiments
- Designed for high-dimensional protein panels
**When to Use**:
- Analyzing flow cytometry or CyTOF data
- Integrating cytometry experiments across batches
- Batch correction for protein panels
- Cross-study cytometry integration
**Data Requirements**:
- Protein expression matrix (cells × proteins)
- Flow cytometry or CyTOF measurements
- Batch/experiment annotations
**Basic Usage**:
```python
from scvi.external import CYTOVI

# The layer name is an assumption: it should hold the transformed and
# scaled marker intensities for each cell.
CYTOVI.setup_anndata(
    adata,
    layer="scaled",
    batch_key="batch",
)

model = CYTOVI(adata)
model.train()

# Get batch-corrected representation
latent = model.get_latent_representation()

# Get normalized protein values
normalized = model.get_normalized_expression()
```
**Key Parameters**:
- `n_latent`: Latent space dimensionality
- `n_layers`: Network depth
**Typical Workflow**:
```python
import scanpy as sc
from scvi.external import CYTOVI

# 1. Load cytometry data
adata = sc.read_h5ad("cytof_data.h5ad")

# 2. Train CytoVI (the layer name is an assumption about the input file)
CYTOVI.setup_anndata(
    adata,
    layer="scaled",
    batch_key="experiment",
)
model = CYTOVI(adata)
model.train()

# 3. Get batch-corrected values
latent = model.get_latent_representation()
adata.obsm["X_CytoVI"] = latent

# 4. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_CytoVI")
sc.tl.umap(adata)
sc.tl.leiden(adata)

# 5. Visualize batch correction (color by the column used as batch_key)
sc.pl.umap(adata, color=["experiment", "leiden"])
```
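## SysVI (Cross-System Integration)
**Purpose**: Integration of datasets separated by substantial batch effects (for example across species, organoid versus tissue, or single-cell versus single-nucleus protocols), with emphasis on preserving biological variation.
**Key Features**:
- Conditional VAE that combines a VampPrior with a latent cycle-consistency loss
- Preserves biological signals while removing technical effects
- Designed for large-scale integration studies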
**When to Use**:
- Large-scale multi-batch integration
- Need to preserve subtle biological variation
- Systems-level analysis across many studies
**Basic Usage**:
```python
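from scvi.external import SysVI

# SysVI expects normalized, log-transformed expression rather than raw counts;
# the layer name is an assumption about where that matrix is stored.
SysVI.setup_anndata(
    adata,
    layer="lognorm",
    batch_key="batch",
)

model = SysVI(adata)
model.train()

latent = model.get_latent_representation()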
```
## Decipher (Trajectory Inference)
**Purpose**: Trajectory inference and pseudotime analysis for single-cell data.
**Key Features**:
- Learns cellular trajectories and differentiation paths
- Pseudotime estimation
- Accounts for uncertainty in trajectory structure
- Compatible with scVI embeddings
**When to Use**:
- Studying cellular differentiation
- Time-course or developmental datasets
- Understanding cell state transitions
- Identifying branching points in development
**Basic Usage**:
```python
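import scvi
from scvi.external import Decipher

# Typically used after scVI for embeddings (the "counts" layer is an assumption)
scvi.model.SCVI.setup_anndata(adata, layer="counts")
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()

# Decipher for trajectory. Decipher is exposed under scvi.external; the
# constructor arguments and get_pseudotime() are kept as written in this
# reference and should be checked against the installed Decipher API.
Decipher.setup_anndata(adata)
decipher_model = Decipher(adata, scvi_model)
decipher_model.train()

# Get pseudotime
pseudotime = decipher_model.get_pseudotime()
adata.obs["pseudotime"] = pseudotime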
```
**Visualization**:
```python
import scanpy as sc
# Plot pseudotime on UMAP
sc.pl.umap(adata, color="pseudotime", cmap="viridis")
# Gene expression along pseudotime
sc.pl.scatter(adata, x="pseudotime", y="gene_of_interest")
```
## peRegLM (Peak Regulatory Linear Model)
**Purpose**: Linking chromatin accessibility to gene expression for regulatory analysis.
**Key Features**:
- Links ATAC-seq peaks to gene expression
- Identifies regulatory relationships
- Works with paired multiome data
**When to Use**:
- Multiome data (RNA + ATAC from same cells)
- Understanding gene regulation
- Linking peaks to target genes
- Regulatory network construction
**Basic Usage**:
```python
# Requires paired RNA + ATAC data
scvi.model.PEREGLM.setup_anndata(
    multiome_adata,
    rna_layer="counts",
    atac_layer="atac_counts",
)

model = scvi.model.PEREGLM(multiome_adata)
model.train()

# Get peak-gene links
peak_gene_links = model.get_regulatory_links()
```
## Model-Specific Best Practices
### MethylVI/MethylANVI
1. **Sparsity**: Methylation data is inherently sparse; model accounts for this
2. **CpG selection**: Filter CpGs with very low coverage
3. **Biological interpretation**: Consider genomic context (promoters, enhancers)
4. **Integration**: For multi-omics, analyze separately then integrate results
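The CpG selection advice above can be sketched as a simple coverage filter. This is a plain-Python illustration on hypothetical toy data; the function name and thresholds are assumptions, not scvi-tools API, and real matrices would normally be filtered with array operations instead.

```python
def keep_well_covered_sites(coverage, min_cells=2, min_reads=1):
    """Return indices of sites with >= min_reads reads in >= min_cells cells."""
    n_sites = len(coverage[0])
    kept = []
    for site in range(n_sites):
        cells_covered = sum(1 for row in coverage if row[site] >= min_reads)
        if cells_covered >= min_cells:
            kept.append(site)
    return kept


# Hypothetical coverage matrix: rows are cells, columns are CpG sites.
coverage = [
    [4, 0, 1, 0],
    [2, 0, 8, 0],
    [0, 1, 3, 0],
]
kept_sites = keep_well_covered_sites(coverage, min_cells=2)
print(kept_sites)
```

The same indices should be applied to both the methylated-count and coverage matrices so the two stay aligned.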
### CytoVI
1. **Protein QC**: Remove low-quality or uninformative proteins
2. **Compensation**: Ensure proper spectral compensation before analysis
3. **Batch design**: Include biological and technical replicates
4. **Controls**: Use control samples to validate batch correction
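One way to check the batch design point before training is to look for conditions that were acquired in only a single batch, since their biology cannot be separated from that batch's technical effect. The sketch below uses a hypothetical sample sheet; the function and field names are illustrative.

```python
from collections import defaultdict


def single_batch_conditions(samples):
    """Return conditions observed in only one batch (confounded with batch)."""
    batches_by_condition = defaultdict(set)
    for batch, condition in samples:
        batches_by_condition[condition].add(batch)
    return sorted(
        condition
        for condition, batches in batches_by_condition.items()
        if len(batches) == 1
    )


# Hypothetical sample sheet: one (batch, condition) pair per acquired sample.
sample_sheet = [
    ("run1", "healthy"),
    ("run1", "treated"),
    ("run2", "healthy"),
    ("run3", "relapse"),
]
confounded = single_batch_conditions(sample_sheet)
print(confounded)
```

Conditions returned here are candidates for an extra replicate or a shared control sample in another run.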
### SysVI
1. **Sample size**: Designed for large-scale integration
2. **Batch definition**: Carefully define batch structure
3. **Biological validation**: Verify biological signals preserved
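A rough way to verify that biological signal survives integration is to compare how often a cell's nearest neighbors in the latent space share its cell type versus its batch. The sketch below is a small pure-Python illustration on hypothetical coordinates, not a metric provided by scvi-tools; for real data an established benchmarking package would be more appropriate.

```python
from math import dist


def neighbor_label_agreement(latent, labels, k=2):
    """Mean fraction of each cell's k nearest neighbors that share its label."""
    scores = []
    for i, point in enumerate(latent):
        others = sorted(
            (j for j in range(len(latent)) if j != i),
            key=lambda j: dist(point, latent[j]),
        )
        nearest = others[:k]
        scores.append(sum(labels[j] == labels[i] for j in nearest) / k)
    return sum(scores) / len(scores)


# Hypothetical 2-D latent coordinates for six cells from two batches.
latent = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cell_types = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]
batches = ["A", "B", "A", "B", "A", "B"]

type_agreement = neighbor_label_agreement(latent, cell_types)
batch_agreement = neighbor_label_agreement(latent, batches)
print(type_agreement, batch_agreement)
```

High agreement for cell types together with low agreement for batches suggests that batches are mixed while cell identities stay separated.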
### Decipher
1. **Start point**: Define trajectory start cells if known
2. **Branching**: Specify expected number of branches
3. **Validation**: Use known markers to validate pseudotime
4. **Integration**: Works well with scVI embeddings
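Validating pseudotime with known markers can be as simple as a rank correlation between the pseudotime values and a marker's expression. The sketch below implements Spearman's correlation in plain Python on hypothetical values; in practice `scipy.stats.spearmanr` would be the usual choice.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five cells: a marker expected to rise late.
pseudotime = [0.05, 0.20, 0.40, 0.65, 0.90]
late_marker = [0.1, 0.3, 0.2, 1.5, 2.4]
rho = spearman(pseudotime, late_marker)
print(round(rho, 3))
```

A strongly positive value for late markers and a strongly negative value for early markers is consistent with a correctly oriented pseudotime.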
## Integration with Other Models
Many specialized models work well in combination:
**Methylation + Expression**:
```python
# Analyze separately, then integrate
methylvi_model = scvi.external.METHYLVI(meth_mdata)  # after METHYLVI.setup_mudata
scvi_model = scvi.model.SCVI(rna_adata)  # after SCVI.setup_anndata
# Integrate results at analysis level
# E.g., correlate methylation and expression patterns
```
**Cytometry + CITE-seq**:
```python
# CytoVI for flow/CyTOF
cyto_model = scvi.external.CYTOVI(cyto_adata)
# totalVI for CITE-seq
cite_model = scvi.model.TOTALVI(cite_adata)
# Compare protein measurements across platforms
```
**ATAC + RNA (Multiome)**:
```python
# MultiVI for joint analysis
multivi_model = scvi.model.MULTIVI(multiome_adata)
# peRegLM for regulatory links
pereglm_model = scvi.model.PEREGLM(multiome_adata)
```
## Choosing Specialized Models
### Decision Tree
1. **What data modality?**
   - Methylation → MethylVI/MethylANVI
   - Flow/CyTOF → CytoVI
   - Trajectory → Decipher
   - Multi-batch integration → SysVI
   - Regulatory links → peRegLM
2. **Do you have labels?**
   - Yes → MethylANVI (methylation)
   - No → MethylVI (methylation)
3. **What's your main goal?**
   - Batch correction → CytoVI, SysVI
   - Trajectory/pseudotime → Decipher
   - Peak-gene links → peRegLM
   - Methylation patterns → MethylVI/ANVI
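The decision tree above can be written down as a small lookup, which is handy when choosing a model from a pipeline configuration. The function name and the task strings are hypothetical; only the model names come from this document.

```python
def choose_specialized_model(task, has_labels=False):
    """Map a task description to the model name suggested by the decision tree."""
    task = task.lower()
    if task == "methylation":
        # Labels decide between the semi-supervised and unsupervised variant.
        return "MethylANVI" if has_labels else "MethylVI"
    suggestions = {
        "cytometry": "CytoVI",
        "trajectory": "Decipher",
        "multi-batch integration": "SysVI",
        "regulatory links": "peRegLM",
    }
    if task not in suggestions:
        raise ValueError(f"No specialized model listed for task: {task!r}")
    return suggestions[task]


print(choose_specialized_model("methylation", has_labels=True))
print(choose_specialized_model("Cytometry"))
```

Tasks that are not listed raise a `ValueError` rather than silently falling back to a default model.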
## Example: Complete Methylation Analysis
```python
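import anndata as ad
import mudata as md
import numpy as np
import scanpy as sc
from scvi.external import METHYLVI

# Assumed layout: a MuData file with one modality per methylation context,
# each holding methylated counts ("mc") and coverage ("cov") as layers.
contexts = ["mCG", "mCH"]

# 1. Load methylation data
mdata = md.read_h5mu("methylation_data.h5mu")

# 2. QC: keep features covered in at least 10 cells
for context in contexts:
    mod = mdata[context]
    cells_covered = np.asarray((mod.layers["cov"] > 0).sum(axis=0)).ravel()
    mdata.mod[context] = mod[:, cells_covered >= 10].copy()
mdata.update()

# 3. Setup MethylVI
METHYLVI.setup_mudata(
    mdata,
    mc_layer="mc",
    cov_layer="cov",
    methylation_contexts=contexts,
    batch_key="batch",
    modalities={"batch_key": "mCG"},
)

# 4. Train model
model = METHYLVI(mdata, n_latent=15)
model.train(max_epochs=400)

# 5. Get latent representation
latent = model.get_latent_representation()
mdata.obsm["X_MethylVI"] = latent

# 6. Clustering on the latent space (via a lightweight AnnData wrapper)
latent_adata = ad.AnnData(X=latent)
latent_adata.obs_names = mdata.obs_names
sc.pp.neighbors(latent_adata, use_rep="X")
sc.tl.umap(latent_adata)
sc.tl.leiden(latent_adata)
mdata.obs["leiden"] = latent_adata.obs["leiden"].values

# 7. Differential methylation
dm_results = model.differential_methylation(
    groupby="leiden",
    group1="0",
    group2="1",
)

# 8. Save
model.save("methylvi_model")
mdata.write("methylation_analyzed.h5mu")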
```
## External Tools Integration
Some additional tools ship alongside the core models, for example in the `scvi.external` namespace:
**SOLO** (doublet detection):
```python
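from scvi.external import SOLO

# scvi_model is a trained scvi.model.SCVI instance
solo = SOLO.from_scvi_model(scvi_model)
solo.train()

# Per-cell doublet/singlet scores; pass soft=False for hard labels
doublets = solo.predict()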
```
**scArches** (reference mapping):
```python
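# scArches-style query mapping is built into the models; ref_model is a trained SCVI reference
scvi.model.SCVI.prepare_query_anndata(query_adata, ref_model)
query_model = scvi.model.SCVI.load_query_data(query_adata, ref_model)
query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})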
```
These external tools extend scvi-tools functionality for specific use cases.
## Summary Table
| Model | Data Type | Primary Use | Supervised? |
|-------|-----------|-------------|-------------|
| MethylVI | Methylation | Unsupervised analysis | No |
| MethylANVI | Methylation | Cell type annotation | Semi |
| CytoVI | Cytometry | Batch correction | No |
| SysVI | scRNA-seq | Large-scale integration | No |
| Decipher | scRNA-seq | Trajectory inference | No |
| peRegLM | Multiome | Peak-gene links | No |
| SOLO | scRNA-seq | Doublet detection | Semi |