Initial commit
330
skills/scvi-tools/references/models-scrna-seq.md
Normal file
@@ -0,0 +1,330 @@
# Single-Cell RNA-seq Models

This document covers the core scvi-tools models for analyzing single-cell RNA sequencing (scRNA-seq) data.

## scVI (Single-Cell Variational Inference)

**Purpose**: Unsupervised analysis, dimensionality reduction, and batch correction for scRNA-seq data.

**Key Features**:
- Deep generative model based on variational autoencoders (VAE)
- Learns low-dimensional latent representations that capture biological variation
- Corrects for batch effects and other technical covariates registered in `setup_anndata`
- Enables normalized gene expression estimation
- Supports differential expression analysis

**When to Use**:
- Initial exploration and dimensionality reduction of scRNA-seq datasets
- Integrating multiple batches or studies
- Generating batch-corrected expression matrices
- Performing probabilistic differential expression analysis

**Basic Usage**:
```python
import scvi

# Setup data: register the raw-count layer and the batch column
# (assumes `adata` is an AnnData object with raw counts in adata.layers["counts"])
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch"
)

# Train model
model = scvi.model.SCVI(adata, n_latent=30)
model.train()

# Extract results
latent = model.get_latent_representation()
normalized = model.get_normalized_expression()
```

**Key Parameters**:
- `n_latent`: Dimensionality of latent space (default: 10)
- `n_layers`: Number of hidden layers (default: 1)
- `n_hidden`: Number of nodes per hidden layer (default: 128)
- `dropout_rate`: Dropout rate for neural networks (default: 0.1)
- `dispersion`: How the dispersion parameter is shared: per gene (`"gene"`), per gene and batch (`"gene-batch"`), per gene and label (`"gene-label"`), or per gene and cell (`"gene-cell"`)
- `gene_likelihood`: Count distribution for the data (`"zinb"`, `"nb"`, or `"poisson"`)
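
To make the `gene_likelihood` options concrete, the following is a minimal pure-Python sketch of the three count distributions, using the common mean and inverse-dispersion parameterization of the negative binomial. The function names and the parameters `mu`, `theta`, and `pi` are illustrative assumptions, not the scvi-tools API, and the library's internal implementation may differ.

```python
import math

def poisson_log_pmf(x, mu):
    """Log-probability of count x under a Poisson with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu and inverse dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability under a zero-inflated negative binomial:
    with probability pi the count is an extra (inflated) zero."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    zero_mass = pi if x == 0 else 0.0
    return math.log(zero_mass + (1.0 - pi) * nb)

# Zero inflation raises the probability of observing a zero count
p_zero_nb = math.exp(nb_log_pmf(0, mu=2.0, theta=1.5))
p_zero_zinb = math.exp(zinb_log_pmf(0, mu=2.0, theta=1.5, pi=0.3))
```

Comparing `p_zero_nb` with `p_zero_zinb` shows how the zero-inflation component adds probability mass at zero; whether that extra component helps on a given dataset is an empirical question.
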

**Outputs**:
- `get_latent_representation()`: Batch-corrected low-dimensional embeddings
- `get_normalized_expression()`: Denoised, normalized expression values
- `differential_expression()`: Probabilistic DE testing between groups
- `get_feature_correlation_matrix()`: Gene-gene correlation estimates

## scANVI (Single-Cell ANnotation using Variational Inference)

**Purpose**: Semi-supervised cell type annotation and integration using labeled and unlabeled cells.

**Key Features**:
- Extends scVI with cell type labels
- Leverages partially labeled datasets for annotation transfer
- Performs simultaneous batch correction and cell type prediction
- Enables query-to-reference mapping

**When to Use**:
- Annotating new datasets using reference labels
- Transfer learning from well-annotated to unlabeled datasets
- Joint analysis of labeled and unlabeled cells
- Building cell type classifiers with uncertainty quantification

**Basic Usage**:
```python
import scvi

# Option 1: Train from scratch
scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type",
    unlabeled_category="Unknown"
)
model = scvi.model.SCANVI(adata)
model.train()

# Option 2: Initialize from pretrained scVI
# (register labels_key with scVI so that scANVI can reuse it)
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type"
)
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()
scanvi_model = scvi.model.SCANVI.from_scvi_model(
    scvi_model,
    unlabeled_category="Unknown"
)
scanvi_model.train()

# Predict cell types
predictions = scanvi_model.predict()
```

**Key Parameters**:
- `labels_key`: Column in `adata.obs` containing cell type labels
- `unlabeled_category`: Value in the `labels_key` column that marks cells without annotations
- All scVI parameters are also available

**Outputs**:
- `predict()`: Cell type predictions for all cells
- `predict(soft=True)`: Per-cell prediction probabilities for each cell type
- `get_latent_representation()`: Cell type-aware latent space
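
One common way to use the prediction probabilities is to keep a predicted label only when the classifier is confident. The sketch below illustrates that idea in plain Python; `assign_labels`, the 0.8 threshold, and the example probabilities are hypothetical and not part of scvi-tools.

```python
def assign_labels(probabilities, threshold=0.8, unlabeled="Unknown"):
    """Map each cell to its most probable type, or to `unlabeled`
    when the top probability falls below `threshold`.

    probabilities: dict of cell ID -> dict of cell type -> probability.
    """
    labels = {}
    for cell, probs in probabilities.items():
        best_type = max(probs, key=probs.get)
        labels[cell] = best_type if probs[best_type] >= threshold else unlabeled
    return labels

# Hypothetical soft predictions for two cells
soft = {
    "cell_1": {"T cell": 0.95, "B cell": 0.05},
    "cell_2": {"T cell": 0.55, "B cell": 0.45},
}
labels = assign_labels(soft)
```

Here `cell_2` stays unlabeled because neither class reaches the threshold. The right cutoff depends on the dataset and on how costly a wrong annotation is.
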

## AUTOZI

**Purpose**: Automatic identification and modeling of zero-inflated genes in scRNA-seq data.

**Key Features**:
- Distinguishes biological zeros from technical dropout
- Learns which genes exhibit zero-inflation
- Provides gene-specific zero-inflation probabilities
- Improves downstream analysis by accounting for dropout

**When to Use**:
- Detecting which genes are affected by technical dropout
- Improving imputation and normalization for sparse datasets
- Understanding the extent of zero-inflation in your data

**Basic Usage**:
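```python
import scvi

scvi.model.AUTOZI.setup_anndata(adata, layer="counts")
model = scvi.model.AUTOZI(adata)
model.train()

# Get the per-gene Beta posterior parameters of the zero-inflation mixture weight.
# This is a dict of parameters, not probabilities; convert them with a Beta CDF.
zi_params = model.get_alphas_betas()
alpha_posterior = zi_params["alpha_posterior"]
beta_posterior = zi_params["beta_posterior"]
```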
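
The Beta posterior parameters can be turned into a per-gene zero-inflation probability by evaluating the Beta CDF at 0.5. The sketch below estimates that CDF by Monte Carlo using only the standard library, assuming the convention that a gene is called zero-inflated when most of its posterior mass lies below 0.5. The gene names and parameter values are hypothetical; in practice `scipy.stats.beta.cdf` gives the same quantity exactly.

```python
import random

def prob_zero_inflated(alpha, beta, threshold=0.5, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(delta < threshold) for delta ~ Beta(alpha, beta)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha, beta) < threshold for _ in range(n_samples))
    return hits / n_samples

# Hypothetical posterior parameters (alpha, beta) for three genes
posteriors = {"GENE_A": (1.0, 5.0), "GENE_B": (5.0, 1.0), "GENE_C": (2.0, 2.0)}

zi_probs = {gene: prob_zero_inflated(a, b) for gene, (a, b) in posteriors.items()}

# Call a gene zero-inflated when its probability exceeds 0.5
zi_genes = [gene for gene, p in zi_probs.items() if p > 0.5]
```
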

## VeloVI

**Purpose**: RNA velocity analysis using variational inference.

**Key Features**:
- Joint modeling of spliced and unspliced RNA counts
- Probabilistic estimation of RNA velocity
- Accounts for technical noise in the spliced and unspliced measurements
- Provides uncertainty quantification for velocity estimates

**When to Use**:
- Inferring cellular dynamics and differentiation trajectories
- Analyzing spliced/unspliced count data
- RNA velocity analysis with per-cell uncertainty estimates

**Basic Usage**:
```python
import scvelo as scv
import scvi

# Prepare velocity data: filter/normalize, then compute smoothed moments,
# which populate the "Ms" (spliced) and "Mu" (unspliced) layers
scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)

# Train VeloVI (exposed under scvi.external in scvi-tools)
scvi.external.VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu")
model = scvi.external.VELOVI(adata)
model.train()

# Get velocity estimates
latent_time = model.get_latent_time()
velocities = model.get_velocity()
```
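
For intuition about what a velocity value means, the sketch below uses the standard RNA velocity definition: the rate of change of spliced abundance is the splicing rate times unspliced abundance minus the degradation rate times spliced abundance. The function name, gene names, and rate values are hypothetical, and this is only the textbook formula, not VeloVI's inference procedure.

```python
def rna_velocity(unspliced, spliced, beta, gamma):
    """Rate of change of spliced abundance: ds/dt = beta * u - gamma * s."""
    return beta * unspliced - gamma * spliced

# Hypothetical abundances and kinetic rates for two genes in one cell
genes = {
    "GENE_A": {"u": 2.0, "s": 1.0, "beta": 1.0, "gamma": 0.5},  # excess unspliced RNA
    "GENE_B": {"u": 0.2, "s": 3.0, "beta": 1.0, "gamma": 0.5},  # excess spliced RNA
}

velocity = {
    name: rna_velocity(g["u"], g["s"], g["beta"], g["gamma"])
    for name, g in genes.items()
}
```

A positive velocity (as for `GENE_A`) indicates that spliced abundance is increasing, and a negative one (as for `GENE_B`) that it is decreasing.
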

## contrastiveVI

**Purpose**: Isolating perturbation-specific variations from background biological variation.

**Key Features**:
- Separates shared (background) variation, common to control and target cells, from target-specific (salient) variation
- Useful for perturbation studies (drug treatments, genetic perturbations)
- Identifies condition-specific gene programs
- Enables discovery of treatment-specific effects

**When to Use**:
- Analyzing perturbation experiments (drug screens, CRISPR, etc.)
- Identifying genes responding specifically to treatments
- Separating treatment effects from background variation
- Comparing control vs. perturbed conditions

**Basic Usage**:
```python
import numpy as np
import scvi

scvi.external.ContrastiveVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch"
)

model = scvi.external.ContrastiveVI(
    adata,
    n_background_latent=10,  # Shared (background) variation
    n_salient_latent=5       # Target-specific (salient) variation
)

# Control and treated cells are passed to train() as integer indices;
# the "condition" column and its values are assumptions about adata.obs
background_indices = np.where(adata.obs["condition"] == "control")[0]
target_indices = np.where(adata.obs["condition"] == "treated")[0]
model.train(background_indices=background_indices, target_indices=target_indices)

# Extract representations
shared = model.get_latent_representation(representation_kind="background")
target_specific = model.get_latent_representation(representation_kind="salient")
```
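
Contrastive analysis depends on a clean split of cells into a background (control) group and a target (perturbed) group. The helper below is a minimal sketch of that split with basic sanity checks; `split_indices` and the condition labels are hypothetical and not part of scvi-tools.

```python
def split_indices(conditions, background_label="control"):
    """Return (background, target) positional indices for a list of condition labels.

    Cells labeled `background_label` form the background group;
    all other cells form the target group.
    """
    background = [i for i, c in enumerate(conditions) if c == background_label]
    target = [i for i, c in enumerate(conditions) if c != background_label]
    if not background or not target:
        raise ValueError("Both a background and a target group are required")
    return background, target

# Hypothetical per-cell condition labels
conditions = ["control", "treated", "treated", "control", "treated"]
background_indices, target_indices = split_indices(conditions)
```

Checking that both groups are non-empty before training avoids a confusing failure later, since the contrast is undefined without control cells.
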

## CellAssign

**Purpose**: Marker-based cell type annotation using known marker genes.

**Key Features**:
- Uses prior knowledge of marker genes for cell types
- Probabilistic assignment of cells to types
- Handles marker gene overlap and ambiguity
- Provides soft assignments with uncertainty

**When to Use**:
- Annotating cells using known marker genes
- Leveraging existing biological knowledge for classification
- Cases where marker gene lists are available but reference datasets are not

**Basic Usage**:
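```python
import numpy as np
import pandas as pd
import scvi

# Create binary marker gene matrix (genes x cell types): rows are genes, columns are cell types
marker_gene_mat = pd.DataFrame({
    "CD4 T cells": [1, 1, 0, 0],  # CD3D, CD4, CD8A, CD19
    "CD8 T cells": [1, 0, 1, 0],
    "B cells": [0, 0, 0, 1]
}, index=["CD3D", "CD4", "CD8A", "CD19"])

# CellAssign expects per-cell size factors and an AnnData restricted to the marker genes
lib_size = np.asarray(adata.layers["counts"].sum(axis=1)).ravel()
adata.obs["size_factor"] = lib_size / lib_size.mean()
bdata = adata[:, marker_gene_mat.index].copy()

scvi.external.CellAssign.setup_anndata(bdata, size_factor_key="size_factor", layer="counts")
model = scvi.external.CellAssign(bdata, marker_gene_mat)
model.train()

# Soft assignments: one probability per cell and cell type
predictions = model.predict()
```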
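
Marker lists are usually curated per cell type rather than as a binary matrix. The sketch below converts such lists into the genes-by-cell-types layout in plain Python; `build_marker_matrix` is a hypothetical helper, and the marker lists are illustrative only.

```python
def build_marker_matrix(markers_by_type):
    """Convert {cell type: [marker genes]} into a binary genes x cell types layout.

    Returns the sorted gene list and a dict mapping each cell type to a
    0/1 column aligned with that gene list.
    """
    genes = sorted({gene for markers in markers_by_type.values() for gene in markers})
    matrix = {
        cell_type: [1 if gene in set(markers) else 0 for gene in genes]
        for cell_type, markers in markers_by_type.items()
    }
    return genes, matrix

# Illustrative marker lists
markers_by_type = {
    "CD4 T cells": ["CD3D", "CD4"],
    "CD8 T cells": ["CD3D", "CD8A"],
    "B cells": ["CD19"],
}
genes, matrix = build_marker_matrix(markers_by_type)
```

The result can be wrapped as `pd.DataFrame(matrix, index=genes)` to obtain a DataFrame with genes as rows and cell types as columns.
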

## Solo (Doublet Detection)

**Purpose**: Identifying doublets (droplets that captured two or more cells) in scRNA-seq data.

**Key Features**:
- Semi-supervised doublet detection using scVI embeddings
- Simulates artificial doublets for training
- Provides doublet probability scores
- Builds on a trained scVI model via `from_scvi_model`

**When to Use**:
- Quality control of scRNA-seq datasets
- Removing doublets before downstream analysis
- Assessing doublet rates in your data

**Basic Usage**:
```python
import scvi

# Train scVI model first
scvi.model.SCVI.setup_anndata(adata, layer="counts")
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()

# Train Solo for doublet detection
solo_model = scvi.external.SOLO.from_scvi_model(scvi_model)
solo_model.train()

# Predict doublets: returns a DataFrame with "doublet" and "singlet" columns
predictions = solo_model.predict()
doublet_scores = predictions["doublet"]
adata.obs["doublet_score"] = doublet_scores
```
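
Doublet scores become useful for filtering once they are thresholded. The sketch below shows one simple way to do that and to estimate the doublet rate, assuming the scores are probabilities between 0 and 1. `call_doublets`, the 0.5 cutoff, and the example scores are hypothetical.

```python
def call_doublets(scores, threshold=0.5):
    """Label each cell as 'doublet' or 'singlet' and report the doublet rate.

    scores: dict of cell ID -> doublet score (assumed to lie in [0, 1]).
    """
    calls = {
        cell: "doublet" if score > threshold else "singlet"
        for cell, score in scores.items()
    }
    rate = sum(call == "doublet" for call in calls.values()) / len(calls)
    return calls, rate

# Hypothetical doublet scores for four cells
scores = {"cell_1": 0.92, "cell_2": 0.08, "cell_3": 0.40, "cell_4": 0.61}
calls, doublet_rate = call_doublets(scores)
```

Comparing the estimated rate with the rate expected for the sequencing protocol is a quick sanity check on the chosen threshold.
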

## Amortized LDA (Topic Modeling)

**Purpose**: Topic modeling for gene expression using Latent Dirichlet Allocation.

**Key Features**:
- Discovers gene expression programs (topics)
- Amortized variational inference for scalability
- Each cell is a mixture of topics
- Each topic is a distribution over genes

**When to Use**:
- Discovering gene programs or expression modules
- Understanding compositional structure of expression
- Alternative dimensionality reduction approach
- Interpretable decomposition of expression patterns

**Basic Usage**:
```python
import scvi

scvi.model.AmortizedLDA.setup_anndata(adata, layer="counts")
model = scvi.model.AmortizedLDA(adata, n_topics=10)
model.train()

# Get topic compositions per cell
topic_proportions = model.get_latent_representation()

# Get gene loadings per topic (genes x topics)
topic_gene_loadings = model.get_feature_by_topic()
```
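
The two LDA ingredients, a per-cell mixture over topics and a per-topic distribution over genes, combine into an expected gene distribution for each cell. The sketch below works through that combination for a toy example with two topics and three genes; the function name and all numbers are hypothetical.

```python
def expected_gene_distribution(topic_proportions, topic_gene_dist):
    """Mix per-topic gene distributions using a cell's topic proportions.

    topic_proportions: list of topic weights for one cell (sums to 1).
    topic_gene_dist: list of per-topic gene distributions (each sums to 1).
    """
    n_genes = len(topic_gene_dist[0])
    return [
        sum(weight * topic[g] for weight, topic in zip(topic_proportions, topic_gene_dist))
        for g in range(n_genes)
    ]

# Two hypothetical topics over three genes
topics = [
    [0.7, 0.2, 0.1],  # topic 0 favors gene 0
    [0.1, 0.1, 0.8],  # topic 1 favors gene 2
]
cell_mixture = [0.25, 0.75]  # this cell is mostly topic 1

gene_dist = expected_gene_distribution(cell_mixture, topics)
```

Because both inputs are probability distributions, the result is one as well, which is what makes the decomposition interpretable.
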

## Model Selection Guidelines

**Choose scVI when**:
- Starting with unsupervised analysis
- Need batch correction and integration
- Want normalized expression and DE analysis

**Choose scANVI when**:
- Have some labeled cells for training
- Need cell type annotation
- Want to transfer labels from reference to query

**Choose AUTOZI when**:
- Concerned about technical dropout
- Need to identify zero-inflated genes
- Working with very sparse datasets

**Choose VeloVI when**:
- Have spliced/unspliced count data
- Interested in cellular dynamics
- Need RNA velocity with uncertainty estimates

**Choose contrastiveVI when**:
- Analyzing perturbation experiments
- Need to separate treatment effects
- Want to identify condition-specific programs

**Choose CellAssign when**:
- Have marker gene lists available
- Want probabilistic marker-based annotation
- No reference dataset available

**Choose Solo when**:
- Need doublet detection
- Already using scVI for analysis
- Want probabilistic doublet scores
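
The guidelines above can be read as a small decision procedure. The function below is one possible encoding of them, written as a hedged sketch: `suggest_models` and its flag names are hypothetical, and real projects often combine several models or weigh criteria this helper ignores.

```python
def suggest_models(
    has_partial_labels=False,
    has_marker_genes=False,
    has_spliced_unspliced=False,
    is_perturbation_study=False,
    worried_about_dropout=False,
    needs_doublet_detection=False,
):
    """Return candidate model names for a dataset, following the guidelines above."""
    suggestions = []
    if has_partial_labels:
        suggestions.append("scANVI")
    elif has_marker_genes:
        # Marker-based annotation is the fallback when no labels are available
        suggestions.append("CellAssign")
    if has_spliced_unspliced:
        suggestions.append("VeloVI")
    if is_perturbation_study:
        suggestions.append("contrastiveVI")
    if worried_about_dropout:
        suggestions.append("AUTOZI")
    if needs_doublet_detection:
        suggestions.append("Solo")
    # scVI is the unsupervised starting point and the base model for Solo
    if not suggestions or needs_doublet_detection:
        suggestions.insert(0, "scVI")
    return suggestions

# Example: an unlabeled dataset with marker lists that also needs doublet removal
plan = suggest_models(has_marker_genes=True, needs_doublet_detection=True)
```
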