Initial commit
This commit is contained in:
184
skills/scvi-tools/SKILL.md
Normal file
184
skills/scvi-tools/SKILL.md
Normal file
@@ -0,0 +1,184 @@
|
||||
---
|
||||
name: scvi-tools
|
||||
description: This skill should be used when working with single-cell omics data analysis using scvi-tools, including scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics, and other single-cell modalities. Use this skill for probabilistic modeling, batch correction, dimensionality reduction, differential expression, cell type annotation, multimodal integration, and spatial analysis tasks.
|
||||
---
|
||||
|
||||
# scvi-tools
|
||||
|
||||
## Overview
|
||||
|
||||
scvi-tools is a comprehensive Python framework for probabilistic models in single-cell genomics. Built on PyTorch and PyTorch Lightning, it provides deep generative models using variational inference for analyzing diverse single-cell data modalities.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Analyzing single-cell RNA-seq data (dimensionality reduction, batch correction, integration)
|
||||
- Working with single-cell ATAC-seq or chromatin accessibility data
|
||||
- Integrating multimodal data (CITE-seq, multiome, paired/unpaired datasets)
|
||||
- Analyzing spatial transcriptomics data (deconvolution, spatial mapping)
|
||||
- Performing differential expression analysis on single-cell data
|
||||
- Conducting cell type annotation or transfer learning tasks
|
||||
- Working with specialized single-cell modalities (methylation, cytometry, RNA velocity)
|
||||
- Building custom probabilistic models for single-cell analysis
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
scvi-tools provides models organized by data modality:
|
||||
|
||||
### 1. Single-Cell RNA-seq Analysis
|
||||
Core models for expression analysis, batch correction, and integration. See `references/models-scrna-seq.md` for:
|
||||
- **scVI**: Unsupervised dimensionality reduction and batch correction
|
||||
- **scANVI**: Semi-supervised cell type annotation and integration
|
||||
- **AUTOZI**: Zero-inflation detection and modeling
|
||||
- **VeloVI**: RNA velocity analysis
|
||||
- **contrastiveVI**: Perturbation effect isolation
|
||||
|
||||
### 2. Chromatin Accessibility (ATAC-seq)
|
||||
Models for analyzing single-cell chromatin data. See `references/models-atac-seq.md` for:
|
||||
- **PeakVI**: Peak-based ATAC-seq analysis and integration
|
||||
- **PoissonVI**: Quantitative fragment count modeling
|
||||
- **scBasset**: Deep learning approach with motif analysis
|
||||
|
||||
### 3. Multimodal & Multi-omics Integration
|
||||
Joint analysis of multiple data types. See `references/models-multimodal.md` for:
|
||||
- **totalVI**: CITE-seq protein and RNA joint modeling
|
||||
- **MultiVI**: Paired and unpaired multi-omic integration
|
||||
- **MrVI**: Multi-resolution cross-sample analysis
|
||||
|
||||
### 4. Spatial Transcriptomics
|
||||
Spatially-resolved transcriptomics analysis. See `references/models-spatial.md` for:
|
||||
- **DestVI**: Multi-resolution spatial deconvolution
|
||||
- **Stereoscope**: Cell type deconvolution
|
||||
- **Tangram**: Spatial mapping and integration
|
||||
- **scVIVA**: Cell-environment relationship analysis
|
||||
|
||||
### 5. Specialized Modalities
|
||||
Additional specialized analysis tools. See `references/models-specialized.md` for:
|
||||
- **MethylVI/MethylANVI**: Single-cell methylation analysis
|
||||
- **CytoVI**: Flow/mass cytometry batch correction
|
||||
- **Solo**: Doublet detection
|
||||
- **CellAssign**: Marker-based cell type annotation
|
||||
|
||||
## Typical Workflow
|
||||
|
||||
All scvi-tools models follow a consistent API pattern:
|
||||
|
||||
```python
|
||||
# 1. Load and preprocess data (AnnData format)
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
|
||||
adata = scvi.data.heart_cell_atlas_subsampled()
|
||||
sc.pp.filter_genes(adata, min_counts=3)
|
||||
sc.pp.highly_variable_genes(adata, n_top_genes=1200)
|
||||
|
||||
# 2. Register data with model (specify layers, covariates)
|
||||
scvi.model.SCVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts", # Use raw counts, not log-normalized
|
||||
batch_key="batch",
|
||||
categorical_covariate_keys=["donor"],
|
||||
continuous_covariate_keys=["percent_mito"]
|
||||
)
|
||||
|
||||
# 3. Create and train model
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train()
|
||||
|
||||
# 4. Extract latent representations and normalized values
|
||||
latent = model.get_latent_representation()
|
||||
normalized = model.get_normalized_expression(library_size=1e4)
|
||||
|
||||
# 5. Store in AnnData for downstream analysis
|
||||
adata.obsm["X_scVI"] = latent
|
||||
adata.layers["scvi_normalized"] = normalized
|
||||
|
||||
# 6. Downstream analysis with scanpy
|
||||
sc.pp.neighbors(adata, use_rep="X_scVI")
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata)
|
||||
```
|
||||
|
||||
**Key Design Principles:**
|
||||
- **Raw counts required**: Models expect unnormalized count data for optimal performance
|
||||
- **Unified API**: Consistent interface across all models (setup → train → extract)
|
||||
- **AnnData-centric**: Seamless integration with the scanpy ecosystem
|
||||
- **GPU acceleration**: Automatic utilization of available GPUs
|
||||
- **Batch correction**: Handle technical variation through covariate registration
|
||||
|
||||
## Common Analysis Tasks
|
||||
|
||||
### Differential Expression
|
||||
Probabilistic DE analysis using the learned generative models:
|
||||
|
||||
```python
|
||||
de_results = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="TypeA",
|
||||
group2="TypeB",
|
||||
mode="change", # Use composite hypothesis testing
|
||||
delta=0.25 # Minimum effect size threshold
|
||||
)
|
||||
```
|
||||
|
||||
See `references/differential-expression.md` for detailed methodology and interpretation.
|
||||
|
||||
### Model Persistence
|
||||
Save and load trained models:
|
||||
|
||||
```python
|
||||
# Save model
|
||||
model.save("./model_directory", overwrite=True)
|
||||
|
||||
# Load model
|
||||
model = scvi.model.SCVI.load("./model_directory", adata=adata)
|
||||
```
|
||||
|
||||
### Batch Correction and Integration
|
||||
Integrate datasets across batches or studies:
|
||||
|
||||
```python
|
||||
# Register batch information
|
||||
scvi.model.SCVI.setup_anndata(adata, batch_key="study")
|
||||
|
||||
# Model automatically learns batch-corrected representations
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train()
|
||||
latent = model.get_latent_representation() # Batch-corrected
|
||||
```
|
||||
|
||||
## Theoretical Foundations
|
||||
|
||||
scvi-tools is built on:
|
||||
- **Variational inference**: Approximate posterior distributions for scalable Bayesian inference
|
||||
- **Deep generative models**: VAE architectures that learn complex data distributions
|
||||
- **Amortized inference**: Shared neural networks for efficient learning across cells
|
||||
- **Probabilistic modeling**: Principled uncertainty quantification and statistical testing
|
||||
|
||||
See `references/theoretical-foundations.md` for detailed background on the mathematical framework.
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Workflows**: `references/workflows.md` contains common workflows, best practices, hyperparameter tuning, and GPU optimization
|
||||
- **Model References**: Detailed documentation for each model category in the `references/` directory
|
||||
- **Official Documentation**: https://docs.scvi-tools.org/en/stable/
|
||||
- **Tutorials**: https://docs.scvi-tools.org/en/stable/tutorials/index.html
|
||||
- **API Reference**: https://docs.scvi-tools.org/en/stable/api/index.html
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
uv pip install scvi-tools
|
||||
# For GPU support
|
||||
uv pip install scvi-tools[cuda]
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use raw counts**: Always provide unnormalized count data to models
|
||||
2. **Filter genes**: Remove low-count genes before analysis (e.g., `min_counts=3`)
|
||||
3. **Register covariates**: Include known technical factors (batch, donor, etc.) in `setup_anndata`
|
||||
4. **Feature selection**: Use highly variable genes for improved performance
|
||||
5. **Model saving**: Always save trained models to avoid retraining
|
||||
6. **GPU usage**: Enable GPU acceleration for large datasets (`accelerator="gpu"`)
|
||||
7. **Scanpy integration**: Store outputs in AnnData objects for downstream analysis
|
||||
581
skills/scvi-tools/references/differential-expression.md
Normal file
581
skills/scvi-tools/references/differential-expression.md
Normal file
@@ -0,0 +1,581 @@
|
||||
# Differential Expression Analysis in scvi-tools
|
||||
|
||||
This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework.
|
||||
|
||||
## Overview
|
||||
|
||||
scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods:
|
||||
|
||||
- **Batch correction**: DE testing on batch-corrected representations
|
||||
- **Uncertainty quantification**: Probabilistic estimates of effect sizes
|
||||
- **Zero-inflation handling**: Proper modeling of dropout and zeros
|
||||
- **Flexible comparisons**: Between any groups or cell types
|
||||
- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI)
|
||||
|
||||
## Core Statistical Framework
|
||||
|
||||
### Problem Definition
|
||||
|
||||
The goal is to estimate the log fold-change in expression between two conditions:
|
||||
|
||||
```
|
||||
log fold-change = log(μ_B) - log(μ_A)
|
||||
```
|
||||
|
||||
Where μ_A and μ_B are the mean expression levels in conditions A and B.
|
||||
|
||||
### Three-Stage Process
|
||||
|
||||
**Stage 1: Estimating Expression Levels**
|
||||
- Sample from posterior distribution of cellular states
|
||||
- Generate expression values from the learned generative model
|
||||
- Aggregate across cells to get population-level estimates
|
||||
|
||||
**Stage 2: Detecting Relevant Features (Hypothesis Testing)**
|
||||
- Test for differential expression using Bayesian framework
|
||||
- Two testing modes available:
|
||||
- **"vanilla" mode**: Point null hypothesis (β = 0)
|
||||
- **"change" mode**: Composite hypothesis (|β| ≤ δ)
|
||||
|
||||
**Stage 3: Controlling False Discovery**
|
||||
- Posterior expected False Discovery Proportion (FDP) control
|
||||
- Selects maximum number of discoveries ensuring E[FDP] ≤ α
|
||||
|
||||
## Basic Usage
|
||||
|
||||
### Simple Two-Group Comparison
|
||||
|
||||
```python
|
||||
import scvi
|
||||
|
||||
# After training a model
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train()
|
||||
|
||||
# Compare two cell types
|
||||
de_results = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells"
|
||||
)
|
||||
|
||||
# View top DE genes
|
||||
top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20)
|
||||
print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]])
|
||||
```
|
||||
|
||||
### One vs. Rest Comparison
|
||||
|
||||
```python
|
||||
# Compare one group against all others
|
||||
de_results = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells" # No group2 = compare to rest
|
||||
)
|
||||
```
|
||||
|
||||
### All Pairwise Comparisons
|
||||
|
||||
```python
|
||||
# Compare all cell types pairwise
|
||||
all_comparisons = {}
|
||||
|
||||
cell_types = adata.obs["cell_type"].unique()
|
||||
|
||||
for ct1 in cell_types:
|
||||
for ct2 in cell_types:
|
||||
if ct1 != ct2:
|
||||
key = f"{ct1}_vs_{ct2}"
|
||||
all_comparisons[key] = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1=ct1,
|
||||
group2=ct2
|
||||
)
|
||||
```
|
||||
|
||||
## Key Parameters
|
||||
|
||||
### `groupby` (required)
|
||||
Column in `adata.obs` defining groups to compare.
|
||||
|
||||
```python
|
||||
# Must be a categorical variable
|
||||
de_results = model.differential_expression(groupby="cell_type")
|
||||
```
|
||||
|
||||
### `group1` and `group2`
|
||||
Groups to compare. If `group2` is None, compares `group1` to all others.
|
||||
|
||||
```python
|
||||
# Specific comparison
|
||||
de = model.differential_expression(groupby="condition", group1="treated", group2="control")
|
||||
|
||||
# One vs rest
|
||||
de = model.differential_expression(groupby="cell_type", group1="T cells")
|
||||
```
|
||||
|
||||
### `mode` (Hypothesis Testing Mode)
|
||||
|
||||
**"vanilla" mode** (default): Point null hypothesis
|
||||
- Tests if β = 0 exactly
|
||||
- More sensitive, but may find trivially small effects
|
||||
|
||||
**"change" mode**: Composite null hypothesis
|
||||
- Tests if |β| ≤ δ
|
||||
- Requires biologically meaningful change
|
||||
- Reduces false discoveries of tiny effects
|
||||
|
||||
```python
|
||||
# Change mode with minimum effect size
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells",
|
||||
mode="change",
|
||||
delta=0.25 # Minimum log fold-change
|
||||
)
|
||||
```
|
||||
|
||||
### `delta`
|
||||
Minimum effect size threshold for "change" mode.
|
||||
- Typical values: 0.25, 0.5, 0.7 (log scale)
|
||||
- log2(1.5) ≈ 0.58 (1.5-fold change)
|
||||
- log2(2) = 1.0 (2-fold change)
|
||||
|
||||
```python
|
||||
# Require at least 1.5-fold change
|
||||
de = model.differential_expression(
|
||||
groupby="condition",
|
||||
group1="disease",
|
||||
group2="healthy",
|
||||
mode="change",
|
||||
delta=0.58 # log2(1.5)
|
||||
)
|
||||
```
|
||||
|
||||
### `fdr_target`
|
||||
False discovery rate threshold (default: 0.05)
|
||||
|
||||
```python
|
||||
# More stringent FDR control
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
fdr_target=0.01
|
||||
)
|
||||
```
|
||||
|
||||
### `batch_correction`
|
||||
Whether to perform batch correction during DE testing (default: True)
|
||||
|
||||
```python
|
||||
# Test within a specific batch
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells",
|
||||
batch_correction=False
|
||||
)
|
||||
```
|
||||
|
||||
### `n_samples`
|
||||
Number of posterior samples for estimation (default: 5000)
|
||||
- More samples = more accurate but slower
|
||||
- Reduce for speed, increase for precision
|
||||
|
||||
```python
|
||||
# High precision analysis
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
n_samples=10000
|
||||
)
|
||||
```
|
||||
|
||||
## Interpreting Results
|
||||
|
||||
### Output Columns
|
||||
|
||||
The results DataFrame contains several important columns:
|
||||
|
||||
**Effect Size Estimates**:
|
||||
- `lfc_mean`: Mean log fold-change
|
||||
- `lfc_median`: Median log fold-change
|
||||
- `lfc_std`: Standard deviation of log fold-change
|
||||
- `lfc_min`: Lower bound of effect size
|
||||
- `lfc_max`: Upper bound of effect size
|
||||
|
||||
**Statistical Significance**:
|
||||
- `bayes_factor`: Bayes factor for differential expression
|
||||
- Higher values = stronger evidence
|
||||
- >3 often considered meaningful
|
||||
- `is_de_fdr_0.05`: Boolean indicating if gene is DE at FDR 0.05
|
||||
- `is_de_fdr_0.1`: Boolean indicating if gene is DE at FDR 0.1
|
||||
|
||||
**Expression Levels**:
|
||||
- `mean1`: Mean expression in group 1
|
||||
- `mean2`: Mean expression in group 2
|
||||
- `non_zeros_proportion1`: Proportion of non-zero cells in group 1
|
||||
- `non_zeros_proportion2`: Proportion of non-zero cells in group 2
|
||||
|
||||
### Example Interpretation
|
||||
|
||||
```python
|
||||
de_results = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells"
|
||||
)
|
||||
|
||||
# Find significantly upregulated genes in T cells
|
||||
upreg_tcells = de_results[
|
||||
(de_results["is_de_fdr_0.05"]) &
|
||||
(de_results["lfc_mean"] > 0)
|
||||
].sort_values("lfc_mean", ascending=False)
|
||||
|
||||
print(f"Upregulated genes in T cells: {len(upreg_tcells)}")
|
||||
print(upreg_tcells.head(10))
|
||||
|
||||
# Find genes with large effect sizes
|
||||
large_effect = de_results[
|
||||
(de_results["is_de_fdr_0.05"]) &
|
||||
(abs(de_results["lfc_mean"]) > 1) # 2-fold change
|
||||
]
|
||||
```
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### DE Within Specific Cells
|
||||
|
||||
```python
|
||||
# Test DE only within a subset of cells
|
||||
subset_indices = adata.obs["tissue"] == "lung"
|
||||
|
||||
de = model.differential_expression(
|
||||
idx1=adata.obs["cell_type"] == "T cells" & subset_indices,
|
||||
idx2=adata.obs["cell_type"] == "B cells" & subset_indices
|
||||
)
|
||||
```
|
||||
|
||||
### Batch-Specific DE
|
||||
|
||||
```python
|
||||
# Test DE within each batch separately
|
||||
batches = adata.obs["batch"].unique()
|
||||
|
||||
batch_de_results = {}
|
||||
for batch in batches:
|
||||
batch_idx = adata.obs["batch"] == batch
|
||||
batch_de_results[batch] = model.differential_expression(
|
||||
idx1=(adata.obs["condition"] == "treated") & batch_idx,
|
||||
idx2=(adata.obs["condition"] == "control") & batch_idx
|
||||
)
|
||||
```
|
||||
|
||||
### Pseudo-bulk DE
|
||||
|
||||
```python
|
||||
# Aggregate cells before DE testing
|
||||
# Useful for low cell counts per group
|
||||
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="rare_cell_type",
|
||||
group2="common_cell_type",
|
||||
n_samples=10000, # More samples for stability
|
||||
batch_correction=True
|
||||
)
|
||||
```
|
||||
|
||||
## Visualization
|
||||
|
||||
### Volcano Plot
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
de = model.differential_expression(
|
||||
groupby="condition",
|
||||
group1="treated",
|
||||
group2="control"
|
||||
)
|
||||
|
||||
# Volcano plot
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.scatter(
|
||||
de["lfc_mean"],
|
||||
-np.log10(1 / (de["bayes_factor"] + 1)),
|
||||
c=de["is_de_fdr_0.05"],
|
||||
cmap="coolwarm",
|
||||
alpha=0.5
|
||||
)
|
||||
plt.xlabel("Log Fold Change")
|
||||
plt.ylabel("-log10(1/Bayes Factor)")
|
||||
plt.title("Volcano Plot: Treated vs Control")
|
||||
plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Heatmap of Top DE Genes
|
||||
|
||||
```python
|
||||
import seaborn as sns
|
||||
|
||||
# Get top DE genes
|
||||
top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index
|
||||
|
||||
# Get normalized expression
|
||||
norm_expr = model.get_normalized_expression(
|
||||
adata,
|
||||
indices=adata.obs["condition"].isin(["treated", "control"]),
|
||||
gene_list=top_genes
|
||||
)
|
||||
|
||||
# Plot heatmap
|
||||
plt.figure(figsize=(12, 10))
|
||||
sns.heatmap(
|
||||
norm_expr.T,
|
||||
cmap="viridis",
|
||||
xticklabels=False,
|
||||
yticklabels=top_genes
|
||||
)
|
||||
plt.title("Top 50 DE Genes")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Ranked Gene Plot
|
||||
|
||||
```python
|
||||
# Plot genes ranked by effect size
|
||||
de_sorted = de.sort_values("lfc_mean", ascending=False)
|
||||
|
||||
plt.figure(figsize=(12, 6))
|
||||
plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values)
|
||||
plt.axhline(y=0, color='r', linestyle='--')
|
||||
plt.xlabel("Gene Rank")
|
||||
plt.ylabel("Log Fold Change")
|
||||
plt.title("Genes Ranked by Effect Size")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Comparison with Traditional Methods
|
||||
|
||||
### scvi-tools vs. Wilcoxon Test
|
||||
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# Traditional Wilcoxon test
|
||||
sc.tl.rank_genes_groups(
|
||||
adata,
|
||||
groupby="cell_type",
|
||||
method="wilcoxon",
|
||||
key_added="wilcoxon"
|
||||
)
|
||||
|
||||
# scvi-tools DE
|
||||
de_scvi = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells"
|
||||
)
|
||||
|
||||
# Compare results
|
||||
wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon")
|
||||
```
|
||||
|
||||
**Advantages of scvi-tools**:
|
||||
- Accounts for batch effects automatically
|
||||
- Handles zero-inflation properly
|
||||
- Provides uncertainty quantification
|
||||
- No arbitrary pseudocount needed
|
||||
- Better statistical properties
|
||||
|
||||
**When to use Wilcoxon**:
|
||||
- Very quick exploratory analysis
|
||||
- Comparison with published results using Wilcoxon
|
||||
|
||||
## Multi-Modal DE
|
||||
|
||||
### Protein DE (totalVI)
|
||||
|
||||
```python
|
||||
# Train totalVI on CITE-seq data
|
||||
totalvi_model = scvi.model.TOTALVI(adata)
|
||||
totalvi_model.train()
|
||||
|
||||
# RNA differential expression
|
||||
rna_de = totalvi_model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells",
|
||||
protein_expression=False # Default
|
||||
)
|
||||
|
||||
# Protein differential expression
|
||||
protein_de = totalvi_model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells",
|
||||
protein_expression=True
|
||||
)
|
||||
|
||||
print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}")
|
||||
print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}")
|
||||
```
|
||||
|
||||
### Differential Accessibility (PeakVI)
|
||||
|
||||
```python
|
||||
# Train PeakVI on ATAC-seq data
|
||||
peakvi_model = scvi.model.PEAKVI(atac_adata)
|
||||
peakvi_model.train()
|
||||
|
||||
# Differential accessibility
|
||||
da = peakvi_model.differential_accessibility(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
group2="B cells"
|
||||
)
|
||||
|
||||
# Same interpretation as DE
|
||||
```
|
||||
|
||||
## Handling Special Cases
|
||||
|
||||
### Low Cell Count Groups
|
||||
|
||||
```python
|
||||
# Increase posterior samples for stability
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="rare_type", # e.g., 50 cells
|
||||
group2="common_type", # e.g., 5000 cells
|
||||
n_samples=10000
|
||||
)
|
||||
```
|
||||
|
||||
### Imbalanced Comparisons
|
||||
|
||||
```python
|
||||
# When groups have very different sizes
|
||||
# Use change mode to avoid tiny effects
|
||||
|
||||
de = model.differential_expression(
|
||||
groupby="condition",
|
||||
group1="rare_condition",
|
||||
group2="common_condition",
|
||||
mode="change",
|
||||
delta=0.5
|
||||
)
|
||||
```
|
||||
|
||||
### Multiple Testing Correction
|
||||
|
||||
```python
|
||||
# Already included via FDP control
|
||||
# But can apply additional corrections
|
||||
|
||||
from statsmodels.stats.multitest import multipletests
|
||||
|
||||
# Bonferroni correction (very conservative)
|
||||
_, pvals_corrected, _, _ = multipletests(
|
||||
1 / (de["bayes_factor"] + 1),
|
||||
method="bonferroni"
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Speed Optimization
|
||||
|
||||
```python
|
||||
# Faster DE testing for large datasets
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="T cells",
|
||||
n_samples=1000, # Reduce samples
|
||||
batch_size=512 # Increase batch size
|
||||
)
|
||||
```
|
||||
|
||||
### Memory Management
|
||||
|
||||
```python
|
||||
# For very large datasets
|
||||
# Test one comparison at a time rather than all pairwise
|
||||
|
||||
cell_types = adata.obs["cell_type"].unique()
|
||||
for ct in cell_types:
|
||||
de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1=ct
|
||||
)
|
||||
# Save results
|
||||
de.to_csv(f"de_results_{ct}.csv")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use "change" mode**: For biologically interpretable results
|
||||
2. **Set appropriate delta**: Based on biological significance
|
||||
3. **Check expression levels**: Filter lowly expressed genes
|
||||
4. **Validate findings**: Check marker genes for sanity
|
||||
5. **Visualize results**: Always plot top DE genes
|
||||
6. **Report parameters**: Document mode, delta, FDR used
|
||||
7. **Consider batch effects**: Use batch_correction=True
|
||||
8. **Multiple comparisons**: Be aware of testing many groups
|
||||
9. **Sample size**: Ensure sufficient cells per group (>50 recommended)
|
||||
10. **Biological validation**: Follow up with functional experiments
|
||||
|
||||
## Example: Complete DE Analysis Workflow
|
||||
|
||||
```python
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# 1. Train model
|
||||
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train()
|
||||
|
||||
# 2. Perform DE analysis
|
||||
de_results = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
group1="Disease_T_cells",
|
||||
group2="Healthy_T_cells",
|
||||
mode="change",
|
||||
delta=0.5,
|
||||
fdr_target=0.05
|
||||
)
|
||||
|
||||
# 3. Filter and analyze
|
||||
sig_genes = de_results[de_results["is_de_fdr_0.05"]]
|
||||
upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False)
|
||||
downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean")
|
||||
|
||||
print(f"Significant genes: {len(sig_genes)}")
|
||||
print(f"Upregulated: {len(upreg)}")
|
||||
print(f"Downregulated: {len(downreg)}")
|
||||
|
||||
# 4. Visualize top genes
|
||||
top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist()
|
||||
|
||||
sc.pl.violin(
|
||||
adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])],
|
||||
keys=top_genes,
|
||||
groupby="cell_type",
|
||||
rotation=90
|
||||
)
|
||||
|
||||
# 5. Functional enrichment (using external tools)
|
||||
# E.g., g:Profiler, DAVID, or gprofiler-official Python package
|
||||
upreg_genes = upreg.head(100).index.tolist()
|
||||
# Perform pathway analysis...
|
||||
|
||||
# 6. Save results
|
||||
de_results.to_csv("de_results_disease_vs_healthy.csv")
|
||||
upreg.to_csv("upregulated_genes.csv")
|
||||
downreg.to_csv("downregulated_genes.csv")
|
||||
```
|
||||
321
skills/scvi-tools/references/models-atac-seq.md
Normal file
321
skills/scvi-tools/references/models-atac-seq.md
Normal file
@@ -0,0 +1,321 @@
|
||||
# ATAC-seq and Chromatin Accessibility Models
|
||||
|
||||
This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.
|
||||
|
||||
## PeakVI
|
||||
|
||||
**Purpose**: Analysis and integration of single-cell ATAC-seq data using peak counts.
|
||||
|
||||
**Key Features**:
|
||||
- Variational autoencoder specifically designed for scATAC-seq peak data
|
||||
- Learns low-dimensional representations of chromatin accessibility
|
||||
- Performs batch correction across samples
|
||||
- Enables differential accessibility testing
|
||||
- Integrates multiple ATAC-seq datasets
|
||||
|
||||
**When to Use**:
|
||||
- Analyzing scATAC-seq peak count matrices
|
||||
- Integrating multiple ATAC-seq experiments
|
||||
- Batch correction of chromatin accessibility data
|
||||
- Dimensionality reduction for ATAC-seq
|
||||
- Differential accessibility analysis between cell types or conditions
|
||||
|
||||
**Data Requirements**:
|
||||
- Peak count matrix (cells × peaks)
|
||||
- Binary or count data for peak accessibility
|
||||
- Batch/sample annotations (optional, for batch correction)
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import scvi
|
||||
|
||||
# Prepare data (peaks should be in adata.X)
|
||||
# Optional: filter peaks
|
||||
sc.pp.filter_genes(adata, min_cells=3)
|
||||
|
||||
# Setup data
|
||||
scvi.model.PEAKVI.setup_anndata(
|
||||
adata,
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
# Train model
|
||||
model = scvi.model.PEAKVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get latent representation (batch-corrected)
|
||||
latent = model.get_latent_representation()
|
||||
adata.obsm["X_PeakVI"] = latent
|
||||
|
||||
# Differential accessibility
|
||||
da_results = model.differential_accessibility(
|
||||
groupby="cell_type",
|
||||
group1="TypeA",
|
||||
group2="TypeB"
|
||||
)
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Dimensionality of latent space (default: 10)
|
||||
- `n_hidden`: Number of nodes per hidden layer (default: 128)
|
||||
- `n_layers`: Number of hidden layers (default: 1)
|
||||
- `region_factors`: Whether to learn region-specific factors (default: True)
|
||||
- `latent_distribution`: Distribution for latent space ("normal" or "ln")
|
||||
|
||||
**Outputs**:
|
||||
- `get_latent_representation()`: Low-dimensional embeddings for cells
|
||||
- `get_accessibility_estimates()`: Normalized accessibility values
|
||||
- `differential_accessibility()`: Statistical testing for differential peaks
|
||||
- `get_region_factors()`: Peak-specific scaling factors
|
||||
|
||||
**Best Practices**:
|
||||
1. Filter out low-quality peaks (present in very few cells)
|
||||
2. Include batch information if integrating multiple samples
|
||||
3. Use latent representations for clustering and UMAP visualization
|
||||
4. Consider using `region_factors=True` for datasets with high technical variation
|
||||
5. Store latent embeddings in `adata.obsm` for downstream analysis with scanpy
|
||||
|
||||
## PoissonVI
|
||||
|
||||
**Purpose**: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts).
|
||||
|
||||
**Key Features**:
|
||||
- Models fragment counts directly (not just peak presence/absence)
|
||||
- Poisson distribution for count data
|
||||
- Captures quantitative differences in accessibility
|
||||
- Enables fine-grained analysis of chromatin state
|
||||
|
||||
**When to Use**:
|
||||
- Analyzing fragment-level ATAC-seq data
|
||||
- Need quantitative accessibility measurements
|
||||
- Higher resolution analysis than binary peak calls
|
||||
- Investigating gradual changes in chromatin accessibility
|
||||
|
||||
**Data Requirements**:
|
||||
- Fragment count matrix (cells × genomic regions)
|
||||
- Count data (not binary)
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.POISSONVI.setup_anndata(
|
||||
adata,
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
model = scvi.model.POISSONVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get results
|
||||
latent = model.get_latent_representation()
|
||||
accessibility = model.get_accessibility_estimates()
|
||||
```
|
||||
|
||||
**Key Differences from PeakVI**:
|
||||
- **PeakVI**: Best for standard peak count matrices, faster
|
||||
- **PoissonVI**: Best for quantitative fragment counts, more detailed
|
||||
|
||||
**When to Choose PoissonVI over PeakVI**:
|
||||
- Working with fragment counts rather than called peaks
|
||||
- Need to capture quantitative differences
|
||||
- Have high-quality, high-coverage data
|
||||
- Interested in subtle accessibility changes
|
||||
|
||||
## scBasset
|
||||
|
||||
**Purpose**: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.
|
||||
|
||||
**Key Features**:
|
||||
- Convolutional neural network (CNN) architecture for sequence-based analysis
|
||||
- Models raw DNA sequences, not just peak counts
|
||||
- Enables motif discovery and transcription factor (TF) binding prediction
|
||||
- Provides interpretable feature importance
|
||||
- Performs batch correction
|
||||
|
||||
**When to Use**:
|
||||
- Want to incorporate DNA sequence information
|
||||
- Interested in TF motif analysis
|
||||
- Need interpretable models (which sequences drive accessibility)
|
||||
- Analyzing regulatory elements and TF binding sites
|
||||
- Predicting accessibility from sequence alone
|
||||
|
||||
**Data Requirements**:
|
||||
- Peak sequences (extracted from genome)
|
||||
- Peak accessibility matrix
|
||||
- Genome reference (for sequence extraction)
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# scBasset requires sequence information
|
||||
# First, extract sequences for peaks
|
||||
from scbasset import utils
|
||||
sequences = utils.fetch_sequences(adata, genome="hg38")
|
||||
|
||||
# Setup and train
|
||||
scvi.model.SCBASSET.setup_anndata(
|
||||
adata,
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
model = scvi.model.SCBASSET(adata, sequences=sequences)
|
||||
model.train()
|
||||
|
||||
# Get latent representation
|
||||
latent = model.get_latent_representation()
|
||||
|
||||
# Interpret model: which sequences/motifs are important
|
||||
importance_scores = model.get_feature_importance()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Latent space dimensionality
|
||||
- `conv_layers`: Number of convolutional layers
|
||||
- `n_filters`: Number of filters per conv layer
|
||||
- `filter_size`: Size of convolutional filters
|
||||
|
||||
**Advanced Features**:
|
||||
- **In silico mutagenesis**: Predict how sequence changes affect accessibility
|
||||
- **Motif enrichment**: Identify enriched TF motifs in accessible regions
|
||||
- **Batch correction**: Similar to other scvi-tools models
|
||||
- **Transfer learning**: Fine-tune on new datasets
|
||||
|
||||
**Interpretability Tools**:
|
||||
```python
|
||||
# Get importance scores for sequences
|
||||
importance = model.get_sequence_importance(region_indices=[0, 1, 2])
|
||||
|
||||
# Predict accessibility for new sequences
|
||||
predictions = model.predict_accessibility(new_sequences)
|
||||
```
|
||||
|
||||
## Model Selection for ATAC-seq
|
||||
|
||||
### PeakVI
|
||||
**Choose when**:
|
||||
- Standard scATAC-seq analysis workflow
|
||||
- Have peak count matrices (most common format)
|
||||
- Need fast, efficient batch correction
|
||||
- Want straightforward differential accessibility
|
||||
- Prioritize computational efficiency
|
||||
|
||||
**Advantages**:
|
||||
- Fast training and inference
|
||||
- Proven track record for scATAC-seq
|
||||
- Easy integration with scanpy workflow
|
||||
- Robust batch correction
|
||||
|
||||
### PoissonVI
|
||||
**Choose when**:
|
||||
- Have fragment-level count data
|
||||
- Need quantitative accessibility measures
|
||||
- Interested in subtle differences
|
||||
- Have high-coverage, high-quality data
|
||||
|
||||
**Advantages**:
|
||||
- More detailed quantitative information
|
||||
- Better for gradient changes
|
||||
- Appropriate statistical model for counts
|
||||
|
||||
### scBasset
|
||||
**Choose when**:
|
||||
- Want to incorporate DNA sequence
|
||||
- Need biological interpretation (motifs, TFs)
|
||||
- Interested in regulatory mechanisms
|
||||
- Have computational resources for CNN training
|
||||
- Want predictive power for new sequences
|
||||
|
||||
**Advantages**:
|
||||
- Sequence-based, biologically interpretable
|
||||
- Motif and TF analysis built-in
|
||||
- Predictive modeling capabilities
|
||||
- In silico perturbation experiments
|
||||
|
||||
## Workflow Example: Complete ATAC-seq Analysis
|
||||
|
||||
```python
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
|
||||
# 1. Load and preprocess ATAC-seq data
|
||||
adata = sc.read_h5ad("atac_data.h5ad")
|
||||
|
||||
# 2. Filter low-quality peaks
|
||||
sc.pp.filter_genes(adata, min_cells=10)
|
||||
|
||||
# 3. Setup and train PeakVI
|
||||
scvi.model.PEAKVI.setup_anndata(
|
||||
adata,
|
||||
batch_key="sample"
|
||||
)
|
||||
|
||||
model = scvi.model.PEAKVI(adata, n_latent=20)
|
||||
model.train(max_epochs=400)
|
||||
|
||||
# 4. Extract latent representation
|
||||
latent = model.get_latent_representation()
|
||||
adata.obsm["X_PeakVI"] = latent
|
||||
|
||||
# 5. Downstream analysis
|
||||
sc.pp.neighbors(adata, use_rep="X_PeakVI")
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata, key_added="clusters")
|
||||
|
||||
# 6. Differential accessibility
|
||||
da_results = model.differential_accessibility(
|
||||
groupby="clusters",
|
||||
group1="0",
|
||||
group2="1"
|
||||
)
|
||||
|
||||
# 7. Save model
|
||||
model.save("peakvi_model")
|
||||
```
|
||||
|
||||
## Integration with Gene Expression (RNA+ATAC)
|
||||
|
||||
For paired multimodal data (RNA+ATAC from same cells), use **MultiVI** instead:
|
||||
|
||||
```python
|
||||
# For 10x Multiome or similar paired data
|
||||
scvi.model.MULTIVI.setup_anndata(
|
||||
adata,
|
||||
batch_key="sample",
|
||||
modality_key="modality" # "RNA" or "ATAC"
|
||||
)
|
||||
|
||||
model = scvi.model.MULTIVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get joint latent space
|
||||
latent = model.get_latent_representation()
|
||||
```
|
||||
|
||||
See `models-multimodal.md` for more details on multimodal integration.
|
||||
|
||||
## Best Practices for ATAC-seq Analysis
|
||||
|
||||
1. **Quality Control**:
|
||||
- Filter cells with very low or very high peak counts
|
||||
- Remove peaks present in very few cells
|
||||
- Filter mitochondrial and sex chromosome peaks if needed
|
||||
|
||||
2. **Batch Correction**:
|
||||
- Always include `batch_key` if integrating multiple samples
|
||||
- Consider technical covariates (sequencing depth, TSS enrichment)
|
||||
|
||||
3. **Feature Selection**:
|
||||
- Unlike RNA-seq, all peaks are often used
|
||||
- Consider filtering very rare peaks for efficiency
|
||||
|
||||
4. **Latent Dimensions**:
|
||||
- Start with `n_latent=10-30` depending on dataset complexity
|
||||
- Larger values for more heterogeneous datasets
|
||||
|
||||
5. **Downstream Analysis**:
|
||||
- Use latent representations for clustering and visualization
|
||||
- Link peaks to genes for regulatory analysis
|
||||
- Perform motif enrichment on cluster-specific peaks
|
||||
|
||||
6. **Computational Considerations**:
|
||||
- ATAC-seq matrices are often very large (many peaks)
|
||||
- Consider downsampling peaks for initial exploration
|
||||
- Use GPU acceleration for large datasets
|
||||
367
skills/scvi-tools/references/models-multimodal.md
Normal file
367
skills/scvi-tools/references/models-multimodal.md
Normal file
@@ -0,0 +1,367 @@
|
||||
# Multimodal and Multi-omics Integration Models
|
||||
|
||||
This document covers models for joint analysis of multiple data modalities in scvi-tools.
|
||||
|
||||
## totalVI (Total Variational Inference)
|
||||
|
||||
**Purpose**: Joint analysis of CITE-seq data (simultaneous RNA and protein measurements from same cells).
|
||||
|
||||
**Key Features**:
|
||||
- Jointly models gene expression and protein abundance
|
||||
- Learns shared low-dimensional representations
|
||||
- Enables protein imputation from RNA data
|
||||
- Performs differential expression for both modalities
|
||||
- Handles batch effects in both RNA and protein layers
|
||||
|
||||
**When to Use**:
|
||||
- Analyzing CITE-seq or REAP-seq data
|
||||
- Joint RNA + surface protein measurements
|
||||
- Imputing missing proteins
|
||||
- Integrating protein and RNA information
|
||||
- Multi-batch CITE-seq integration
|
||||
|
||||
**Data Requirements**:
|
||||
- AnnData with gene expression in `.X` or a layer
|
||||
- Protein measurements in `.obsm["protein_expression"]`
|
||||
- Same cells measured for both modalities
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import scvi
|
||||
|
||||
# Setup data - specify both RNA and protein layers
|
||||
scvi.model.TOTALVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts", # RNA counts
|
||||
protein_expression_obsm_key="protein_expression", # Protein counts
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
# Train model
|
||||
model = scvi.model.TOTALVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get joint latent representation
|
||||
latent = model.get_latent_representation()
|
||||
|
||||
# Get normalized values for both modalities
|
||||
rna_normalized = model.get_normalized_expression()
|
||||
protein_normalized = model.get_normalized_expression(
|
||||
transform_batch="batch1",
|
||||
protein_expression=True
|
||||
)
|
||||
|
||||
# Differential expression (works for both RNA and protein)
|
||||
rna_de = model.differential_expression(groupby="cell_type")
|
||||
protein_de = model.differential_expression(
|
||||
groupby="cell_type",
|
||||
protein_expression=True
|
||||
)
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Latent space dimensionality (default: 20)
|
||||
- `n_layers_encoder`: Number of encoder layers (default: 1)
|
||||
- `n_layers_decoder`: Number of decoder layers (default: 1)
|
||||
- `protein_dispersion`: Protein dispersion handling ("protein" or "protein-batch")
|
||||
- `empirical_protein_background_prior`: Use empirical background for proteins
|
||||
|
||||
**Advanced Features**:
|
||||
|
||||
**Protein Imputation**:
|
||||
```python
|
||||
# Impute missing proteins for RNA-only cells
|
||||
# (useful for mapping RNA-seq to CITE-seq reference)
|
||||
protein_foreground = model.get_protein_foreground_probability()
|
||||
imputed_proteins = model.get_normalized_expression(
|
||||
protein_expression=True,
|
||||
n_samples=25
|
||||
)
|
||||
```
|
||||
|
||||
**Denoising**:
|
||||
```python
|
||||
# Get denoised counts for both modalities
|
||||
denoised_rna = model.get_normalized_expression(n_samples=25)
|
||||
denoised_protein = model.get_normalized_expression(
|
||||
protein_expression=True,
|
||||
n_samples=25
|
||||
)
|
||||
```
|
||||
|
||||
**Best Practices**:
|
||||
1. Use empirical protein background prior for datasets with ambient protein
|
||||
2. Consider protein-specific dispersion for heterogeneous protein data
|
||||
3. Use joint latent space for clustering (better than RNA alone)
|
||||
4. Validate protein imputation with known markers
|
||||
5. Check protein QC metrics before training
|
||||
|
||||
## MultiVI (Multi-modal Variational Inference)
|
||||
|
||||
**Purpose**: Integration of paired and unpaired multi-omic data (e.g., RNA + ATAC, paired and unpaired cells).
|
||||
|
||||
**Key Features**:
|
||||
- Handles paired data (same cells) and unpaired data (different cells)
|
||||
- Integrates multiple modalities: RNA, ATAC, proteins, etc.
|
||||
- Missing modality imputation
|
||||
- Learns shared representations across modalities
|
||||
- Flexible integration strategy
|
||||
|
||||
**When to Use**:
|
||||
- 10x Multiome data (paired RNA + ATAC)
|
||||
- Integrating separate RNA-seq and ATAC-seq experiments
|
||||
- Some cells with both modalities, some with only one
|
||||
- Cross-modality imputation tasks
|
||||
|
||||
**Data Requirements**:
|
||||
- AnnData with multiple modalities
|
||||
- Modality indicators (which measurements each cell has)
|
||||
- Can handle:
|
||||
- All cells with both modalities (fully paired)
|
||||
- Mix of paired and unpaired cells
|
||||
- Completely unpaired datasets
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Prepare data with modality information
|
||||
# adata.X should contain all features (genes + peaks)
|
||||
# adata.var["modality"] indicates "Gene" or "Peak"
|
||||
# adata.obs["modality"] indicates which modality each cell has
|
||||
|
||||
scvi.model.MULTIVI.setup_anndata(
|
||||
adata,
|
||||
batch_key="batch",
|
||||
modality_key="modality" # Column indicating cell modality
|
||||
)
|
||||
|
||||
model = scvi.model.MULTIVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get joint latent representation
|
||||
latent = model.get_latent_representation()
|
||||
|
||||
# Impute missing modalities
|
||||
# E.g., predict ATAC for RNA-only cells
|
||||
imputed_accessibility = model.get_accessibility_estimates(
|
||||
indices=rna_only_indices
|
||||
)
|
||||
|
||||
# Get normalized expression/accessibility
|
||||
rna_normalized = model.get_normalized_expression()
|
||||
atac_normalized = model.get_accessibility_estimates()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_genes`: Number of gene features
|
||||
- `n_regions`: Number of accessibility regions
|
||||
- `n_latent`: Latent dimensionality (default: 20)
|
||||
|
||||
**Integration Scenarios**:
|
||||
|
||||
**Scenario 1: Fully Paired (10x Multiome)**:
|
||||
```python
|
||||
# All cells have both RNA and ATAC
|
||||
# Single modality key: "paired"
|
||||
adata.obs["modality"] = "paired"
|
||||
```
|
||||
|
||||
**Scenario 2: Partially Paired**:
|
||||
```python
|
||||
# Some cells have both, some RNA-only, some ATAC-only
|
||||
adata.obs["modality"] = ["RNA+ATAC", "RNA", "ATAC", ...]
|
||||
```
|
||||
|
||||
**Scenario 3: Completely Unpaired**:
|
||||
```python
|
||||
# Separate RNA and ATAC experiments
|
||||
adata.obs["modality"] = ["RNA"] * n_rna + ["ATAC"] * n_atac
|
||||
```
|
||||
|
||||
**Advanced Use Cases**:
|
||||
|
||||
**Cross-Modality Prediction**:
|
||||
```python
|
||||
# Predict peaks from gene expression
|
||||
accessibility_from_rna = model.get_accessibility_estimates(
|
||||
indices=rna_only_cells
|
||||
)
|
||||
|
||||
# Predict genes from accessibility
|
||||
expression_from_atac = model.get_normalized_expression(
|
||||
indices=atac_only_cells
|
||||
)
|
||||
```
|
||||
|
||||
**Modality-Specific Analysis**:
|
||||
```python
|
||||
# Separate analysis per modality
|
||||
rna_subset = adata[adata.obs["modality"].str.contains("RNA")]
|
||||
atac_subset = adata[adata.obs["modality"].str.contains("ATAC")]
|
||||
```
|
||||
|
||||
## MrVI (Multi-resolution Variational Inference)
|
||||
|
||||
**Purpose**: Multi-sample analysis accounting for sample-specific and shared variation.
|
||||
|
||||
**Key Features**:
|
||||
- Simultaneously analyzes multiple samples/conditions
|
||||
- Decomposes variation into:
|
||||
- Shared variation (common across samples)
|
||||
- Sample-specific variation
|
||||
- Enables sample-level comparisons
|
||||
- Identifies sample-specific cell states
|
||||
|
||||
**When to Use**:
|
||||
- Comparing multiple biological samples or conditions
|
||||
- Identifying sample-specific vs. shared cell states
|
||||
- Disease vs. healthy sample comparisons
|
||||
- Understanding inter-sample heterogeneity
|
||||
- Multi-donor studies
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.MRVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts",
|
||||
batch_key="batch",
|
||||
sample_key="sample" # Critical: defines biological samples
|
||||
)
|
||||
|
||||
model = scvi.model.MRVI(adata, n_latent=10, n_latent_sample=5)
|
||||
model.train()
|
||||
|
||||
# Get representations
|
||||
shared_latent = model.get_latent_representation() # Shared across samples
|
||||
sample_specific = model.get_sample_specific_representation()
|
||||
|
||||
# Sample distance matrix
|
||||
sample_distances = model.get_sample_distances()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Dimensionality of shared latent space
|
||||
- `n_latent_sample`: Dimensionality of sample-specific space
|
||||
- `sample_key`: Column defining biological samples
|
||||
|
||||
**Analysis Workflow**:
|
||||
```python
|
||||
# 1. Identify shared cell types across samples
|
||||
sc.pp.neighbors(adata, use_rep="X_MrVI_shared")
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata, key_added="shared_clusters")
|
||||
|
||||
# 2. Analyze sample-specific variation
|
||||
sample_repr = model.get_sample_specific_representation()
|
||||
|
||||
# 3. Compare samples
|
||||
distances = model.get_sample_distances()
|
||||
|
||||
# 4. Find sample-enriched genes
|
||||
de_results = model.differential_expression(
|
||||
groupby="sample",
|
||||
group1="Disease",
|
||||
group2="Healthy"
|
||||
)
|
||||
```
|
||||
|
||||
**Use Cases**:
|
||||
- **Multi-donor studies**: Separate donor effects from cell type variation
|
||||
- **Disease studies**: Identify disease-specific vs. shared biology
|
||||
- **Time series**: Separate temporal from stable variation
|
||||
- **Batch + biology**: Disentangle technical and biological variation
|
||||
|
||||
## totalVI vs. MultiVI vs. MrVI: When to Use Which?
|
||||
|
||||
### totalVI
|
||||
**Use for**: CITE-seq (RNA + protein, same cells)
|
||||
- Paired measurements
|
||||
- Single modality type per feature
|
||||
- Focus: protein imputation, joint analysis
|
||||
|
||||
### MultiVI
|
||||
**Use for**: Multiple modalities (RNA + ATAC, etc.)
|
||||
- Paired, unpaired, or mixed
|
||||
- Different feature types
|
||||
- Focus: cross-modality integration and imputation
|
||||
|
||||
### MrVI
|
||||
**Use for**: Multi-sample RNA-seq
|
||||
- Single modality (RNA)
|
||||
- Multiple biological samples
|
||||
- Focus: sample-level variation decomposition
|
||||
|
||||
## Integration Best Practices
|
||||
|
||||
### For CITE-seq (totalVI)
|
||||
1. **Quality control proteins**: Remove low-quality antibodies
|
||||
2. **Background subtraction**: Use empirical background prior
|
||||
3. **Joint clustering**: Use joint latent space, not RNA alone
|
||||
4. **Validation**: Check known markers in both modalities
|
||||
|
||||
### For Multiome/Multi-modal (MultiVI)
|
||||
1. **Feature filtering**: Filter genes and peaks independently
|
||||
2. **Balance modalities**: Ensure reasonable representation of each
|
||||
3. **Modality weights**: Consider if one modality dominates
|
||||
4. **Imputation validation**: Validate imputed values carefully
|
||||
|
||||
### For Multi-sample (MrVI)
|
||||
1. **Sample definition**: Carefully define biological samples
|
||||
2. **Sample size**: Need sufficient cells per sample
|
||||
3. **Covariate handling**: Properly account for batch vs. sample
|
||||
4. **Interpretation**: Distinguish technical from biological variation
|
||||
|
||||
## Complete Example: CITE-seq Analysis with totalVI
|
||||
|
||||
```python
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
|
||||
# 1. Load CITE-seq data
|
||||
adata = sc.read_h5ad("cite_seq.h5ad")
|
||||
|
||||
# 2. QC and filtering
|
||||
sc.pp.filter_genes(adata, min_cells=3)
|
||||
sc.pp.highly_variable_genes(adata, n_top_genes=4000)
|
||||
|
||||
# Protein QC
|
||||
protein_counts = adata.obsm["protein_expression"]
|
||||
# Remove low-quality proteins
|
||||
|
||||
# 3. Setup totalVI
|
||||
scvi.model.TOTALVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts",
|
||||
protein_expression_obsm_key="protein_expression",
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
# 4. Train
|
||||
model = scvi.model.TOTALVI(adata, n_latent=20)
|
||||
model.train(max_epochs=400)
|
||||
|
||||
# 5. Extract joint representation
|
||||
latent = model.get_latent_representation()
|
||||
adata.obsm["X_totalVI"] = latent
|
||||
|
||||
# 6. Clustering on joint space
|
||||
sc.pp.neighbors(adata, use_rep="X_totalVI")
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata, resolution=0.5)
|
||||
|
||||
# 7. Differential expression for both modalities
|
||||
rna_de = model.differential_expression(
|
||||
groupby="leiden",
|
||||
group1="0",
|
||||
group2="1"
|
||||
)
|
||||
|
||||
protein_de = model.differential_expression(
|
||||
groupby="leiden",
|
||||
group1="0",
|
||||
group2="1",
|
||||
protein_expression=True
|
||||
)
|
||||
|
||||
# 8. Save model
|
||||
model.save("totalvi_model")
|
||||
```
|
||||
330
skills/scvi-tools/references/models-scrna-seq.md
Normal file
330
skills/scvi-tools/references/models-scrna-seq.md
Normal file
@@ -0,0 +1,330 @@
|
||||
# Single-Cell RNA-seq Models
|
||||
|
||||
This document covers core models for analyzing single-cell RNA sequencing data in scvi-tools.
|
||||
|
||||
## scVI (Single-Cell Variational Inference)
|
||||
|
||||
**Purpose**: Unsupervised analysis, dimensionality reduction, and batch correction for scRNA-seq data.
|
||||
|
||||
**Key Features**:
|
||||
- Deep generative model based on variational autoencoders (VAE)
|
||||
- Learns low-dimensional latent representations that capture biological variation
|
||||
- Automatically corrects for batch effects and technical covariates
|
||||
- Enables normalized gene expression estimation
|
||||
- Supports differential expression analysis
|
||||
|
||||
**When to Use**:
|
||||
- Initial exploration and dimensionality reduction of scRNA-seq datasets
|
||||
- Integrating multiple batches or studies
|
||||
- Generating batch-corrected expression matrices
|
||||
- Performing probabilistic differential expression analysis
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import scvi
|
||||
|
||||
# Setup data
|
||||
scvi.model.SCVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts",
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
# Train model
|
||||
model = scvi.model.SCVI(adata, n_latent=30)
|
||||
model.train()
|
||||
|
||||
# Extract results
|
||||
latent = model.get_latent_representation()
|
||||
normalized = model.get_normalized_expression()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Dimensionality of latent space (default: 10)
|
||||
- `n_layers`: Number of hidden layers (default: 1)
|
||||
- `n_hidden`: Number of nodes per hidden layer (default: 128)
|
||||
- `dropout_rate`: Dropout rate for neural networks (default: 0.1)
|
||||
- `dispersion`: Gene-specific or cell-specific dispersion ("gene" or "gene-batch")
|
||||
- `gene_likelihood`: Distribution for data ("zinb", "nb", "poisson")
|
||||
|
||||
**Outputs**:
|
||||
- `get_latent_representation()`: Batch-corrected low-dimensional embeddings
|
||||
- `get_normalized_expression()`: Denoised, normalized expression values
|
||||
- `differential_expression()`: Probabilistic DE testing between groups
|
||||
- `get_feature_correlation_matrix()`: Gene-gene correlation estimates
|
||||
|
||||
## scANVI (Single-Cell ANnotation using Variational Inference)
|
||||
|
||||
**Purpose**: Semi-supervised cell type annotation and integration using labeled and unlabeled cells.
|
||||
|
||||
**Key Features**:
|
||||
- Extends scVI with cell type labels
|
||||
- Leverages partially labeled datasets for annotation transfer
|
||||
- Performs simultaneous batch correction and cell type prediction
|
||||
- Enables query-to-reference mapping
|
||||
|
||||
**When to Use**:
|
||||
- Annotating new datasets using reference labels
|
||||
- Transfer learning from well-annotated to unlabeled datasets
|
||||
- Joint analysis of labeled and unlabeled cells
|
||||
- Building cell type classifiers with uncertainty quantification
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Option 1: Train from scratch
|
||||
scvi.model.SCANVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts",
|
||||
batch_key="batch",
|
||||
labels_key="cell_type",
|
||||
unlabeled_category="Unknown"
|
||||
)
|
||||
model = scvi.model.SCANVI(adata)
|
||||
model.train()
|
||||
|
||||
# Option 2: Initialize from pretrained scVI
|
||||
scvi_model = scvi.model.SCVI(adata)
|
||||
scvi_model.train()
|
||||
scanvi_model = scvi.model.SCANVI.from_scvi_model(
|
||||
scvi_model,
|
||||
unlabeled_category="Unknown"
|
||||
)
|
||||
scanvi_model.train()
|
||||
|
||||
# Predict cell types
|
||||
predictions = scanvi_model.predict()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `labels_key`: Column in `adata.obs` containing cell type labels
|
||||
- `unlabeled_category`: Label for cells without annotations
|
||||
- All scVI parameters are also available
|
||||
|
||||
**Outputs**:
|
||||
- `predict()`: Cell type predictions for all cells
|
||||
- `predict_proba()`: Prediction probabilities
|
||||
- `get_latent_representation()`: Cell type-aware latent space
|
||||
|
||||
## AUTOZI
|
||||
|
||||
**Purpose**: Automatic identification and modeling of zero-inflated genes in scRNA-seq data.
|
||||
|
||||
**Key Features**:
|
||||
- Distinguishes biological zeros from technical dropout
|
||||
- Learns which genes exhibit zero-inflation
|
||||
- Provides gene-specific zero-inflation probabilities
|
||||
- Improves downstream analysis by accounting for dropout
|
||||
|
||||
**When to Use**:
|
||||
- Detecting which genes are affected by technical dropout
|
||||
- Improving imputation and normalization for sparse datasets
|
||||
- Understanding the extent of zero-inflation in your data
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.AUTOZI.setup_anndata(adata, layer="counts")
|
||||
model = scvi.model.AUTOZI(adata)
|
||||
model.train()
|
||||
|
||||
# Get zero-inflation probabilities per gene
|
||||
zi_probs = model.get_alphas_betas()
|
||||
```
|
||||
|
||||
## VeloVI
|
||||
|
||||
**Purpose**: RNA velocity analysis using variational inference.
|
||||
|
||||
**Key Features**:
|
||||
- Joint modeling of spliced and unspliced RNA counts
|
||||
- Probabilistic estimation of RNA velocity
|
||||
- Accounts for technical noise and batch effects
|
||||
- Provides uncertainty quantification for velocity estimates
|
||||
|
||||
**When to Use**:
|
||||
- Inferring cellular dynamics and differentiation trajectories
|
||||
- Analyzing spliced/unspliced count data
|
||||
- RNA velocity analysis with batch correction
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import scvelo as scv
|
||||
|
||||
# Prepare velocity data
|
||||
scv.pp.filter_and_normalize(adata)
|
||||
scv.pp.moments(adata)
|
||||
|
||||
# Train VeloVI
|
||||
scvi.model.VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu")
|
||||
model = scvi.model.VELOVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get velocity estimates
|
||||
latent_time = model.get_latent_time()
|
||||
velocities = model.get_velocity()
|
||||
```
|
||||
|
||||
## contrastiveVI
|
||||
|
||||
**Purpose**: Isolating perturbation-specific variations from background biological variation.
|
||||
|
||||
**Key Features**:
|
||||
- Separates shared variation (common across conditions) from target-specific variation
|
||||
- Useful for perturbation studies (drug treatments, genetic perturbations)
|
||||
- Identifies condition-specific gene programs
|
||||
- Enables discovery of treatment-specific effects
|
||||
|
||||
**When to Use**:
|
||||
- Analyzing perturbation experiments (drug screens, CRISPR, etc.)
|
||||
- Identifying genes responding specifically to treatments
|
||||
- Separating treatment effects from background variation
|
||||
- Comparing control vs. perturbed conditions
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.CONTRASTIVEVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts",
|
||||
batch_key="batch",
|
||||
categorical_covariate_keys=["condition"] # control vs treated
|
||||
)
|
||||
|
||||
model = scvi.model.CONTRASTIVEVI(
|
||||
adata,
|
||||
n_latent=10, # Shared variation
|
||||
n_latent_target=5 # Target-specific variation
|
||||
)
|
||||
model.train()
|
||||
|
||||
# Extract representations
|
||||
shared = model.get_latent_representation(representation="shared")
|
||||
target_specific = model.get_latent_representation(representation="target")
|
||||
```
|
||||
|
||||
## CellAssign
|
||||
|
||||
**Purpose**: Marker-based cell type annotation using known marker genes.
|
||||
|
||||
**Key Features**:
|
||||
- Uses prior knowledge of marker genes for cell types
|
||||
- Probabilistic assignment of cells to types
|
||||
- Handles marker gene overlap and ambiguity
|
||||
- Provides soft assignments with uncertainty
|
||||
|
||||
**When to Use**:
|
||||
- Annotating cells using known marker genes
|
||||
- Leveraging existing biological knowledge for classification
|
||||
- Cases where marker gene lists are available but reference datasets are not
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Create marker gene matrix (cell types x genes)
|
||||
marker_gene_mat = pd.DataFrame({
|
||||
"CD4 T cells": [1, 1, 0, 0], # CD3D, CD4, CD8A, CD19
|
||||
"CD8 T cells": [1, 0, 1, 0],
|
||||
"B cells": [0, 0, 0, 1]
|
||||
}, index=["CD3D", "CD4", "CD8A", "CD19"])
|
||||
|
||||
scvi.model.CELLASSIGN.setup_anndata(adata, layer="counts")
|
||||
model = scvi.model.CELLASSIGN(adata, marker_gene_mat)
|
||||
model.train()
|
||||
|
||||
predictions = model.predict()
|
||||
```
|
||||
|
||||
## Solo (Doublet Detection)
|
||||
|
||||
**Purpose**: Identifying doublets (cells containing two or more cells) in scRNA-seq data.
|
||||
|
||||
**Key Features**:
|
||||
- Semi-supervised doublet detection using scVI embeddings
|
||||
- Simulates artificial doublets for training
|
||||
- Provides doublet probability scores
|
||||
- Can be applied to any scVI model
|
||||
|
||||
**When to Use**:
|
||||
- Quality control of scRNA-seq datasets
|
||||
- Removing doublets before downstream analysis
|
||||
- Assessing doublet rates in your data
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Train scVI model first
|
||||
scvi.model.SCVI.setup_anndata(adata, layer="counts")
|
||||
scvi_model = scvi.model.SCVI(adata)
|
||||
scvi_model.train()
|
||||
|
||||
# Train Solo for doublet detection
|
||||
solo_model = scvi.external.SOLO.from_scvi_model(scvi_model)
|
||||
solo_model.train()
|
||||
|
||||
# Predict doublets
|
||||
predictions = solo_model.predict()
|
||||
doublet_scores = predictions["doublet"]
|
||||
adata.obs["doublet_score"] = doublet_scores
|
||||
```
|
||||
|
||||
## Amortized LDA (Topic Modeling)
|
||||
|
||||
**Purpose**: Topic modeling for gene expression using Latent Dirichlet Allocation.
|
||||
|
||||
**Key Features**:
|
||||
- Discovers gene expression programs (topics)
|
||||
- Amortized variational inference for scalability
|
||||
- Each cell is a mixture of topics
|
||||
- Each topic is a distribution over genes
|
||||
|
||||
**When to Use**:
|
||||
- Discovering gene programs or expression modules
|
||||
- Understanding compositional structure of expression
|
||||
- Alternative dimensionality reduction approach
|
||||
- Interpretable decomposition of expression patterns
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.AMORTIZEDLDA.setup_anndata(adata, layer="counts")
|
||||
model = scvi.model.AMORTIZEDLDA(adata, n_topics=10)
|
||||
model.train()
|
||||
|
||||
# Get topic compositions per cell
|
||||
topic_proportions = model.get_latent_representation()
|
||||
|
||||
# Get gene loadings per topic
|
||||
topic_gene_loadings = model.get_topic_distribution()
|
||||
```
|
||||
|
||||
## Model Selection Guidelines
|
||||
|
||||
**Choose scVI when**:
|
||||
- Starting with unsupervised analysis
|
||||
- Need batch correction and integration
|
||||
- Want normalized expression and DE analysis
|
||||
|
||||
**Choose scANVI when**:
|
||||
- Have some labeled cells for training
|
||||
- Need cell type annotation
|
||||
- Want to transfer labels from reference to query
|
||||
|
||||
**Choose AUTOZI when**:
|
||||
- Concerned about technical dropout
|
||||
- Need to identify zero-inflated genes
|
||||
- Working with very sparse datasets
|
||||
|
||||
**Choose VeloVI when**:
|
||||
- Have spliced/unspliced count data
|
||||
- Interested in cellular dynamics
|
||||
- Need RNA velocity with batch correction
|
||||
|
||||
**Choose contrastiveVI when**:
|
||||
- Analyzing perturbation experiments
|
||||
- Need to separate treatment effects
|
||||
- Want to identify condition-specific programs
|
||||
|
||||
**Choose CellAssign when**:
|
||||
- Have marker gene lists available
|
||||
- Want probabilistic marker-based annotation
|
||||
- No reference dataset available
|
||||
|
||||
**Choose Solo when**:
|
||||
- Need doublet detection
|
||||
- Already using scVI for analysis
|
||||
- Want probabilistic doublet scores
|
||||
438
skills/scvi-tools/references/models-spatial.md
Normal file
438
skills/scvi-tools/references/models-spatial.md
Normal file
@@ -0,0 +1,438 @@
|
||||
# Spatial Transcriptomics Models
|
||||
|
||||
This document covers models for analyzing spatially-resolved transcriptomics data in scvi-tools.
|
||||
|
||||
## DestVI (Deconvolution of Spatial Transcriptomics using Variational Inference)
|
||||
|
||||
**Purpose**: Multi-resolution deconvolution of spatial transcriptomics using single-cell reference data.
|
||||
|
||||
**Key Features**:
|
||||
- Estimates cell type proportions at each spatial location
|
||||
- Uses single-cell RNA-seq reference for deconvolution
|
||||
- Multi-resolution approach (global and local patterns)
|
||||
- Accounts for spatial correlation
|
||||
- Provides uncertainty quantification
|
||||
|
||||
**When to Use**:
|
||||
- Deconvolving Visium or similar spatial transcriptomics
|
||||
- Have scRNA-seq reference data with cell type labels
|
||||
- Want to map cell types to spatial locations
|
||||
- Interested in spatial organization of cell types
|
||||
- Need probabilistic estimates of cell type abundance
|
||||
|
||||
**Data Requirements**:
|
||||
- **Spatial data**: Visium or similar spot-based measurements (target data)
|
||||
- **Single-cell reference**: scRNA-seq with cell type annotations
|
||||
- Both datasets should share genes
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import scvi
|
||||
|
||||
# Step 1: Train scVI on single-cell reference
|
||||
scvi.model.SCVI.setup_anndata(sc_adata, layer="counts")
|
||||
sc_model = scvi.model.SCVI(sc_adata)
|
||||
sc_model.train()
|
||||
|
||||
# Step 2: Setup spatial data
|
||||
scvi.model.DESTVI.setup_anndata(
|
||||
spatial_adata,
|
||||
layer="counts"
|
||||
)
|
||||
|
||||
# Step 3: Train DestVI using reference
|
||||
model = scvi.model.DESTVI.from_rna_model(
|
||||
spatial_adata,
|
||||
sc_model,
|
||||
cell_type_key="cell_type" # Cell type labels in reference
|
||||
)
|
||||
model.train(max_epochs=2500)
|
||||
|
||||
# Step 4: Get cell type proportions
|
||||
proportions = model.get_proportions()
|
||||
spatial_adata.obsm["proportions"] = proportions
|
||||
|
||||
# Step 5: Get cell type-specific expression
|
||||
# Expression of genes specific to each cell type at each spot
|
||||
ct_expression = model.get_scale_for_ct("T cells")
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `amortization`: Amortization strategy ("both", "latent", "proportion")
|
||||
- `n_latent`: Latent dimensionality (inherited from scVI model)
|
||||
|
||||
**Outputs**:
|
||||
- `get_proportions()`: Cell type proportions at each spot
|
||||
- `get_scale_for_ct(cell_type)`: Cell type-specific expression patterns
|
||||
- `get_gamma()`: Proportion-specific gene expression scaling
|
||||
|
||||
**Visualization**:
|
||||
```python
|
||||
import scanpy as sc
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Visualize specific cell type proportions spatially
|
||||
sc.pl.spatial(
|
||||
spatial_adata,
|
||||
color="T cells", # If proportions added to .obs
|
||||
spot_size=150
|
||||
)
|
||||
|
||||
# Or use obsm directly
|
||||
for ct in cell_types:
|
||||
plt.figure()
|
||||
sc.pl.spatial(
|
||||
spatial_adata,
|
||||
color=spatial_adata.obsm["proportions"][ct],
|
||||
title=f"{ct} proportions"
|
||||
)
|
||||
```
|
||||
|
||||
## Stereoscope
|
||||
|
||||
**Purpose**: Cell type deconvolution for spatial transcriptomics using probabilistic modeling.
|
||||
|
||||
**Key Features**:
|
||||
- Reference-based deconvolution
|
||||
- Probabilistic framework for cell type proportions
|
||||
- Works with various spatial technologies
|
||||
- Handles gene selection and normalization
|
||||
|
||||
**When to Use**:
|
||||
- Similar to DestVI but simpler approach
|
||||
- Deconvolving spatial data with reference
|
||||
- Faster alternative for basic deconvolution
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.STEREOSCOPE.setup_anndata(
|
||||
sc_adata,
|
||||
labels_key="cell_type",
|
||||
layer="counts"
|
||||
)
|
||||
|
||||
# Train on reference
|
||||
ref_model = scvi.model.STEREOSCOPE(sc_adata)
|
||||
ref_model.train()
|
||||
|
||||
# Setup spatial data
|
||||
scvi.model.STEREOSCOPE.setup_anndata(spatial_adata, layer="counts")
|
||||
|
||||
# Transfer to spatial
|
||||
spatial_model = scvi.model.STEREOSCOPE.from_reference_model(
|
||||
spatial_adata,
|
||||
ref_model
|
||||
)
|
||||
spatial_model.train()
|
||||
|
||||
# Get proportions
|
||||
proportions = spatial_model.get_proportions()
|
||||
```
|
||||
|
||||
## Tangram
|
||||
|
||||
**Purpose**: Spatial mapping and integration of single-cell data to spatial locations.
|
||||
|
||||
**Key Features**:
|
||||
- Maps single cells to spatial coordinates
|
||||
- Learns optimal transport between single-cell and spatial data
|
||||
- Gene imputation at spatial locations
|
||||
- Cell type mapping
|
||||
|
||||
**When to Use**:
|
||||
- Mapping cells from scRNA-seq to spatial locations
|
||||
- Imputing unmeasured genes in spatial data
|
||||
- Understanding spatial organization at single-cell resolution
|
||||
- Integrating scRNA-seq and spatial transcriptomics
|
||||
|
||||
**Data Requirements**:
|
||||
- Single-cell RNA-seq data with annotations
|
||||
- Spatial transcriptomics data
|
||||
- Shared genes between modalities
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import tangram as tg
|
||||
|
||||
# Map cells to spatial locations
|
||||
ad_map = tg.map_cells_to_space(
|
||||
adata_sc=sc_adata,
|
||||
adata_sp=spatial_adata,
|
||||
mode="cells", # or "clusters" for cell type mapping
|
||||
density_prior="rna_count_based"
|
||||
)
|
||||
|
||||
# Get mapping matrix (cells × spots)
|
||||
mapping = ad_map.X
|
||||
|
||||
# Project cell annotations to space
|
||||
tg.project_cell_annotations(
|
||||
ad_map,
|
||||
spatial_adata,
|
||||
annotation="cell_type"
|
||||
)
|
||||
|
||||
# Impute genes in spatial data
|
||||
genes_to_impute = ["CD3D", "CD8A", "CD4"]
|
||||
tg.project_genes(ad_map, spatial_adata, genes=genes_to_impute)
|
||||
```
|
||||
|
||||
**Visualization**:
|
||||
```python
|
||||
# Visualize cell type mapping
|
||||
sc.pl.spatial(
|
||||
spatial_adata,
|
||||
color="cell_type_projected",
|
||||
spot_size=100
|
||||
)
|
||||
```
|
||||
|
||||
## gimVI (Gaussian Identity Multivi for Imputation)
|
||||
|
||||
**Purpose**: Cross-modality imputation between spatial and single-cell data.
|
||||
|
||||
**Key Features**:
|
||||
- Joint model of spatial and single-cell data
|
||||
- Imputes missing genes in spatial data
|
||||
- Enables cross-dataset queries
|
||||
- Learns shared representations
|
||||
|
||||
**When to Use**:
|
||||
- Imputing genes not measured in spatial data
|
||||
- Joint analysis of spatial and single-cell datasets
|
||||
- Mapping between modalities
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Combine datasets
|
||||
combined_adata = sc.concat([sc_adata, spatial_adata])
|
||||
|
||||
scvi.model.GIMVI.setup_anndata(
|
||||
combined_adata,
|
||||
layer="counts"
|
||||
)
|
||||
|
||||
model = scvi.model.GIMVI(combined_adata)
|
||||
model.train()
|
||||
|
||||
# Impute genes in spatial data
|
||||
imputed = model.get_imputed_values(spatial_indices)
|
||||
```
|
||||
|
||||
## scVIVA (Variation in Variational Autoencoders for Spatial)
|
||||
|
||||
**Purpose**: Analyzing cell-environment relationships in spatial data.
|
||||
|
||||
**Key Features**:
|
||||
- Models cellular neighborhoods and environments
|
||||
- Identifies environment-associated gene expression
|
||||
- Accounts for spatial correlation structure
|
||||
- Cell-cell interaction analysis
|
||||
|
||||
**When to Use**:
|
||||
- Understanding how spatial context affects cells
|
||||
- Identifying niche-specific gene programs
|
||||
- Cell-cell interaction studies
|
||||
- Microenvironment analysis
|
||||
|
||||
**Data Requirements**:
|
||||
- Spatial transcriptomics with coordinates
|
||||
- Cell type annotations (optional)
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.SCVIVA.setup_anndata(
|
||||
spatial_adata,
|
||||
layer="counts",
|
||||
spatial_key="spatial" # Coordinates in .obsm
|
||||
)
|
||||
|
||||
model = scvi.model.SCVIVA(spatial_adata)
|
||||
model.train()
|
||||
|
||||
# Get environment representations
|
||||
env_latent = model.get_environment_representation()
|
||||
|
||||
# Identify environment-associated genes
|
||||
env_genes = model.get_environment_specific_genes()
|
||||
```
|
||||
|
||||
## ResolVI
|
||||
|
||||
**Purpose**: Addressing spatial transcriptomics noise through resolution-aware modeling.
|
||||
|
||||
**Key Features**:
|
||||
- Accounts for spatial resolution effects
|
||||
- Denoises spatial data
|
||||
- Multi-scale analysis
|
||||
- Improves downstream analysis quality
|
||||
|
||||
**When to Use**:
|
||||
- Noisy spatial data
|
||||
- Multiple spatial resolutions
|
||||
- Need denoising before analysis
|
||||
- Improving data quality
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.RESOLVI.setup_anndata(
|
||||
spatial_adata,
|
||||
layer="counts",
|
||||
spatial_key="spatial"
|
||||
)
|
||||
|
||||
model = scvi.model.RESOLVI(spatial_adata)
|
||||
model.train()
|
||||
|
||||
# Get denoised expression
|
||||
denoised = model.get_denoised_expression()
|
||||
```
|
||||
|
||||
## Model Selection for Spatial Transcriptomics
|
||||
|
||||
### DestVI
|
||||
**Choose when**:
|
||||
- Need detailed deconvolution with reference
|
||||
- Have high-quality scRNA-seq reference
|
||||
- Want multi-resolution analysis
|
||||
- Need uncertainty quantification
|
||||
|
||||
**Best for**: Visium, spot-based technologies
|
||||
|
||||
### Stereoscope
|
||||
**Choose when**:
|
||||
- Need simpler, faster deconvolution
|
||||
- Basic cell type proportion estimates
|
||||
- Limited computational resources
|
||||
|
||||
**Best for**: Quick deconvolution tasks
|
||||
|
||||
### Tangram
|
||||
**Choose when**:
|
||||
- Want single-cell resolution mapping
|
||||
- Need to impute many genes
|
||||
- Interested in cell positioning
|
||||
- Optimal transport approach preferred
|
||||
|
||||
**Best for**: Detailed spatial mapping
|
||||
|
||||
### gimVI
|
||||
**Choose when**:
|
||||
- Need bidirectional imputation
|
||||
- Joint modeling of spatial and single-cell
|
||||
- Cross-dataset queries
|
||||
|
||||
**Best for**: Integration and imputation
|
||||
|
||||
### scVIVA
|
||||
**Choose when**:
|
||||
- Interested in cellular environments
|
||||
- Cell-cell interaction analysis
|
||||
- Neighborhood effects
|
||||
|
||||
**Best for**: Microenvironment studies
|
||||
|
||||
### ResolVI
|
||||
**Choose when**:
|
||||
- Data quality is a concern
|
||||
- Need denoising
|
||||
- Multi-scale analysis
|
||||
|
||||
**Best for**: Noisy data preprocessing
|
||||
|
||||
## Complete Workflow: Spatial Deconvolution with DestVI
|
||||
|
||||
```python
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
import squidpy as sq
|
||||
|
||||
# ===== Part 1: Prepare single-cell reference =====
|
||||
# Load and process scRNA-seq reference
|
||||
sc_adata = sc.read_h5ad("reference_scrna.h5ad")
|
||||
|
||||
# QC and filtering
|
||||
sc.pp.filter_genes(sc_adata, min_cells=10)
|
||||
sc.pp.highly_variable_genes(sc_adata, n_top_genes=4000)
|
||||
|
||||
# Train scVI on reference
|
||||
scvi.model.SCVI.setup_anndata(
|
||||
sc_adata,
|
||||
layer="counts",
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
sc_model = scvi.model.SCVI(sc_adata)
|
||||
sc_model.train(max_epochs=400)
|
||||
|
||||
# ===== Part 2: Load spatial data =====
|
||||
spatial_adata = sc.read_visium("path/to/visium")
|
||||
spatial_adata.var_names_make_unique()
|
||||
|
||||
# QC spatial data
|
||||
sc.pp.filter_genes(spatial_adata, min_cells=10)
|
||||
|
||||
# ===== Part 3: Run DestVI =====
|
||||
scvi.model.DESTVI.setup_anndata(
|
||||
spatial_adata,
|
||||
layer="counts"
|
||||
)
|
||||
|
||||
destvi_model = scvi.model.DESTVI.from_rna_model(
|
||||
spatial_adata,
|
||||
sc_model,
|
||||
cell_type_key="cell_type"
|
||||
)
|
||||
|
||||
destvi_model.train(max_epochs=2500)
|
||||
|
||||
# ===== Part 4: Extract results =====
|
||||
# Get proportions
|
||||
proportions = destvi_model.get_proportions()
|
||||
spatial_adata.obsm["proportions"] = proportions
|
||||
|
||||
# Add proportions to .obs for easy plotting
|
||||
for i, ct in enumerate(sc_model.adata.obs["cell_type"].cat.categories):
|
||||
spatial_adata.obs[f"prop_{ct}"] = proportions[:, i]
|
||||
|
||||
# ===== Part 5: Visualization =====
|
||||
# Plot specific cell types
|
||||
cell_types = ["T cells", "B cells", "Macrophages"]
|
||||
|
||||
for ct in cell_types:
|
||||
sc.pl.spatial(
|
||||
spatial_adata,
|
||||
color=f"prop_{ct}",
|
||||
title=f"{ct} proportions",
|
||||
spot_size=150,
|
||||
cmap="viridis"
|
||||
)
|
||||
|
||||
# ===== Part 6: Spatial analysis =====
|
||||
# Compute spatial neighbors
|
||||
sq.gr.spatial_neighbors(spatial_adata)
|
||||
|
||||
# Spatial autocorrelation of cell types
|
||||
for ct in cell_types:
|
||||
sq.gr.spatial_autocorr(
|
||||
spatial_adata,
|
||||
attr="obs",
|
||||
mode="moran",
|
||||
genes=[f"prop_{ct}"]
|
||||
)
|
||||
|
||||
# ===== Part 7: Save results =====
|
||||
destvi_model.save("destvi_model")
|
||||
spatial_adata.write("spatial_deconvolved.h5ad")
|
||||
```
|
||||
|
||||
## Best Practices for Spatial Analysis
|
||||
|
||||
1. **Reference quality**: Use high-quality, well-annotated scRNA-seq reference
|
||||
2. **Gene overlap**: Ensure sufficient shared genes between reference and spatial
|
||||
3. **Spatial coordinates**: Properly register spatial coordinates in `.obsm["spatial"]`
|
||||
4. **Validation**: Use known marker genes to validate deconvolution
|
||||
5. **Visualization**: Always visualize results spatially to check biological plausibility
|
||||
6. **Cell type granularity**: Consider appropriate cell type resolution
|
||||
7. **Computational resources**: Spatial models can be memory-intensive
|
||||
8. **Quality control**: Filter low-quality spots before analysis
|
||||
408
skills/scvi-tools/references/models-specialized.md
Normal file
408
skills/scvi-tools/references/models-specialized.md
Normal file
@@ -0,0 +1,408 @@
|
||||
# Specialized Modality Models
|
||||
|
||||
This document covers models for specialized single-cell data modalities in scvi-tools.
|
||||
|
||||
## MethylVI / MethylANVI (Methylation Analysis)
|
||||
|
||||
**Purpose**: Analysis of single-cell bisulfite sequencing (scBS-seq) data for DNA methylation.
|
||||
|
||||
**Key Features**:
|
||||
- Models methylation patterns at single-cell resolution
|
||||
- Handles sparsity in methylation data
|
||||
- Batch correction for methylation experiments
|
||||
- Label transfer (MethylANVI) for cell type annotation
|
||||
|
||||
**When to Use**:
|
||||
- Analyzing scBS-seq or similar methylation data
|
||||
- Studying DNA methylation patterns across cell types
|
||||
- Integrating methylation data across batches
|
||||
- Cell type annotation based on methylation profiles
|
||||
|
||||
**Data Requirements**:
|
||||
- Methylation count matrices (methylated vs. total reads per CpG site)
|
||||
- Format: Cells × CpG sites with methylation ratios or counts
|
||||
|
||||
### MethylVI (Unsupervised)
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
import scvi
|
||||
|
||||
# Setup methylation data
|
||||
scvi.model.METHYLVI.setup_anndata(
|
||||
adata,
|
||||
layer="methylation_counts", # Methylation data
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
model = scvi.model.METHYLVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get latent representation
|
||||
latent = model.get_latent_representation()
|
||||
|
||||
# Get normalized methylation values
|
||||
normalized_meth = model.get_normalized_methylation()
|
||||
```
|
||||
|
||||
### MethylANVI (Semi-supervised with cell types)
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Setup with cell type labels
|
||||
scvi.model.METHYLANVI.setup_anndata(
|
||||
adata,
|
||||
layer="methylation_counts",
|
||||
batch_key="batch",
|
||||
labels_key="cell_type",
|
||||
unlabeled_category="Unknown"
|
||||
)
|
||||
|
||||
model = scvi.model.METHYLANVI(adata)
|
||||
model.train()
|
||||
|
||||
# Predict cell types
|
||||
predictions = model.predict()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Latent dimensionality
|
||||
- `region_factors`: Model region-specific effects
|
||||
|
||||
**Use Cases**:
|
||||
- Epigenetic heterogeneity analysis
|
||||
- Cell type identification via methylation
|
||||
- Integration with gene expression data (separate analysis)
|
||||
- Differential methylation analysis
|
||||
|
||||
## CytoVI (Flow and Mass Cytometry)
|
||||
|
||||
**Purpose**: Batch correction and integration of flow cytometry and mass cytometry (CyTOF) data.
|
||||
|
||||
**Key Features**:
|
||||
- Handles antibody-based protein measurements
|
||||
- Corrects batch effects in cytometry data
|
||||
- Enables integration across experiments
|
||||
- Designed for high-dimensional protein panels
|
||||
|
||||
**When to Use**:
|
||||
- Analyzing flow cytometry or CyTOF data
|
||||
- Integrating cytometry experiments across batches
|
||||
- Batch correction for protein panels
|
||||
- Cross-study cytometry integration
|
||||
|
||||
**Data Requirements**:
|
||||
- Protein expression matrix (cells × proteins)
|
||||
- Flow cytometry or CyTOF measurements
|
||||
- Batch/experiment annotations
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.CYTOVI.setup_anndata(
|
||||
adata,
|
||||
protein_expression_obsm_key="protein_expression",
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
model = scvi.model.CYTOVI(adata)
|
||||
model.train()
|
||||
|
||||
# Get batch-corrected representation
|
||||
latent = model.get_latent_representation()
|
||||
|
||||
# Get normalized protein values
|
||||
normalized = model.get_normalized_expression()
|
||||
```
|
||||
|
||||
**Key Parameters**:
|
||||
- `n_latent`: Latent space dimensionality
|
||||
- `n_layers`: Network depth
|
||||
|
||||
**Typical Workflow**:
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# 1. Load cytometry data
|
||||
adata = sc.read_h5ad("cytof_data.h5ad")
|
||||
|
||||
# 2. Train CytoVI
|
||||
scvi.model.CYTOVI.setup_anndata(
|
||||
adata,
|
||||
protein_expression_obsm_key="protein",
|
||||
batch_key="experiment"
|
||||
)
|
||||
model = scvi.model.CYTOVI(adata)
|
||||
model.train()
|
||||
|
||||
# 3. Get batch-corrected values
|
||||
latent = model.get_latent_representation()
|
||||
adata.obsm["X_CytoVI"] = latent
|
||||
|
||||
# 4. Downstream analysis
|
||||
sc.pp.neighbors(adata, use_rep="X_CytoVI")
|
||||
sc.tl.umap(adata)
|
||||
sc.tl.leiden(adata)
|
||||
|
||||
# 5. Visualize batch correction
|
||||
sc.pl.umap(adata, color=["batch", "leiden"])
|
||||
```
|
||||
|
||||
## SysVI (Systems-level Integration)
|
||||
|
||||
**Purpose**: Batch effect correction with emphasis on preserving biological variation.
|
||||
|
||||
**Key Features**:
|
||||
- Specialized batch integration approach
|
||||
- Preserves biological signals while removing technical effects
|
||||
- Designed for large-scale integration studies
|
||||
|
||||
**When to Use**:
|
||||
- Large-scale multi-batch integration
|
||||
- Need to preserve subtle biological variation
|
||||
- Systems-level analysis across many studies
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
scvi.model.SYSVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts",
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
model = scvi.model.SYSVI(adata)
|
||||
model.train()
|
||||
|
||||
latent = model.get_latent_representation()
|
||||
```
|
||||
|
||||
## Decipher (Trajectory Inference)
|
||||
|
||||
**Purpose**: Trajectory inference and pseudotime analysis for single-cell data.
|
||||
|
||||
**Key Features**:
|
||||
- Learns cellular trajectories and differentiation paths
|
||||
- Pseudotime estimation
|
||||
- Accounts for uncertainty in trajectory structure
|
||||
- Compatible with scVI embeddings
|
||||
|
||||
**When to Use**:
|
||||
- Studying cellular differentiation
|
||||
- Time-course or developmental datasets
|
||||
- Understanding cell state transitions
|
||||
- Identifying branching points in development
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Typically used after scVI for embeddings
|
||||
scvi_model = scvi.model.SCVI(adata)
|
||||
scvi_model.train()
|
||||
|
||||
# Decipher for trajectory
|
||||
scvi.model.DECIPHER.setup_anndata(adata)
|
||||
decipher_model = scvi.model.DECIPHER(adata, scvi_model)
|
||||
decipher_model.train()
|
||||
|
||||
# Get pseudotime
|
||||
pseudotime = decipher_model.get_pseudotime()
|
||||
adata.obs["pseudotime"] = pseudotime
|
||||
```
|
||||
|
||||
**Visualization**:
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# Plot pseudotime on UMAP
|
||||
sc.pl.umap(adata, color="pseudotime", cmap="viridis")
|
||||
|
||||
# Gene expression along pseudotime
|
||||
sc.pl.scatter(adata, x="pseudotime", y="gene_of_interest")
|
||||
```
|
||||
|
||||
## peRegLM (Peak Regulatory Linear Model)
|
||||
|
||||
**Purpose**: Linking chromatin accessibility to gene expression for regulatory analysis.
|
||||
|
||||
**Key Features**:
|
||||
- Links ATAC-seq peaks to gene expression
|
||||
- Identifies regulatory relationships
|
||||
- Works with paired multiome data
|
||||
|
||||
**When to Use**:
|
||||
- Multiome data (RNA + ATAC from same cells)
|
||||
- Understanding gene regulation
|
||||
- Linking peaks to target genes
|
||||
- Regulatory network construction
|
||||
|
||||
**Basic Usage**:
|
||||
```python
|
||||
# Requires paired RNA + ATAC data
|
||||
scvi.model.PEREGLM.setup_anndata(
|
||||
multiome_adata,
|
||||
rna_layer="counts",
|
||||
atac_layer="atac_counts"
|
||||
)
|
||||
|
||||
model = scvi.model.PEREGLM(multiome_adata)
|
||||
model.train()
|
||||
|
||||
# Get peak-gene links
|
||||
peak_gene_links = model.get_regulatory_links()
|
||||
```
|
||||
|
||||
## Model-Specific Best Practices
|
||||
|
||||
### MethylVI/MethylANVI
|
||||
1. **Sparsity**: Methylation data is inherently sparse; model accounts for this
|
||||
2. **CpG selection**: Filter CpGs with very low coverage
|
||||
3. **Biological interpretation**: Consider genomic context (promoters, enhancers)
|
||||
4. **Integration**: For multi-omics, analyze separately then integrate results
|
||||
|
||||
### CytoVI
|
||||
1. **Protein QC**: Remove low-quality or uninformative proteins
|
||||
2. **Compensation**: Ensure proper spectral compensation before analysis
|
||||
3. **Batch design**: Include biological and technical replicates
|
||||
4. **Controls**: Use control samples to validate batch correction
|
||||
|
||||
### SysVI
|
||||
1. **Sample size**: Designed for large-scale integration
|
||||
2. **Batch definition**: Carefully define batch structure
|
||||
3. **Biological validation**: Verify biological signals preserved
|
||||
|
||||
### Decipher
|
||||
1. **Start point**: Define trajectory start cells if known
|
||||
2. **Branching**: Specify expected number of branches
|
||||
3. **Validation**: Use known markers to validate pseudotime
|
||||
4. **Integration**: Works well with scVI embeddings
|
||||
|
||||
## Integration with Other Models
|
||||
|
||||
Many specialized models work well in combination:
|
||||
|
||||
**Methylation + Expression**:
|
||||
```python
|
||||
# Analyze separately, then integrate
|
||||
methylvi_model = scvi.model.METHYLVI(meth_adata)
|
||||
scvi_model = scvi.model.SCVI(rna_adata)
|
||||
|
||||
# Integrate results at analysis level
|
||||
# E.g., correlate methylation and expression patterns
|
||||
```
|
||||
|
||||
**Cytometry + CITE-seq**:
|
||||
```python
|
||||
# CytoVI for flow/CyTOF
|
||||
cyto_model = scvi.model.CYTOVI(cyto_adata)
|
||||
|
||||
# totalVI for CITE-seq
|
||||
cite_model = scvi.model.TOTALVI(cite_adata)
|
||||
|
||||
# Compare protein measurements across platforms
|
||||
```
|
||||
|
||||
**ATAC + RNA (Multiome)**:
|
||||
```python
|
||||
# MultiVI for joint analysis
|
||||
multivi_model = scvi.model.MULTIVI(multiome_adata)
|
||||
|
||||
# peRegLM for regulatory links
|
||||
pereglm_model = scvi.model.PEREGLM(multiome_adata)
|
||||
```
|
||||
|
||||
## Choosing Specialized Models
|
||||
|
||||
### Decision Tree
|
||||
|
||||
1. **What data modality?**
|
||||
- Methylation → MethylVI/MethylANVI
|
||||
- Flow/CyTOF → CytoVI
|
||||
- Trajectory → Decipher
|
||||
- Multi-batch integration → SysVI
|
||||
- Regulatory links → peRegLM
|
||||
|
||||
2. **Do you have labels?**
|
||||
- Yes → MethylANVI (methylation)
|
||||
- No → MethylVI (methylation)
|
||||
|
||||
3. **What's your main goal?**
|
||||
- Batch correction → CytoVI, SysVI
|
||||
- Trajectory/pseudotime → Decipher
|
||||
- Peak-gene links → peRegLM
|
||||
- Methylation patterns → MethylVI/ANVI
|
||||
|
||||
## Example: Complete Methylation Analysis
|
||||
|
||||
```python
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
|
||||
# 1. Load methylation data
|
||||
meth_adata = sc.read_h5ad("methylation_data.h5ad")
|
||||
|
||||
# 2. QC: filter low-coverage CpG sites
|
||||
sc.pp.filter_genes(meth_adata, min_cells=10)
|
||||
|
||||
# 3. Setup MethylVI
|
||||
scvi.model.METHYLVI.setup_anndata(
|
||||
meth_adata,
|
||||
layer="methylation",
|
||||
batch_key="batch"
|
||||
)
|
||||
|
||||
# 4. Train model
|
||||
model = scvi.model.METHYLVI(meth_adata, n_latent=15)
|
||||
model.train(max_epochs=400)
|
||||
|
||||
# 5. Get latent representation
|
||||
latent = model.get_latent_representation()
|
||||
meth_adata.obsm["X_MethylVI"] = latent
|
||||
|
||||
# 6. Clustering
|
||||
sc.pp.neighbors(meth_adata, use_rep="X_MethylVI")
|
||||
sc.tl.umap(meth_adata)
|
||||
sc.tl.leiden(meth_adata)
|
||||
|
||||
# 7. Differential methylation
|
||||
dm_results = model.differential_methylation(
|
||||
groupby="leiden",
|
||||
group1="0",
|
||||
group2="1"
|
||||
)
|
||||
|
||||
# 8. Save
|
||||
model.save("methylvi_model")
|
||||
meth_adata.write("methylation_analyzed.h5ad")
|
||||
```
|
||||
|
||||
## External Tools Integration
|
||||
|
||||
Some specialized models are available as external packages:
|
||||
|
||||
**SOLO** (doublet detection):
|
||||
```python
|
||||
from scvi.external import SOLO
|
||||
|
||||
solo = SOLO.from_scvi_model(scvi_model)
|
||||
solo.train()
|
||||
doublets = solo.predict()
|
||||
```
|
||||
|
||||
**scArches** (reference mapping):
|
||||
```python
|
||||
from scvi.external import SCARCHES
|
||||
|
||||
# For transfer learning and query-to-reference mapping
|
||||
```
|
||||
|
||||
These external tools extend scvi-tools functionality for specific use cases.
|
||||
|
||||
## Summary Table
|
||||
|
||||
| Model | Data Type | Primary Use | Supervised? |
|
||||
|-------|-----------|-------------|-------------|
|
||||
| MethylVI | Methylation | Unsupervised analysis | No |
|
||||
| MethylANVI | Methylation | Cell type annotation | Semi |
|
||||
| CytoVI | Cytometry | Batch correction | No |
|
||||
| SysVI | scRNA-seq | Large-scale integration | No |
|
||||
| Decipher | scRNA-seq | Trajectory inference | No |
|
||||
| peRegLM | Multiome | Peak-gene links | No |
|
||||
| SOLO | scRNA-seq | Doublet detection | Semi |
|
||||
438
skills/scvi-tools/references/theoretical-foundations.md
Normal file
438
skills/scvi-tools/references/theoretical-foundations.md
Normal file
@@ -0,0 +1,438 @@
|
||||
# Theoretical Foundations of scvi-tools
|
||||
|
||||
This document explains the mathematical and statistical principles underlying scvi-tools.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### Variational Inference
|
||||
|
||||
**What is it?**
|
||||
Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x) - the probability of latent variables z given observed data x.
|
||||
|
||||
**Why use it?**
|
||||
- Exact inference is computationally intractable for complex models
|
||||
- Scales to large datasets (millions of cells)
|
||||
- Provides uncertainty quantification
|
||||
- Enables Bayesian reasoning about cell states
|
||||
|
||||
**How does it work?**
|
||||
1. Define a simpler approximate distribution q(z|x) with learnable parameters
|
||||
2. Minimize the KL divergence between q(z|x) and true posterior p(z|x)
|
||||
3. Equivalent to maximizing the Evidence Lower Bound (ELBO)
|
||||
|
||||
**ELBO Objective**:
|
||||
```
|
||||
ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
|
||||
↑ ↑
|
||||
Reconstruction Regularization
|
||||
```
|
||||
|
||||
- **Reconstruction term**: Model should generate data similar to observed
|
||||
- **Regularization term**: Latent representation should match prior
|
||||
|
||||
### Variational Autoencoders (VAEs)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
x (observed data)
|
||||
↓
|
||||
[Encoder Neural Network]
|
||||
↓
|
||||
z (latent representation)
|
||||
↓
|
||||
[Decoder Neural Network]
|
||||
↓
|
||||
x̂ (reconstructed data)
|
||||
```
|
||||
|
||||
**Encoder**: Maps cells (x) to latent space (z)
|
||||
- Learns q(z|x), the approximate posterior
|
||||
- Parameterized by neural network with learnable weights
|
||||
- Outputs mean and variance of latent distribution
|
||||
|
||||
**Decoder**: Maps latent space (z) back to gene space
|
||||
- Learns p(x|z), the likelihood
|
||||
- Generates gene expression from latent representation
|
||||
- Models count distributions (Negative Binomial, Zero-Inflated NB)
|
||||
|
||||
**Reparameterization Trick**:
|
||||
- Allows backpropagation through stochastic sampling
|
||||
- Sample z = μ + σ ⊙ ε, where ε ~ N(0,1)
|
||||
- Enables end-to-end training with gradient descent
|
||||
|
||||
### Amortized Inference
|
||||
|
||||
**Concept**: Share encoder parameters across all cells.
|
||||
|
||||
**Traditional inference**: Learn separate latent variables for each cell
|
||||
- n_cells × n_latent parameters
|
||||
- Doesn't scale to large datasets
|
||||
|
||||
**Amortized inference**: Learn single encoder for all cells
|
||||
- Fixed number of parameters regardless of cell count
|
||||
- Enables fast inference on new cells
|
||||
- Transfers learned patterns across dataset
|
||||
|
||||
**Benefits**:
|
||||
- Scalable to millions of cells
|
||||
- Fast inference on query data
|
||||
- Leverages shared structure across cells
|
||||
- Enables few-shot learning
|
||||
|
||||
## Statistical Modeling
|
||||
|
||||
### Count Data Distributions
|
||||
|
||||
Single-cell data are counts (integer-valued), requiring appropriate distributions.
|
||||
|
||||
#### Negative Binomial (NB)
|
||||
```
|
||||
x ~ NB(μ, θ)
|
||||
```
|
||||
- **μ (mean)**: Expected expression level
|
||||
- **θ (dispersion)**: Controls variance
|
||||
- **Variance**: Var(x) = μ + μ²/θ
|
||||
|
||||
**When to use**: Gene expression without zero-inflation
|
||||
- More flexible than Poisson (allows overdispersion)
|
||||
- Models technical and biological variation
|
||||
|
||||
#### Zero-Inflated Negative Binomial (ZINB)
|
||||
```
|
||||
x ~ π·δ₀ + (1-π)·NB(μ, θ)
|
||||
```
|
||||
- **π (dropout rate)**: Probability of technical zero
|
||||
- **δ₀**: Point mass at zero
|
||||
- **NB(μ, θ)**: Expression when not dropped out
|
||||
|
||||
**When to use**: Sparse scRNA-seq data
|
||||
- Models technical dropout separately from biological zeros
|
||||
- Better fit for highly sparse data (e.g., 10x data)
|
||||
|
||||
#### Poisson
|
||||
```
|
||||
x ~ Poisson(μ)
|
||||
```
|
||||
- Simplest count distribution
|
||||
- Mean equals variance: Var(x) = μ
|
||||
|
||||
**When to use**: Less common; ATAC-seq fragment counts
|
||||
- More restrictive than NB
|
||||
- Faster computation
|
||||
|
||||
### Batch Correction Framework
|
||||
|
||||
**Problem**: Technical variation confounds biological signal
|
||||
- Different sequencing runs, protocols, labs
|
||||
- Must remove technical effects while preserving biology
|
||||
|
||||
**scvi-tools approach**:
|
||||
1. Encode batch as categorical variable s
|
||||
2. Include s in generative model
|
||||
3. Latent space z is batch-invariant
|
||||
4. Decoder conditions on s for batch-specific effects
|
||||
|
||||
**Mathematical formulation**:
|
||||
```
|
||||
Encoder: q(z|x, s) - batch-aware encoding
|
||||
Latent: z - batch-corrected representation
|
||||
Decoder: p(x|z, s) - batch-specific decoding
|
||||
```
|
||||
|
||||
**Key insight**: Batch info flows through decoder, not latent space
|
||||
- z captures biological variation
|
||||
- s explains technical variation
|
||||
- Separable biology and batch effects
|
||||
|
||||
### Deep Generative Modeling
|
||||
|
||||
**Generative model**: Learns p(x), the data distribution
|
||||
|
||||
**Process**:
|
||||
1. Sample latent variable: z ~ p(z) = N(0, I)
|
||||
2. Generate expression: x ~ p(x|z)
|
||||
3. Joint distribution: p(x, z) = p(x|z)p(z)
|
||||
|
||||
**Benefits**:
|
||||
- Generate synthetic cells
|
||||
- Impute missing values
|
||||
- Quantify uncertainty
|
||||
- Perform counterfactual predictions
|
||||
|
||||
**Inference network**: Inverts generative process
|
||||
- Given x, infer z
|
||||
- q(z|x) approximates true posterior p(z|x)
|
||||
|
||||
## Model Architecture Details
|
||||
|
||||
### scVI Architecture
|
||||
|
||||
**Input**: Gene expression counts x ∈ ℕ^G (G genes)
|
||||
|
||||
**Encoder**:
|
||||
```
|
||||
h = ReLU(W₁·x + b₁)
|
||||
μ_z = W₂·h + b₂
|
||||
log σ²_z = W₃·h + b₃
|
||||
z ~ N(μ_z, σ²_z)
|
||||
```
|
||||
|
||||
**Latent space**: z ∈ ℝ^d (typically d=10-30)
|
||||
|
||||
**Decoder**:
|
||||
```
|
||||
h = ReLU(W₄·z + b₄)
|
||||
μ = softmax(W₅·h + b₅) · library_size
|
||||
θ = exp(W₆·h + b₆)
|
||||
π = sigmoid(W₇·h + b₇) # for ZINB
|
||||
x ~ ZINB(μ, θ, π)
|
||||
```
|
||||
|
||||
**Loss function (ELBO)**:
|
||||
```
|
||||
L = E_q[log p(x|z)] - KL(q(z|x) || N(0,I))
|
||||
```
|
||||
|
||||
### Handling Covariates
|
||||
|
||||
**Categorical covariates** (batch, donor, etc.):
|
||||
- One-hot encoded: s ∈ {0,1}^K
|
||||
- Concatenate with latent: [z, s]
|
||||
- Or use conditional layers
|
||||
|
||||
**Continuous covariates** (library size, percent_mito):
|
||||
- Standardize to zero mean, unit variance
|
||||
- Include in encoder and/or decoder
|
||||
|
||||
**Covariate injection strategies**:
|
||||
- **Concatenation**: [z, s] fed to decoder
|
||||
- **Deep injection**: s added at multiple layers
|
||||
- **Conditional batch norm**: Batch-specific normalization
|
||||
|
||||
## Advanced Theoretical Concepts
|
||||
|
||||
### Transfer Learning (scArches)
|
||||
|
||||
**Concept**: Use pretrained model as initialization for new data
|
||||
|
||||
**Process**:
|
||||
1. Train reference model on large dataset
|
||||
2. Freeze encoder parameters
|
||||
3. Fine-tune decoder on query data
|
||||
4. Or fine-tune all with lower learning rate
|
||||
|
||||
**Why it works**:
|
||||
- Encoder learns general cellular representations
|
||||
- Decoder adapts to query-specific characteristics
|
||||
- Prevents catastrophic forgetting
|
||||
|
||||
**Applications**:
|
||||
- Query-to-reference mapping
|
||||
- Few-shot learning for rare cell types
|
||||
- Rapid analysis of new datasets
|
||||
|
||||
### Multi-Resolution Modeling (MrVI)
|
||||
|
||||
**Idea**: Separate shared and sample-specific variation
|
||||
|
||||
**Latent space decomposition**:
|
||||
```
|
||||
z = z_shared + z_sample
|
||||
```
|
||||
- **z_shared**: Common across samples
|
||||
- **z_sample**: Sample-specific effects
|
||||
|
||||
**Hierarchical structure**:
|
||||
```
|
||||
Sample level: ρ_s ~ N(0, I)
|
||||
Cell level: z_i ~ N(ρ_{s(i)}, σ²)
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Disentangle biological sources of variation
|
||||
- Compare samples at different resolutions
|
||||
- Identify sample-specific cell states
|
||||
|
||||
### Counterfactual Prediction
|
||||
|
||||
**Goal**: Predict outcome under different conditions
|
||||
|
||||
**Example**: "What would this cell look like if from different batch?"
|
||||
|
||||
**Method**:
|
||||
1. Encode cell to latent: z = Encoder(x, s_original)
|
||||
2. Decode with new condition: x_new = Decoder(z, s_new)
|
||||
3. x_new is counterfactual prediction
|
||||
|
||||
**Applications**:
|
||||
- Batch effect assessment
|
||||
- Predicting treatment response
|
||||
- In silico perturbation studies
|
||||
|
||||
### Posterior Predictive Distribution
|
||||
|
||||
**Definition**: Distribution of new data given observed data
|
||||
|
||||
```
|
||||
p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz
|
||||
```
|
||||
|
||||
**Estimation**: Sample z from q(z|x), generate x_new from p(x_new|z)
|
||||
|
||||
**Uses**:
|
||||
- Uncertainty quantification
|
||||
- Robust predictions
|
||||
- Outlier detection
|
||||
|
||||
## Differential Expression Framework
|
||||
|
||||
### Bayesian Approach
|
||||
|
||||
**Traditional methods**: Compare point estimates
|
||||
- Wilcoxon, t-test, etc.
|
||||
- Ignore uncertainty
|
||||
- Require pseudocounts
|
||||
|
||||
**scvi-tools approach**: Compare distributions
|
||||
- Sample from posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
|
||||
- Compute log fold-change: LFC = log(μ_B) - log(μ_A)
|
||||
- Posterior distribution of LFC quantifies uncertainty
|
||||
|
||||
### Bayes Factor
|
||||
|
||||
**Definition**: Ratio of posterior odds to prior odds
|
||||
|
||||
```
|
||||
BF = P(H₁|data) / P(H₀|data)
|
||||
─────────────────────────
|
||||
P(H₁) / P(H₀)
|
||||
```
|
||||
|
||||
**Interpretation**:
|
||||
- BF > 3: Moderate evidence for H₁
|
||||
- BF > 10: Strong evidence
|
||||
- BF > 100: Decisive evidence
|
||||
|
||||
**In scvi-tools**: Used to rank genes by evidence for DE
|
||||
|
||||
### False Discovery Proportion (FDP)
|
||||
|
||||
**Goal**: Control expected false discovery rate
|
||||
|
||||
**Procedure**:
|
||||
1. For each gene, compute posterior probability of DE
|
||||
2. Rank genes by evidence (Bayes factor)
|
||||
3. Select top k genes such that E[FDP] ≤ α
|
||||
|
||||
**Advantage over p-values**:
|
||||
- Fully Bayesian
|
||||
- Natural for posterior inference
|
||||
- No arbitrary thresholds
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Optimization
|
||||
|
||||
**Optimizer**: Adam (adaptive learning rates)
|
||||
- Default lr = 0.001
|
||||
- Momentum parameters: β₁=0.9, β₂=0.999
|
||||
|
||||
**Training loop**:
|
||||
1. Sample mini-batch of cells
|
||||
2. Compute ELBO loss
|
||||
3. Backpropagate gradients
|
||||
4. Update parameters with Adam
|
||||
5. Repeat until convergence
|
||||
|
||||
**Convergence criteria**:
|
||||
- ELBO plateaus on validation set
|
||||
- Early stopping prevents overfitting
|
||||
- Typically 200-500 epochs
|
||||
|
||||
### Regularization
|
||||
|
||||
**KL annealing**: Gradually increase KL weight
|
||||
- Prevents posterior collapse
|
||||
- Starts at 0, increases to 1 over epochs
|
||||
|
||||
**Dropout**: Random neuron dropping during training
|
||||
- Default: 0.1 dropout rate
|
||||
- Prevents overfitting
|
||||
- Improves generalization
|
||||
|
||||
**Weight decay**: L2 regularization on weights
|
||||
- Prevents large weights
|
||||
- Improves stability
|
||||
|
||||
### Scalability
|
||||
|
||||
**Mini-batch training**:
|
||||
- Process subset of cells per iteration
|
||||
- Batch size: 64-256 cells
|
||||
- Enables scaling to millions of cells
|
||||
|
||||
**Stochastic optimization**:
|
||||
- Estimates ELBO on mini-batches
|
||||
- Unbiased gradient estimates
|
||||
- Converges to optimal solution
|
||||
|
||||
**GPU acceleration**:
|
||||
- Neural networks naturally parallelize
|
||||
- Order of magnitude speedup
|
||||
- Essential for large datasets
|
||||
|
||||
## Connections to Other Methods
|
||||
|
||||
### vs. PCA
|
||||
- **PCA**: Linear, deterministic
|
||||
- **scVI**: Nonlinear, probabilistic
|
||||
- **Advantage**: scVI captures complex structure, handles counts
|
||||
|
||||
### vs. t-SNE/UMAP
|
||||
- **t-SNE/UMAP**: Visualization-focused
|
||||
- **scVI**: Full generative model
|
||||
- **Advantage**: scVI enables downstream tasks (DE, imputation)
|
||||
|
||||
### vs. Seurat Integration
|
||||
- **Seurat**: Anchor-based alignment
|
||||
- **scVI**: Probabilistic modeling
|
||||
- **Advantage**: scVI provides uncertainty, works for multiple batches
|
||||
|
||||
### vs. Harmony
|
||||
- **Harmony**: PCA + batch correction
|
||||
- **scVI**: VAE-based
|
||||
- **Advantage**: scVI handles counts natively, more flexible
|
||||
|
||||
## Mathematical Notation
|
||||
|
||||
**Common symbols**:
|
||||
- x: Observed gene expression (counts)
|
||||
- z: Latent representation
|
||||
- θ: Model parameters
|
||||
- q(z|x): Approximate posterior (encoder)
|
||||
- p(x|z): Likelihood (decoder)
|
||||
- p(z): Prior on latent variables
|
||||
- μ, σ²: Mean and variance
|
||||
- π: Dropout probability (ZINB)
|
||||
- θ (in NB): Dispersion parameter
|
||||
- s: Batch/covariate indicator
|
||||
|
||||
## Further Reading
|
||||
|
||||
**Key Papers**:
|
||||
1. Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
|
||||
2. Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics"
|
||||
3. Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"
|
||||
|
||||
**Concepts to explore**:
|
||||
- Variational inference in machine learning
|
||||
- Bayesian deep learning
|
||||
- Information theory (KL divergence, mutual information)
|
||||
- Generative models (GANs, normalizing flows, diffusion models)
|
||||
- Probabilistic programming (Pyro, PyTorch)
|
||||
|
||||
**Mathematical background**:
|
||||
- Probability theory and statistics
|
||||
- Linear algebra and calculus
|
||||
- Optimization theory
|
||||
- Information theory
|
||||
546
skills/scvi-tools/references/workflows.md
Normal file
546
skills/scvi-tools/references/workflows.md
Normal file
@@ -0,0 +1,546 @@
|
||||
# Common Workflows and Best Practices
|
||||
|
||||
This document covers common workflows, best practices, and advanced usage patterns for scvi-tools.
|
||||
|
||||
## Standard Analysis Workflow
|
||||
|
||||
### 1. Data Loading and Preparation
|
||||
|
||||
```python
|
||||
import scvi
|
||||
import scanpy as sc
|
||||
import numpy as np
|
||||
|
||||
# Load data (AnnData format required)
|
||||
adata = sc.read_h5ad("data.h5ad")
|
||||
# Or load from other formats
|
||||
# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
|
||||
# adata = sc.read_csv("counts.csv")
|
||||
|
||||
# Basic QC metrics
|
||||
sc.pp.calculate_qc_metrics(adata, inplace=True)
|
||||
adata.var['mt'] = adata.var_names.str.startswith('MT-')
|
||||
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
|
||||
```
|
||||
|
||||
### 2. Quality Control
|
||||
|
||||
```python
|
||||
# Filter cells
|
||||
sc.pp.filter_cells(adata, min_genes=200)
|
||||
sc.pp.filter_cells(adata, max_genes=5000)
|
||||
|
||||
# Filter genes
|
||||
sc.pp.filter_genes(adata, min_cells=3)
|
||||
|
||||
# Filter by mitochondrial content
|
||||
adata = adata[adata.obs['pct_counts_mt'] < 20, :]
|
||||
|
||||
# Remove doublets (optional, before training)
|
||||
sc.external.pp.scrublet(adata)
|
||||
adata = adata[~adata.obs['predicted_doublet'], :]
|
||||
```
|
||||
|
||||
### 3. Preprocessing for scvi-tools
|
||||
|
||||
```python
|
||||
# IMPORTANT: scvi-tools needs RAW counts
|
||||
# If you've already normalized, use the raw layer or reload data
|
||||
|
||||
# Save raw counts if not already available
|
||||
if 'counts' not in adata.layers:
|
||||
adata.layers['counts'] = adata.X.copy()
|
||||
|
||||
# Feature selection (optional but recommended)
|
||||
sc.pp.highly_variable_genes(
|
||||
adata,
|
||||
n_top_genes=4000,
|
||||
subset=False, # Keep all genes, just mark HVGs
|
||||
batch_key="batch" # If multiple batches
|
||||
)
|
||||
|
||||
# Filter to HVGs (optional)
|
||||
# adata = adata[:, adata.var['highly_variable']]
|
||||
```
|
||||
|
||||
### 4. Register Data with scvi-tools
|
||||
|
||||
```python
|
||||
# Setup AnnData for scvi-tools
|
||||
scvi.model.SCVI.setup_anndata(
|
||||
adata,
|
||||
layer="counts", # Use raw counts
|
||||
batch_key="batch", # Technical batches
|
||||
categorical_covariate_keys=["donor", "condition"],
|
||||
continuous_covariate_keys=["percent_mito", "n_counts"]
|
||||
)
|
||||
|
||||
# Check registration
|
||||
adata.uns['_scvi']['summary_stats']
|
||||
```
|
||||
|
||||
### 5. Model Training
|
||||
|
||||
```python
|
||||
# Create model
|
||||
model = scvi.model.SCVI(
|
||||
adata,
|
||||
n_latent=30, # Latent dimensions
|
||||
n_layers=2, # Network depth
|
||||
n_hidden=128, # Hidden layer size
|
||||
dropout_rate=0.1,
|
||||
gene_likelihood="zinb" # zero-inflated negative binomial
|
||||
)
|
||||
|
||||
# Train model
|
||||
model.train(
|
||||
max_epochs=400,
|
||||
batch_size=128,
|
||||
train_size=0.9,
|
||||
early_stopping=True,
|
||||
check_val_every_n_epoch=10
|
||||
)
|
||||
|
||||
# View training history
|
||||
train_history = model.history["elbo_train"]
|
||||
val_history = model.history["elbo_validation"]
|
||||
```
|
||||
|
||||
### 6. Extract Results
|
||||
|
||||
```python
|
||||
# Get latent representation
|
||||
latent = model.get_latent_representation()
|
||||
adata.obsm["X_scVI"] = latent
|
||||
|
||||
# Get normalized expression
|
||||
normalized = model.get_normalized_expression(
|
||||
adata,
|
||||
library_size=1e4,
|
||||
n_samples=25 # Monte Carlo samples
|
||||
)
|
||||
adata.layers["scvi_normalized"] = normalized
|
||||
```
|
||||
|
||||
### 7. Downstream Analysis
|
||||
|
||||
```python
|
||||
# Clustering on scVI latent space
|
||||
sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=15)
|
||||
sc.tl.umap(adata, min_dist=0.3)
|
||||
sc.tl.leiden(adata, resolution=0.8, key_added="leiden")
|
||||
|
||||
# Visualization
|
||||
sc.pl.umap(adata, color=["leiden", "batch", "cell_type"])
|
||||
|
||||
# Differential expression
|
||||
de_results = model.differential_expression(
|
||||
groupby="leiden",
|
||||
group1="0",
|
||||
group2="1",
|
||||
mode="change",
|
||||
delta=0.25
|
||||
)
|
||||
```
|
||||
|
||||
### 8. Model Persistence
|
||||
|
||||
```python
|
||||
# Save model
|
||||
model_dir = "./scvi_model/"
|
||||
model.save(model_dir, overwrite=True)
|
||||
|
||||
# Save AnnData with results
|
||||
adata.write("analyzed_data.h5ad")
|
||||
|
||||
# Load model later
|
||||
model = scvi.model.SCVI.load(model_dir, adata=adata)
|
||||
```
|
||||
|
||||
## Hyperparameter Tuning
|
||||
|
||||
### Key Hyperparameters
|
||||
|
||||
**Architecture**:
|
||||
- `n_latent`: Latent space dimensionality (10-50)
|
||||
- Larger for complex, heterogeneous datasets
|
||||
- Smaller for simple datasets or to prevent overfitting
|
||||
- `n_layers`: Number of hidden layers (1-3)
|
||||
- More layers for complex data, but diminishing returns
|
||||
- `n_hidden`: Nodes per hidden layer (64-256)
|
||||
- Scale with dataset size and complexity
|
||||
|
||||
**Training**:
|
||||
- `max_epochs`: Training iterations (200-500)
|
||||
- Use early stopping to prevent overfitting
|
||||
- `batch_size`: Samples per batch (64-256)
|
||||
- Larger for big datasets, smaller for limited memory
|
||||
- `lr`: Learning rate (0.001 default, usually good)
|
||||
|
||||
**Model-specific**:
|
||||
- `gene_likelihood`: Distribution ("zinb", "nb", "poisson")
|
||||
- "zinb" for sparse data with zero-inflation
|
||||
- "nb" for less sparse data
|
||||
- `dispersion`: Gene or gene-batch specific
|
||||
- "gene" for simple, "gene-batch" for complex batch effects
|
||||
|
||||
### Hyperparameter Search Example
|
||||
|
||||
```python
|
||||
from scvi.model import SCVI
|
||||
|
||||
# Define search space
|
||||
latent_dims = [10, 20, 30]
|
||||
n_layers_options = [1, 2]
|
||||
|
||||
best_score = float('-inf')
|
||||
best_params = None
|
||||
|
||||
for n_latent in latent_dims:
|
||||
for n_layers in n_layers_options:
|
||||
model = SCVI(
|
||||
adata,
|
||||
n_latent=n_latent,
|
||||
n_layers=n_layers
|
||||
)
|
||||
model.train(max_epochs=200)
|
||||
|
||||
# Evaluate on validation set
|
||||
val_elbo = model.history["elbo_validation"][-1]
|
||||
|
||||
if val_elbo > best_score:
|
||||
best_score = val_elbo
|
||||
best_params = {"n_latent": n_latent, "n_layers": n_layers}
|
||||
|
||||
print(f"Best params: {best_params}")
|
||||
```
|
||||
|
||||
### Using Optuna for Hyperparameter Optimization
|
||||
|
||||
```python
|
||||
import optuna
|
||||
|
||||
def objective(trial):
|
||||
n_latent = trial.suggest_int("n_latent", 10, 50)
|
||||
n_layers = trial.suggest_int("n_layers", 1, 3)
|
||||
n_hidden = trial.suggest_categorical("n_hidden", [64, 128, 256])
|
||||
|
||||
model = scvi.model.SCVI(
|
||||
adata,
|
||||
n_latent=n_latent,
|
||||
n_layers=n_layers,
|
||||
n_hidden=n_hidden
|
||||
)
|
||||
|
||||
model.train(max_epochs=200, early_stopping=True)
|
||||
return model.history["elbo_validation"][-1]
|
||||
|
||||
study = optuna.create_study(direction="maximize")
|
||||
study.optimize(objective, n_trials=20)
|
||||
|
||||
print(f"Best parameters: {study.best_params}")
|
||||
```
|
||||
|
||||
## GPU Acceleration
|
||||
|
||||
### Enable GPU Training
|
||||
|
||||
```python
|
||||
# Automatic GPU detection
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train(accelerator="auto") # Uses GPU if available
|
||||
|
||||
# Force GPU
|
||||
model.train(accelerator="gpu")
|
||||
|
||||
# Multi-GPU
|
||||
model.train(accelerator="gpu", devices=2)
|
||||
|
||||
# Check if GPU is being used
|
||||
import torch
|
||||
print(f"CUDA available: {torch.cuda.is_available()}")
|
||||
print(f"GPU count: {torch.cuda.device_count()}")
|
||||
```
|
||||
|
||||
### GPU Memory Management
|
||||
|
||||
```python
|
||||
# Reduce batch size if OOM
|
||||
model.train(batch_size=64) # Instead of default 128
|
||||
|
||||
# Mixed precision training (saves memory)
|
||||
model.train(precision=16)
|
||||
|
||||
# Clear cache between runs
|
||||
import torch
|
||||
torch.cuda.empty_cache()
|
||||
```
|
||||
|
||||
## Batch Integration Strategies
|
||||
|
||||
### Strategy 1: Simple Batch Key
|
||||
|
||||
```python
|
||||
# For standard batch correction
|
||||
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
|
||||
model = scvi.model.SCVI(adata)
|
||||
```
|
||||
|
||||
### Strategy 2: Multiple Covariates
|
||||
|
||||
```python
|
||||
# Correct for multiple technical factors
|
||||
scvi.model.SCVI.setup_anndata(
|
||||
adata,
|
||||
batch_key="sequencing_batch",
|
||||
categorical_covariate_keys=["donor", "tissue"],
|
||||
continuous_covariate_keys=["percent_mito"]
|
||||
)
|
||||
```
|
||||
|
||||
### Strategy 3: Hierarchical Batches
|
||||
|
||||
```python
|
||||
# When batches have hierarchical structure
|
||||
# E.g., samples within studies
|
||||
adata.obs["batch_hierarchy"] = (
|
||||
adata.obs["study"].astype(str) + "_" +
|
||||
adata.obs["sample"].astype(str)
|
||||
)
|
||||
|
||||
scvi.model.SCVI.setup_anndata(adata, batch_key="batch_hierarchy")
|
||||
```
|
||||
|
||||
## Reference Mapping (scArches)
|
||||
|
||||
### Training Reference Model
|
||||
|
||||
```python
|
||||
# Train on reference dataset
|
||||
scvi.model.SCVI.setup_anndata(ref_adata, batch_key="batch")
|
||||
ref_model = scvi.model.SCVI(ref_adata)
|
||||
ref_model.train()
|
||||
|
||||
# Save reference
|
||||
ref_model.save("reference_model")
|
||||
```
|
||||
|
||||
### Mapping Query to Reference
|
||||
|
||||
```python
|
||||
# Load reference
|
||||
ref_model = scvi.model.SCVI.load("reference_model", adata=ref_adata)
|
||||
|
||||
# Setup query with same parameters
|
||||
scvi.model.SCVI.setup_anndata(query_adata, batch_key="batch")
|
||||
|
||||
# Transfer learning
|
||||
query_model = scvi.model.SCVI.load_query_data(
|
||||
query_adata,
|
||||
"reference_model"
|
||||
)
|
||||
|
||||
# Fine-tune on query (optional)
|
||||
query_model.train(max_epochs=200)
|
||||
|
||||
# Get query embeddings
|
||||
query_latent = query_model.get_latent_representation()
|
||||
|
||||
# Transfer labels using KNN
|
||||
from sklearn.neighbors import KNeighborsClassifier
|
||||
|
||||
knn = KNeighborsClassifier(n_neighbors=15)
|
||||
knn.fit(ref_model.get_latent_representation(), ref_adata.obs["cell_type"])
|
||||
query_adata.obs["predicted_cell_type"] = knn.predict(query_latent)
|
||||
```
|
||||
|
||||
## Model Minification
|
||||
|
||||
Reduce model size for faster inference:
|
||||
|
||||
```python
|
||||
# Train full model
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train()
|
||||
|
||||
# Minify for deployment
|
||||
minified = model.minify_adata(adata)
|
||||
|
||||
# Save minified version
|
||||
minified.write("minified_data.h5ad")
|
||||
model.save("minified_model")
|
||||
|
||||
# Load and use (much faster)
|
||||
mini_model = scvi.model.SCVI.load("minified_model", adata=minified)
|
||||
```
|
||||
|
||||
## Memory-Efficient Data Loading
|
||||
|
||||
### Using AnnDataLoader
|
||||
|
||||
```python
|
||||
from scvi.data import AnnDataLoader
|
||||
|
||||
# For very large datasets
|
||||
dataloader = AnnDataLoader(
|
||||
adata,
|
||||
batch_size=128,
|
||||
shuffle=True,
|
||||
drop_last=False
|
||||
)
|
||||
|
||||
# Custom training loop (advanced)
|
||||
for batch in dataloader:
|
||||
# Process batch
|
||||
pass
|
||||
```
|
||||
|
||||
### Using Backed AnnData
|
||||
|
||||
```python
|
||||
# For data too large for memory
|
||||
adata = sc.read_h5ad("huge_dataset.h5ad", backed='r')
|
||||
|
||||
# scvi-tools works with backed mode
|
||||
scvi.model.SCVI.setup_anndata(adata)
|
||||
model = scvi.model.SCVI(adata)
|
||||
model.train()
|
||||
```
|
||||
|
||||
## Model Interpretation
|
||||
|
||||
### Feature Importance with SHAP
|
||||
|
||||
```python
|
||||
import shap
|
||||
|
||||
# Get SHAP values for interpretability
|
||||
explainer = shap.DeepExplainer(model.module, background_data)
|
||||
shap_values = explainer.shap_values(test_data)
|
||||
|
||||
# Visualize
|
||||
shap.summary_plot(shap_values, feature_names=adata.var_names)
|
||||
```
|
||||
|
||||
### Gene Correlation Analysis
|
||||
|
||||
```python
|
||||
# Get gene-gene correlation matrix
|
||||
correlation = model.get_feature_correlation_matrix(
|
||||
adata,
|
||||
transform_batch="batch1"
|
||||
)
|
||||
|
||||
# Visualize top correlated genes
|
||||
import seaborn as sns
|
||||
sns.heatmap(correlation[:50, :50], cmap="coolwarm")
|
||||
```
|
||||
|
||||
## Troubleshooting Common Issues
|
||||
|
||||
### Issue: NaN Loss During Training
|
||||
|
||||
**Causes**:
|
||||
- Learning rate too high
|
||||
- Unnormalized input (must use raw counts)
|
||||
- Data quality issues
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Reduce learning rate
|
||||
model.train(lr=0.0001)
|
||||
|
||||
# Check data
|
||||
assert adata.X.min() >= 0 # No negative values
|
||||
assert np.isnan(adata.X).sum() == 0 # No NaNs
|
||||
|
||||
# Use more stable likelihood
|
||||
model = scvi.model.SCVI(adata, gene_likelihood="nb")
|
||||
```
|
||||
|
||||
### Issue: Poor Batch Correction
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Increase batch effect modeling
|
||||
model = scvi.model.SCVI(
|
||||
adata,
|
||||
encode_covariates=True, # Encode batch in encoder
|
||||
deeply_inject_covariates=False
|
||||
)
|
||||
|
||||
# Or try opposite
|
||||
model = scvi.model.SCVI(adata, deeply_inject_covariates=True)
|
||||
|
||||
# Use more latent dimensions
|
||||
model = scvi.model.SCVI(adata, n_latent=50)
|
||||
```
|
||||
|
||||
### Issue: Model Not Training (ELBO Not Decreasing)
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Increase learning rate
|
||||
model.train(lr=0.005)
|
||||
|
||||
# Increase network capacity
|
||||
model = scvi.model.SCVI(adata, n_hidden=256, n_layers=2)
|
||||
|
||||
# Train longer
|
||||
model.train(max_epochs=500)
|
||||
```
|
||||
|
||||
### Issue: Out of Memory (OOM)
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Reduce batch size
|
||||
model.train(batch_size=64)
|
||||
|
||||
# Use mixed precision
|
||||
model.train(precision=16)
|
||||
|
||||
# Reduce model size
|
||||
model = scvi.model.SCVI(adata, n_latent=10, n_hidden=64)
|
||||
|
||||
# Use backed AnnData
|
||||
adata = sc.read_h5ad("data.h5ad", backed='r')
|
||||
```
|
||||
|
||||
## Performance Benchmarking
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
# Time training
|
||||
start = time.time()
|
||||
model.train(max_epochs=400)
|
||||
training_time = time.time() - start
|
||||
print(f"Training time: {training_time:.2f}s")
|
||||
|
||||
# Time inference
|
||||
start = time.time()
|
||||
latent = model.get_latent_representation()
|
||||
inference_time = time.time() - start
|
||||
print(f"Inference time: {inference_time:.2f}s")
|
||||
|
||||
# Memory usage
|
||||
import psutil
|
||||
import os
|
||||
process = psutil.Process(os.getpid())
|
||||
memory_gb = process.memory_info().rss / 1024**3
|
||||
print(f"Memory usage: {memory_gb:.2f} GB")
|
||||
```
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
1. **Always use raw counts**: Never log-normalize before scvi-tools
|
||||
2. **Feature selection**: Use highly variable genes for efficiency
|
||||
3. **Batch correction**: Register all known technical covariates
|
||||
4. **Early stopping**: Use validation set to prevent overfitting
|
||||
5. **Model saving**: Always save trained models
|
||||
6. **GPU usage**: Use GPU for large datasets (>10k cells)
|
||||
7. **Hyperparameter tuning**: Start with defaults, tune if needed
|
||||
8. **Validation**: Check batch correction visually (UMAP colored by batch)
|
||||
9. **Documentation**: Keep track of preprocessing steps
|
||||
10. **Reproducibility**: Set random seeds (`scvi.settings.seed = 0`)
|
||||
Reference in New Issue
Block a user