Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/scvi-tools/references/models-scrna-seq.md
+++ b/skills/scvi-tools/references/models-scrna-seq.md
@@ -0,0 +1,330 @@
+# Single-Cell RNA-seq Models
+
+This document covers core models for analyzing single-cell RNA sequencing data in scvi-tools.
+
+## scVI (Single-Cell Variational Inference)
+
+**Purpose**: Unsupervised analysis, dimensionality reduction, and batch correction for scRNA-seq data.
+
+**Key Features**:
+- Deep generative model based on variational autoencoders (VAE)
+- Learns low-dimensional latent representations that capture biological variation
+- Corrects for batch effects and other technical covariates registered in `setup_anndata` (e.g. via `batch_key`)
+- Enables normalized gene expression estimation
+- Supports differential expression analysis
+
+**When to Use**:
+- Initial exploration and dimensionality reduction of scRNA-seq datasets
+- Integrating multiple batches or studies
+- Generating batch-corrected expression matrices
+- Performing probabilistic differential expression analysis
+
+**Basic Usage**:
+```python
+import scvi
+
+# Setup data
+scvi.model.SCVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch"
+)
+
+# Train model
+model = scvi.model.SCVI(adata, n_latent=30)
+model.train()
+
+# Extract results
+latent = model.get_latent_representation()
+normalized = model.get_normalized_expression()
+```
+
+**Key Parameters**:
+- `n_latent`: Dimensionality of latent space (default: 10)
+- `n_layers`: Number of hidden layers (default: 1)
+- `n_hidden`: Number of nodes per hidden layer (default: 128)
+- `dropout_rate`: Dropout rate for neural networks (default: 0.1)
+- `dispersion`: How the dispersion parameter is shared: "gene" (default, one value per gene), "gene-batch", "gene-label", or "gene-cell"
+- `gene_likelihood`: Distribution for data ("zinb", "nb", "poisson")
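+
+To make the `gene_likelihood` and `dispersion` options more concrete, the following is a minimal pure-Python sketch of the negative binomial log-probability in the mean / inverse-dispersion parameterization. The function name `nb_log_prob` and the example values are illustrative assumptions, not part of the scvi-tools API.
+
+```python
+import math
+
+
+def nb_log_prob(x: int, mu: float, theta: float) -> float:
+    """Log-probability of count x under a negative binomial with mean mu
+    and inverse dispersion theta (hypothetical helper for illustration)."""
+    return (
+        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
+        + theta * math.log(theta / (theta + mu))
+        + x * math.log(mu / (theta + mu))
+    )
+
+
+# Sanity checks on made-up values: probabilities sum to ~1 and the mean is mu
+probs = [math.exp(nb_log_prob(x, mu=5.0, theta=2.0)) for x in range(500)]
+total = sum(probs)
+mean = sum(x * p for x, p in enumerate(probs))
+```
+
+With `dispersion="gene"`, a single `theta` per gene is shared by all cells; the other settings let it vary by batch, label, or cell.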
+
+**Outputs**:
+- `get_latent_representation()`: Batch-corrected low-dimensional embeddings
+- `get_normalized_expression()`: Denoised, normalized expression values
+- `differential_expression()`: Probabilistic DE testing between groups
+- `get_feature_correlation_matrix()`: Gene-gene correlation estimates
+
+## scANVI (Single-Cell ANnotation using Variational Inference)
+
+**Purpose**: Semi-supervised cell type annotation and integration using labeled and unlabeled cells.
+
+**Key Features**:
+- Extends scVI with cell type labels
+- Leverages partially labeled datasets for annotation transfer
+- Performs simultaneous batch correction and cell type prediction
+- Enables query-to-reference mapping
+
+**When to Use**:
+- Annotating new datasets using reference labels
+- Transfer learning from well-annotated to unlabeled datasets
+- Joint analysis of labeled and unlabeled cells
+- Building cell type classifiers with uncertainty quantification
+
+**Basic Usage**:
+```python
+# Option 1: Train from scratch
+scvi.model.SCANVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch",
+    labels_key="cell_type",
+    unlabeled_category="Unknown"
+)
+model = scvi.model.SCANVI(adata)
+model.train()
+
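+# Option 2: Initialize from a pretrained scVI model.
+# The scVI model needs its own setup; registering labels_key here lets scANVI reuse it.
+scvi.model.SCVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch",
+    labels_key="cell_type"
+)
+scvi_model = scvi.model.SCVI(adata)
+scvi_model.train()
+scanvi_model = scvi.model.SCANVI.from_scvi_model(
+    scvi_model,
+    unlabeled_category="Unknown"
+)
+scanvi_model.train()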
+
+# Predict cell types
+predictions = scanvi_model.predict()
+```
+
+**Key Parameters**:
+- `labels_key`: Column in `adata.obs` containing cell type labels
+- `unlabeled_category`: Label for cells without annotations
+- All scVI parameters are also available
+
+**Outputs**:
+- `predict()`: Hard cell type predictions for all cells
+- `predict(soft=True)`: Per-class prediction probabilities
+- `get_latent_representation()`: Cell type-aware latent space
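+
+Per-cell class probabilities can be turned into labels with an explicit confidence cutoff. The sketch below uses plain dictionaries as a stand-in for the model output; `call_labels`, the 0.8 threshold, and the `"Uncertain"` fallback are assumptions chosen for illustration.
+
+```python
+def call_labels(probs, min_confidence=0.8, fallback="Uncertain"):
+    """Pick the most probable label per cell, or a fallback when confidence is low."""
+    calls = {}
+    for cell, dist in probs.items():
+        label, p = max(dist.items(), key=lambda kv: kv[1])
+        calls[cell] = label if p >= min_confidence else fallback
+    return calls
+
+
+# Made-up probabilities for two cells
+soft = {
+    "cell_1": {"T cell": 0.95, "B cell": 0.05},
+    "cell_2": {"T cell": 0.55, "B cell": 0.45},
+}
+calls = call_labels(soft)
+```
+
+Cells that fall below the cutoff are good candidates for manual review rather than automatic annotation.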
+
+## AUTOZI
+
+**Purpose**: Automatic identification and modeling of zero-inflated genes in scRNA-seq data.
+
+**Key Features**:
+- Distinguishes biological zeros from technical dropout
+- Learns which genes exhibit zero-inflation
+- Provides gene-specific zero-inflation probabilities
+- Improves downstream analysis by accounting for dropout
+
+**When to Use**:
+- Detecting which genes are affected by technical dropout
+- Improving imputation and normalization for sparse datasets
+- Understanding the extent of zero-inflation in your data
+
+**Basic Usage**:
+```python
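+from scipy.stats import beta
+
+scvi.model.AUTOZI.setup_anndata(adata, layer="counts")
+model = scvi.model.AUTOZI(adata)
+model.train()
+
+# Parameters of the per-gene Beta posterior over the zero-inflation mixture weight
+outputs = model.get_alphas_betas()
+alpha_posterior = outputs["alpha_posterior"]
+beta_posterior = outputs["beta_posterior"]
+
+# Posterior probability that each gene is zero-inflated (0.5 cutoff, as in the AutoZI tutorial)
+zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)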
+```
+
+## VeloVI
+
+**Purpose**: RNA velocity analysis using variational inference.
+
+**Key Features**:
+- Joint modeling of spliced and unspliced RNA counts
+- Probabilistic estimation of RNA velocity
+- Accounts for technical noise in spliced and unspliced abundances
+- Provides uncertainty quantification for velocity estimates
+
+**When to Use**:
+- Inferring cellular dynamics and differentiation trajectories
+- Analyzing spliced/unspliced count data
+- RNA velocity analysis where uncertainty in the estimates matters
+
+**Basic Usage**:
+```python
+import scvelo as scv
+
+# Prepare velocity data
+scv.pp.filter_and_normalize(adata)
+scv.pp.moments(adata)
+
+# Train VeloVI (exposed under scvi.external)
+scvi.external.VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu")
+model = scvi.external.VELOVI(adata)
+model.train()
+
+# Get velocity estimates
+latent_time = model.get_latent_time()
+velocities = model.get_velocity()
+```
+
+## contrastiveVI
+
+**Purpose**: Isolating perturbation-specific variations from background biological variation.
+
+**Key Features**:
+- Separates shared variation (common across conditions) from target-specific variation
+- Useful for perturbation studies (drug treatments, genetic perturbations)
+- Identifies condition-specific gene programs
+- Enables discovery of treatment-specific effects
+
+**When to Use**:
+- Analyzing perturbation experiments (drug screens, CRISPR, etc.)
+- Identifying genes responding specifically to treatments
+- Separating treatment effects from background variation
+- Comparing control vs. perturbed conditions
+
+**Basic Usage**:
+```python
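+import numpy as np
+
+# contrastiveVI lives in scvi.external; the condition split is passed at training time
+scvi.external.ContrastiveVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch"
+)
+
+model = scvi.external.ContrastiveVI(
+    adata,
+    n_background_latent=10,  # Shared (background) variation
+    n_salient_latent=5       # Target-specific (salient) variation
+)
+
+# Assumes adata.obs["condition"] holds "control" / "treated" labels
+background_indices = np.where(adata.obs["condition"] == "control")[0]
+target_indices = np.where(adata.obs["condition"] == "treated")[0]
+model.train(
+    background_indices=background_indices,
+    target_indices=target_indices
+)
+
+# Extract representations
+shared = model.get_latent_representation(representation_kind="background")
+target_specific = model.get_latent_representation(representation_kind="salient")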
+```
+
+## CellAssign
+
+**Purpose**: Marker-based cell type annotation using known marker genes.
+
+**Key Features**:
+- Uses prior knowledge of marker genes for cell types
+- Probabilistic assignment of cells to types
+- Handles marker gene overlap and ambiguity
+- Provides soft assignments with uncertainty
+
+**When to Use**:
+- Annotating cells using known marker genes
+- Leveraging existing biological knowledge for classification
+- Cases where marker gene lists are available but reference datasets are not
+
+**Basic Usage**:
+```python
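+import numpy as np
+import pandas as pd
+
+# Binary marker matrix: rows are genes, columns are cell types
+marker_gene_mat = pd.DataFrame({
+    "CD4 T cells": [1, 1, 0, 0],  # CD3D, CD4, CD8A, CD19
+    "CD8 T cells": [1, 0, 1, 0],
+    "B cells": [0, 0, 0, 1]
+}, index=["CD3D", "CD4", "CD8A", "CD19"])
+
+# CellAssign needs per-cell size factors, computed here from total counts
+lib_size = np.asarray(adata.layers["counts"].sum(axis=1)).ravel()
+adata.obs["size_factor"] = lib_size / lib_size.mean()
+
+# Restrict the data to the marker genes before setup
+bdata = adata[:, marker_gene_mat.index].copy()
+
+scvi.external.CellAssign.setup_anndata(bdata, size_factor_key="size_factor", layer="counts")
+model = scvi.external.CellAssign(bdata, marker_gene_mat)
+model.train()
+
+# predict() returns per-cell probabilities for each cell type
+predictions = model.predict()
+bdata.obs["cellassign_prediction"] = predictions.idxmax(axis=1).values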
+```
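+
+The marker matrix can also be generated from a plain mapping of cell types to marker genes, which is easier to maintain than hand-written 0/1 columns. This is a minimal sketch; `build_marker_matrix` is a hypothetical helper, not a scvi-tools function.
+
+```python
+def build_marker_matrix(markers):
+    """Return a nested dict {gene: {cell_type: 0 or 1}} from {cell_type: [genes]}."""
+    cell_types = list(markers)
+    genes = sorted({g for gene_list in markers.values() for g in gene_list})
+    return {
+        gene: {ct: int(gene in markers[ct]) for ct in cell_types}
+        for gene in genes
+    }
+
+
+markers = {
+    "CD4 T cells": ["CD3D", "CD4"],
+    "CD8 T cells": ["CD3D", "CD8A"],
+    "B cells": ["CD19"],
+}
+matrix = build_marker_matrix(markers)
+```
+
+The nested dict has genes as outer keys, so `pd.DataFrame.from_dict(matrix, orient="index")` yields a genes-by-cell-types table.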
+
+## Solo (Doublet Detection)
+
+**Purpose**: Identifying doublets (droplets that captured two or more cells) in scRNA-seq data.
+
+**Key Features**:
+- Semi-supervised doublet detection using scVI embeddings
+- Simulates artificial doublets for training
+- Provides doublet probability scores
+- Builds on an already trained scVI model via `from_scvi_model`
+
+**When to Use**:
+- Quality control of scRNA-seq datasets
+- Removing doublets before downstream analysis
+- Assessing doublet rates in your data
+
+**Basic Usage**:
+```python
+# Train scVI model first
+scvi.model.SCVI.setup_anndata(adata, layer="counts")
+scvi_model = scvi.model.SCVI(adata)
+scvi_model.train()
+
+# Train Solo for doublet detection
+solo_model = scvi.external.SOLO.from_scvi_model(scvi_model)
+solo_model.train()
+
+# Predict doublets
+predictions = solo_model.predict()
+doublet_scores = predictions["doublet"]
+adata.obs["doublet_score"] = doublet_scores
+```
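+
+Once scores are stored, a cutoff turns them into doublet calls and an estimated doublet rate. The sketch below assumes the scores are already probabilities in [0, 1]; `flag_doublets` and the 0.5 threshold are illustrative choices, not part of Solo.
+
+```python
+def flag_doublets(scores, threshold=0.5):
+    """Return per-cell doublet flags and the fraction of cells flagged."""
+    flags = {cell: score > threshold for cell, score in scores.items()}
+    rate = sum(flags.values()) / len(flags)
+    return flags, rate
+
+
+# Made-up scores for four cells
+scores = {"cell_a": 0.9, "cell_b": 0.1, "cell_c": 0.7, "cell_d": 0.2}
+flags, rate = flag_doublets(scores)
+```
+
+Comparing `rate` against the doublet rate expected for the loading density is a quick plausibility check before filtering.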
+
+## Amortized LDA (Topic Modeling)
+
+**Purpose**: Topic modeling for gene expression using Latent Dirichlet Allocation.
+
+**Key Features**:
+- Discovers gene expression programs (topics)
+- Amortized variational inference for scalability
+- Each cell is a mixture of topics
+- Each topic is a distribution over genes
+
+**When to Use**:
+- Discovering gene programs or expression modules
+- Understanding compositional structure of expression
+- Alternative dimensionality reduction approach
+- Interpretable decomposition of expression patterns
+
+**Basic Usage**:
+```python
+scvi.model.AmortizedLDA.setup_anndata(adata, layer="counts")
+model = scvi.model.AmortizedLDA(adata, n_topics=10)
+model.train()
+
+# Get topic compositions per cell
+topic_proportions = model.get_latent_representation()
+
+# Get gene loadings per topic (genes x topics)
+topic_gene_loadings = model.get_feature_by_topic()
+```
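+
+A gene-by-topic loading table is usually summarized by listing the top-weighted genes for each topic. The following sketch works on a plain nested dict with made-up weights; `top_genes_per_topic` is a hypothetical helper.
+
+```python
+def top_genes_per_topic(loadings, n_top=3):
+    """Rank genes within each topic by weight, given {gene: {topic: weight}}."""
+    by_topic = {}
+    for gene, weights in loadings.items():
+        for topic, weight in weights.items():
+            by_topic.setdefault(topic, []).append((weight, gene))
+    return {
+        topic: [gene for _, gene in sorted(pairs, reverse=True)[:n_top]]
+        for topic, pairs in by_topic.items()
+    }
+
+
+loadings = {
+    "GeneA": {"topic_0": 0.6, "topic_1": 0.1},
+    "GeneB": {"topic_0": 0.3, "topic_1": 0.2},
+    "GeneC": {"topic_0": 0.1, "topic_1": 0.7},
+}
+top = top_genes_per_topic(loadings, n_top=2)
+```
+
+The resulting gene lists can be passed to enrichment tools to attach a biological interpretation to each topic.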
+
+## Model Selection Guidelines
+
+**Choose scVI when**:
+- Starting with unsupervised analysis
+- Need batch correction and integration
+- Want normalized expression and DE analysis
+
+**Choose scANVI when**:
+- Have some labeled cells for training
+- Need cell type annotation
+- Want to transfer labels from reference to query
+
+**Choose AUTOZI when**:
+- Concerned about technical dropout
+- Need to identify zero-inflated genes
+- Working with very sparse datasets
+
+**Choose VeloVI when**:
+- Have spliced/unspliced count data
+- Interested in cellular dynamics
+- Need RNA velocity with uncertainty estimates
+
+**Choose contrastiveVI when**:
+- Analyzing perturbation experiments
+- Need to separate treatment effects
+- Want to identify condition-specific programs
+
+**Choose CellAssign when**:
+- Have marker gene lists available
+- Want probabilistic marker-based annotation
+- No reference dataset available
+
+**Choose Solo when**:
+- Need doublet detection
+- Already using scVI for analysis
+- Want probabilistic doublet scores