Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/scvi-tools/SKILL.md
+++ b/skills/scvi-tools/SKILL.md
@@ -0,0 +1,184 @@
+---
+name: scvi-tools
+description: This skill should be used when working with single-cell omics data analysis using scvi-tools, including scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics, and other single-cell modalities. Use this skill for probabilistic modeling, batch correction, dimensionality reduction, differential expression, cell type annotation, multimodal integration, and spatial analysis tasks.
+---
+
+# scvi-tools
+
+## Overview
+
+scvi-tools is a comprehensive Python framework for probabilistic models in single-cell genomics. Built on PyTorch and PyTorch Lightning, it provides deep generative models using variational inference for analyzing diverse single-cell data modalities.
+
+## When to Use This Skill
+
+Use this skill when:
+- Analyzing single-cell RNA-seq data (dimensionality reduction, batch correction, integration)
+- Working with single-cell ATAC-seq or chromatin accessibility data
+- Integrating multimodal data (CITE-seq, multiome, paired/unpaired datasets)
+- Analyzing spatial transcriptomics data (deconvolution, spatial mapping)
+- Performing differential expression analysis on single-cell data
+- Conducting cell type annotation or transfer learning tasks
+- Working with specialized single-cell modalities (methylation, cytometry, RNA velocity)
+- Building custom probabilistic models for single-cell analysis
+
+## Core Capabilities
+
+scvi-tools provides models organized by data modality:
+
+### 1. Single-Cell RNA-seq Analysis
+Core models for expression analysis, batch correction, and integration. See `references/models-scrna-seq.md` for:
+- **scVI**: Unsupervised dimensionality reduction and batch correction
+- **scANVI**: Semi-supervised cell type annotation and integration
+- **AUTOZI**: Zero-inflation detection and modeling
+- **VeloVI**: RNA velocity analysis
+- **contrastiveVI**: Perturbation effect isolation
+
+### 2. Chromatin Accessibility (ATAC-seq)
+Models for analyzing single-cell chromatin data. See `references/models-atac-seq.md` for:
+- **PeakVI**: Peak-based ATAC-seq analysis and integration
+- **PoissonVI**: Quantitative fragment count modeling
+- **scBasset**: Deep learning approach with motif analysis
+
+### 3. Multimodal & Multi-omics Integration
+Joint analysis of multiple data types. See `references/models-multimodal.md` for:
+- **totalVI**: CITE-seq protein and RNA joint modeling
+- **MultiVI**: Paired and unpaired multi-omic integration
+- **MrVI**: Multi-resolution cross-sample analysis
+
+### 4. Spatial Transcriptomics
+Spatially-resolved transcriptomics analysis. See `references/models-spatial.md` for:
+- **DestVI**: Multi-resolution spatial deconvolution
+- **Stereoscope**: Cell type deconvolution
+- **Tangram**: Spatial mapping and integration
+- **scVIVA**: Cell-environment relationship analysis
+
+### 5. Specialized Modalities
+Additional specialized analysis tools. See `references/models-specialized.md` for:
+- **MethylVI/MethylANVI**: Single-cell methylation analysis
+- **CytoVI**: Flow/mass cytometry batch correction
+- **Solo**: Doublet detection
+- **CellAssign**: Marker-based cell type annotation
+
+## Typical Workflow
+
+All scvi-tools models follow a consistent API pattern:
+
+```python
+# 1. Load and preprocess data (AnnData format)
+import scvi
+import scanpy as sc
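+adata = scvi.data.heart_cell_atlas_subsampled()
+sc.pp.filter_genes(adata, min_counts=3)
+adata.layers["counts"] = adata.X.copy()  # keep raw counts; seurat_v3 HVG selection needs scikit-misc
+sc.pp.highly_variable_genes(adata, n_top_genes=1200, subset=True, layer="counts", flavor="seurat_v3")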
+
+# 2. Register data with model (specify layers, covariates)
+scvi.model.SCVI.setup_anndata(
+    adata,
+    layer="counts",  # Use raw counts, not log-normalized
+    batch_key="cell_source",  # batch column in this dataset; use your own batch column otherwise
+    categorical_covariate_keys=["donor"],
+    continuous_covariate_keys=["percent_mito"]
+)
+
+# 3. Create and train model
+model = scvi.model.SCVI(adata)
+model.train()
+
+# 4. Extract latent representations and normalized values
+latent = model.get_latent_representation()
+normalized = model.get_normalized_expression(library_size=1e4)
+
+# 5. Store in AnnData for downstream analysis
+adata.obsm["X_scVI"] = latent
+adata.layers["scvi_normalized"] = normalized
+
+# 6. Downstream analysis with scanpy
+sc.pp.neighbors(adata, use_rep="X_scVI")
+sc.tl.umap(adata)
+sc.tl.leiden(adata)
+```
+
+**Key Design Principles:**
+- **Raw counts required**: Models expect unnormalized count data for optimal performance
+- **Unified API**: Consistent interface across all models (setup → train → extract)
+- **AnnData-centric**: Seamless integration with the scanpy ecosystem
+- **GPU acceleration**: Automatic utilization of available GPUs
+- **Batch correction**: Handle technical variation through covariate registration
+
+## Common Analysis Tasks
+
+### Differential Expression
+Probabilistic DE analysis using the learned generative models:
+
+```python
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="TypeA",
+    group2="TypeB",
+    mode="change",  # Use composite hypothesis testing
+    delta=0.25      # Minimum effect size threshold
+)
+```
+
+See `references/differential-expression.md` for detailed methodology and interpretation.
+
+### Model Persistence
+Save and load trained models:
+
+```python
+# Save model
+model.save("./model_directory", overwrite=True)
+
+# Load model
+model = scvi.model.SCVI.load("./model_directory", adata=adata)
+```
+
+### Batch Correction and Integration
+Integrate datasets across batches or studies:
+
+```python
+# Register batch information
+scvi.model.SCVI.setup_anndata(adata, batch_key="study")
+
+# Model automatically learns batch-corrected representations
+model = scvi.model.SCVI(adata)
+model.train()
+latent = model.get_latent_representation()  # Batch-corrected
+```
+
+## Theoretical Foundations
+
+scvi-tools is built on:
+- **Variational inference**: Approximate posterior distributions for scalable Bayesian inference
+- **Deep generative models**: VAE architectures that learn complex data distributions
+- **Amortized inference**: Shared neural networks for efficient learning across cells
+- **Probabilistic modeling**: Principled uncertainty quantification and statistical testing
+
+See `references/theoretical-foundations.md` for detailed background on the mathematical framework.
+
+## Additional Resources
+
+- **Workflows**: `references/workflows.md` contains common workflows, best practices, hyperparameter tuning, and GPU optimization
+- **Model References**: Detailed documentation for each model category in the `references/` directory
+- **Official Documentation**: https://docs.scvi-tools.org/en/stable/
+- **Tutorials**: https://docs.scvi-tools.org/en/stable/tutorials/index.html
+- **API Reference**: https://docs.scvi-tools.org/en/stable/api/index.html
+
+## Installation
+
+```bash
+uv pip install scvi-tools
+# For GPU support
+uv pip install "scvi-tools[cuda]"
+```
+
+## Best Practices
+
+1. **Use raw counts**: Always provide unnormalized count data to models
+2. **Filter genes**: Remove low-count genes before analysis (e.g., `min_counts=3`)
+3. **Register covariates**: Include known technical factors (batch, donor, etc.) in `setup_anndata`
+4. **Feature selection**: Use highly variable genes for improved performance
+5. **Model saving**: Always save trained models to avoid retraining
+6. **GPU usage**: Enable GPU acceleration for large datasets (`accelerator="gpu"`)
+7. **Scanpy integration**: Store outputs in AnnData objects for downstream analysis
--- a/skills/scvi-tools/references/differential-expression.md
+++ b/skills/scvi-tools/references/differential-expression.md
@@ -0,0 +1,581 @@
+# Differential Expression Analysis in scvi-tools
+
+This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework.
+
+## Overview
+
+scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods:
+
+- **Batch correction**: DE testing on batch-corrected representations
+- **Uncertainty quantification**: Probabilistic estimates of effect sizes
+- **Zero-inflation handling**: Proper modeling of dropout and zeros
+- **Flexible comparisons**: Between any groups or cell types
+- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI)
+
+## Core Statistical Framework
+
+### Problem Definition
+
+The goal is to estimate the log fold-change in expression between two conditions:
+
+```
+log fold-change = log(μ_B) - log(μ_A)
+```
+
+Where μ_A and μ_B are the mean expression levels in conditions A and B.
+
+### Three-Stage Process
+
+**Stage 1: Estimating Expression Levels**
+- Sample from posterior distribution of cellular states
+- Generate expression values from the learned generative model
+- Aggregate across cells to get population-level estimates
+
+**Stage 2: Detecting Relevant Features (Hypothesis Testing)**
+- Test for differential expression using Bayesian framework
+- Two testing modes available:
+  - **"vanilla" mode**: Point null hypothesis (β = 0)
+  - **"change" mode**: Composite hypothesis (|β| ≤ δ)
+
+**Stage 3: Controlling False Discovery**
+- Posterior expected False Discovery Proportion (FDP) control
+- Selects maximum number of discoveries ensuring E[FDP] ≤ α
+
+## Basic Usage
+
+### Simple Two-Group Comparison
+
+```python
+import scvi
+
+# After training a model
+model = scvi.model.SCVI(adata)
+model.train()
+
+# Compare two cell types
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# View top DE genes
+top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20)
+print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]])
+```
+
+### One vs. Rest Comparison
+
+```python
+# Compare one group against all others
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells"  # No group2 = compare to rest
+)
+```
+
+### All Pairwise Comparisons
+
+```python
+# Compare all cell types pairwise
+all_comparisons = {}
+
+cell_types = adata.obs["cell_type"].unique()
+
+for ct1 in cell_types:
+    for ct2 in cell_types:
+        if ct1 != ct2:
+            key = f"{ct1}_vs_{ct2}"
+            all_comparisons[key] = model.differential_expression(
+                groupby="cell_type",
+                group1=ct1,
+                group2=ct2
+            )
+```
+
+## Key Parameters
+
+### `groupby` (required)
+Column in `adata.obs` defining groups to compare.
+
+```python
+# Must be a categorical variable; with no group1, each category is tested against the rest
+de_results = model.differential_expression(groupby="cell_type")
+```
+
+### `group1` and `group2`
+Groups to compare. If `group2` is None, compares `group1` to all others.
+
+```python
+# Specific comparison
+de = model.differential_expression(groupby="condition", group1="treated", group2="control")
+
+# One vs rest
+de = model.differential_expression(groupby="cell_type", group1="T cells")
+```
+
+### `mode` (Hypothesis Testing Mode)
+
+**"vanilla" mode**: Point null hypothesis
+- Tests if β = 0 exactly
+- More sensitive, but may flag trivially small effects
+
+**"change" mode** (default): Composite null hypothesis
+- Tests if |β| ≤ δ
+- Requires biologically meaningful change
+- Reduces false discoveries of tiny effects
+
+```python
+# Change mode with minimum effect size
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells",
+    mode="change",
+    delta=0.25  # Minimum log fold-change
+)
+```
+
+### `delta`
+Minimum effect size threshold for "change" mode (default: 0.25).
+- Typical values: 0.25, 0.5, 0.7 (log2 fold-change scale)
+- log2(1.5) ≈ 0.58 (1.5-fold change)
+- log2(2) = 1.0 (2-fold change)
+
+```python
+# Require at least 1.5-fold change
+de = model.differential_expression(
+    groupby="condition",
+    group1="disease",
+    group2="healthy",
+    mode="change",
+    delta=0.58  # log2(1.5)
+)
+```
+
+### `fdr_target`
+False discovery rate threshold (default: 0.05)
+
+```python
+# More stringent FDR control; the flag column is then named is_de_fdr_0.01
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    fdr_target=0.01
+)
+```
+
+### `batch_correction`
+Whether to correct for batch effects during DE testing by integrating over batches when decoding expression (default: False)
+
+```python
+# Compare cell types while marginalizing over batches
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells",
+    batch_correction=True
+)
+```
+
+### `n_samples`
+Number of posterior samples for estimation (default: 5000)
+- More samples = more accurate but slower
+- Reduce for speed, increase for precision
+
+```python
+# High precision analysis
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    n_samples=10000
+)
+```
+
+## Interpreting Results
+
+### Output Columns
+
+The results DataFrame contains several important columns:
+
+**Effect Size Estimates**:
+- `lfc_mean`: Mean log fold-change
+- `lfc_median`: Median log fold-change
+- `lfc_std`: Standard deviation of log fold-change
+- `lfc_min`: Minimum log fold-change across posterior samples
+- `lfc_max`: Maximum log fold-change across posterior samples
+
+**Statistical Significance**:
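+- `bayes_factor`: Natural-log Bayes factor in favor of differential expression
+  - Higher values = stronger evidence; negative values favor the null
+  - Values above about 3 are often treated as strong evidence
+- `is_de_fdr_<fdr_target>`: Boolean DE call at the requested FDR (`is_de_fdr_0.05` with the default `fdr_target=0.05`)
+- `proba_de`: Posterior probability that the gene is differentially expressed ("change" mode)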
+
+**Expression Levels**:
+- `scale1`, `scale2`: Model-estimated normalized expression in group 1 and group 2
+- `raw_mean1`, `raw_mean2`: Observed mean counts in group 1 and group 2
+- `non_zeros_proportion1`: Proportion of non-zero cells in group 1
+- `non_zeros_proportion2`: Proportion of non-zero cells in group 2
+
+### Example Interpretation
+
+```python
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# Find significantly upregulated genes in T cells
+upreg_tcells = de_results[
+    (de_results["is_de_fdr_0.05"]) &
+    (de_results["lfc_mean"] > 0)
+].sort_values("lfc_mean", ascending=False)
+
+print(f"Upregulated genes in T cells: {len(upreg_tcells)}")
+print(upreg_tcells.head(10))
+
+# Find genes with large effect sizes
+large_effect = de_results[
+    (de_results["is_de_fdr_0.05"]) &
+    (abs(de_results["lfc_mean"]) > 1)  # 2-fold change
+]
+```
+
+## Advanced Usage
+
+### DE Within Specific Cells
+
+```python
+# Test DE only within a subset of cells
+subset_indices = adata.obs["tissue"] == "lung"
+
+de = model.differential_expression(
+    idx1=(adata.obs["cell_type"] == "T cells") & subset_indices,  # parentheses needed: & binds tighter than ==
+    idx2=(adata.obs["cell_type"] == "B cells") & subset_indices
+)
+```
+
+### Batch-Specific DE
+
+```python
+# Test DE within each batch separately
+batches = adata.obs["batch"].unique()
+
+batch_de_results = {}
+for batch in batches:
+    batch_idx = adata.obs["batch"] == batch
+    batch_de_results[batch] = model.differential_expression(
+        idx1=(adata.obs["condition"] == "treated") & batch_idx,
+        idx2=(adata.obs["condition"] == "control") & batch_idx
+    )
+```
+
+### Pseudo-bulk DE
+
+```python
+# Note: this call does not aggregate counts into pseudo-bulk profiles;
+# it draws more posterior samples to stabilize estimates for small groups
+
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="rare_cell_type",
+    group2="common_cell_type",
+    n_samples=10000,  # More samples for stability
+    batch_correction=True
+)
+```
+
+## Visualization
+
+### Volcano Plot
+
+```python
+import matplotlib.pyplot as plt
+import numpy as np
+
+de = model.differential_expression(
+    groupby="condition",
+    group1="treated",
+    group2="control"
+)
+
+# Volcano plot
+plt.figure(figsize=(10, 6))
+plt.scatter(
+    de["lfc_mean"],
+    -np.log10(de["proba_not_de"] + 1e-10),  # posterior probability of no DE ("change" mode)
+    c=de["is_de_fdr_0.05"].astype(int),
+    cmap="coolwarm",
+    alpha=0.5
+)
+plt.xlabel("Log Fold Change")
+plt.ylabel("-log10(P(not DE))")
+plt.title("Volcano Plot: Treated vs Control")
+plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
+plt.show()
+```
+
+### Heatmap of Top DE Genes
+
+```python
+import seaborn as sns
+
+# Get top DE genes
+top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index
+
+# Get normalized expression
+norm_expr = model.get_normalized_expression(
+    adata,
+    indices=adata.obs["condition"].isin(["treated", "control"]),
+    gene_list=top_genes
+)
+
+# Plot heatmap
+plt.figure(figsize=(12, 10))
+sns.heatmap(
+    norm_expr.T,
+    cmap="viridis",
+    xticklabels=False,
+    yticklabels=top_genes
+)
+plt.title("Top 50 DE Genes")
+plt.show()
+```
+
+### Ranked Gene Plot
+
+```python
+# Plot genes ranked by effect size
+de_sorted = de.sort_values("lfc_mean", ascending=False)
+
+plt.figure(figsize=(12, 6))
+plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values)
+plt.axhline(y=0, color='r', linestyle='--')
+plt.xlabel("Gene Rank")
+plt.ylabel("Log Fold Change")
+plt.title("Genes Ranked by Effect Size")
+plt.show()
+```
+
+## Comparison with Traditional Methods
+
+### scvi-tools vs. Wilcoxon Test
+
+```python
+import scanpy as sc
+
+# Traditional Wilcoxon test
+sc.tl.rank_genes_groups(
+    adata,
+    groupby="cell_type",
+    method="wilcoxon",
+    key_added="wilcoxon"
+)
+
+# scvi-tools DE
+de_scvi = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells"
+)
+
+# Compare results
+wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon")
+```
+
+**Advantages of scvi-tools**:
+- Accounts for batch effects automatically
+- Handles zero-inflation properly
+- Provides uncertainty quantification
+- No arbitrary pseudocount needed
+- Can test against an effect-size threshold (`mode="change"`) rather than only a point null
+
+**When to use Wilcoxon**:
+- Very quick exploratory analysis
+- Comparison with published results using Wilcoxon
+
+## Multi-Modal DE
+
+### Protein DE (totalVI)
+
+```python
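+# Train totalVI on CITE-seq data (assumes TOTALVI.setup_anndata was already run on adata)
+totalvi_model = scvi.model.TOTALVI(adata)
+totalvi_model.train()
+
+# totalVI tests genes and proteins in one call and returns
+# a single DataFrame indexed by feature name
+de_all = totalvi_model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# Split the results into RNA and protein features by name.
+# Assumes proteins were stored as a DataFrame in adata.obsm["protein_expression"]
+# and that protein names do not collide with gene names
+protein_names = adata.obsm["protein_expression"].columns
+rna_de = de_all.loc[de_all.index.isin(adata.var_names)]
+protein_de = de_all.loc[de_all.index.isin(protein_names)]
+
+# Count features flagged at the default FDR target of 0.05
+print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}")
+print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}")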
+```
+
+### Differential Accessibility (PeakVI)
+
+```python
+# Train PeakVI on ATAC-seq data
+peakvi_model = scvi.model.PEAKVI(atac_adata)
+peakvi_model.train()
+
+# Differential accessibility
+da = peakvi_model.differential_accessibility(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# Effect sizes here are differences in estimated accessibility probability, not log fold-changes
+```
+
+## Handling Special Cases
+
+### Low Cell Count Groups
+
+```python
+# Increase posterior samples for stability
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="rare_type",  # e.g., 50 cells
+    group2="common_type",  # e.g., 5000 cells
+    n_samples=10000
+)
+```
+
+### Imbalanced Comparisons
+
+```python
+# When groups have very different sizes
+# Use change mode to avoid tiny effects
+
+de = model.differential_expression(
+    groupby="condition",
+    group1="rare_condition",
+    group2="common_condition",
+    mode="change",
+    delta=0.5
+)
+```
+
+### Multiple Testing Correction
+
+```python
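+# Already handled: the is_de_fdr_* flags control the posterior expected FDP.
+# Bayes factors and posterior probabilities are not p-values, so p-value
+# corrections such as Bonferroni should not be applied to them
+
+# For more conservative calls, lower fdr_target or additionally require
+# a high posterior probability of DE (the cutoff below is illustrative)
+strict_de = de[
+    (de["is_de_fdr_0.05"]) &
+    (de["proba_de"] > 0.99)
+]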
+```
+
+## Performance Considerations
+
+### Speed Optimization
+
+```python
+# Faster DE testing for large datasets
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    n_samples=1000,  # Reduce samples
+    batch_size=512    # Increase batch size
+)
+```
+
+### Memory Management
+
+```python
+# For very large datasets
+# Test one comparison at a time rather than all pairwise
+
+cell_types = adata.obs["cell_type"].unique()
+for ct in cell_types:
+    de = model.differential_expression(
+        groupby="cell_type",
+        group1=ct
+    )
+    # Save results
+    de.to_csv(f"de_results_{ct}.csv")
+```
+
+## Best Practices
+
+1. **Use "change" mode**: For biologically interpretable results
+2. **Set appropriate delta**: Based on biological significance
+3. **Check expression levels**: Filter lowly expressed genes
+4. **Validate findings**: Check marker genes for sanity
+5. **Visualize results**: Always plot top DE genes
+6. **Report parameters**: Document mode, delta, FDR used
+7. **Consider batch effects**: Use batch_correction=True
+8. **Multiple comparisons**: Be aware of testing many groups
+9. **Sample size**: Ensure sufficient cells per group (>50 recommended)
+10. **Biological validation**: Follow up with functional experiments
+
+## Example: Complete DE Analysis Workflow
+
+```python
+import scvi
+import scanpy as sc
+import matplotlib.pyplot as plt
+
+# 1. Train model
+scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
+model = scvi.model.SCVI(adata)
+model.train()
+
+# 2. Perform DE analysis
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="Disease_T_cells",
+    group2="Healthy_T_cells",
+    mode="change",
+    delta=0.5,
+    fdr_target=0.05
+)
+
+# 3. Filter and analyze
+sig_genes = de_results[de_results["is_de_fdr_0.05"]]
+upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False)
+downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean")
+
+print(f"Significant genes: {len(sig_genes)}")
+print(f"Upregulated: {len(upreg)}")
+print(f"Downregulated: {len(downreg)}")
+
+# 4. Visualize top genes
+top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist()
+
+sc.pl.violin(
+    adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])],
+    keys=top_genes,
+    groupby="cell_type",
+    rotation=90
+)
+
+# 5. Functional enrichment (using external tools)
+# E.g., g:Profiler, DAVID, or gprofiler-official Python package
+upreg_genes = upreg.head(100).index.tolist()
+# Perform pathway analysis...
+
+# 6. Save results
+de_results.to_csv("de_results_disease_vs_healthy.csv")
+upreg.to_csv("upregulated_genes.csv")
+downreg.to_csv("downregulated_genes.csv")
+```
--- a/skills/scvi-tools/references/models-atac-seq.md
+++ b/skills/scvi-tools/references/models-atac-seq.md
@@ -0,0 +1,321 @@
+# ATAC-seq and Chromatin Accessibility Models
+
+This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.
+
+## PeakVI
+
+**Purpose**: Analysis and integration of single-cell ATAC-seq data using peak counts.
+
+**Key Features**:
+- Variational autoencoder specifically designed for scATAC-seq peak data
+- Learns low-dimensional representations of chromatin accessibility
+- Performs batch correction across samples
+- Enables differential accessibility testing
+- Integrates multiple ATAC-seq datasets
+
+**When to Use**:
+- Analyzing scATAC-seq peak count matrices
+- Integrating multiple ATAC-seq experiments
+- Batch correction of chromatin accessibility data
+- Dimensionality reduction for ATAC-seq
+- Differential accessibility analysis between cell types or conditions
+
+**Data Requirements**:
+- Peak count matrix (cells × peaks)
+- Binary or count data for peak accessibility
+- Batch/sample annotations (optional, for batch correction)
+
+**Basic Usage**:
+```python
+import scanpy as sc
+import scvi
+
+# Prepare data (peaks should be in adata.X); optionally filter rare peaks
+sc.pp.filter_genes(adata, min_cells=3)
+
+# Setup data
+scvi.model.PEAKVI.setup_anndata(
+    adata,
+    batch_key="batch"
+)
+
+# Train model
+model = scvi.model.PEAKVI(adata)
+model.train()
+
+# Get latent representation (batch-corrected)
+latent = model.get_latent_representation()
+adata.obsm["X_PeakVI"] = latent
+
+# Differential accessibility
+da_results = model.differential_accessibility(
+    groupby="cell_type",
+    group1="TypeA",
+    group2="TypeB"
+)
+```
+
+**Key Parameters**:
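+- `n_latent`: Dimensionality of latent space (default: `None`, derived as the square root of `n_hidden`)
+- `n_hidden`: Number of nodes per hidden layer (default: `None`, derived as the square root of the number of regions)
+- `n_layers_encoder` / `n_layers_decoder`: Number of hidden layers in the encoder and decoder (default: 2 each)
+- `region_factors`: Whether to learn region-specific factors (default: True)
+- `latent_distribution`: Distribution for latent space ("normal" or "ln"; default: "normal")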
+
+**Outputs**:
+- `get_latent_representation()`: Low-dimensional embeddings for cells
+- `get_accessibility_estimates()`: Normalized accessibility values
+- `differential_accessibility()`: Statistical testing for differential peaks
+- `get_region_factors()`: Peak-specific scaling factors
+
+**Best Practices**:
+1. Filter out low-quality peaks (present in very few cells)
+2. Include batch information if integrating multiple samples
+3. Use latent representations for clustering and UMAP visualization
+4. Keep `region_factors=True` (the default) for datasets with high technical variation
+5. Store latent embeddings in `adata.obsm` for downstream analysis with scanpy
+
+## PoissonVI
+
+**Purpose**: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts).
+
+**Key Features**:
+- Models fragment counts directly (not just peak presence/absence)
+- Poisson distribution for count data
+- Captures quantitative differences in accessibility
+- Enables fine-grained analysis of chromatin state
+
+**When to Use**:
+- Analyzing fragment-level ATAC-seq data
+- Need quantitative accessibility measurements
+- Higher resolution analysis than binary peak calls
+- Investigating gradual changes in chromatin accessibility
+
+**Data Requirements**:
+- Fragment count matrix (cells × genomic regions)
+- Count data (not binary)
+
+**Basic Usage**:
+```python
+scvi.external.POISSONVI.setup_anndata(
+    adata,
+    batch_key="batch"
+)
+
+model = scvi.external.POISSONVI(adata)
+model.train()
+
+# Get results
+latent = model.get_latent_representation()
+accessibility = model.get_accessibility_estimates()
+```
+
+**Key Differences from PeakVI**:
+- **PeakVI**: Best for standard peak count matrices, faster
+- **PoissonVI**: Best for quantitative fragment counts, more detailed
+
+**When to Choose PoissonVI over PeakVI**:
+- Working with fragment counts rather than called peaks
+- Need to capture quantitative differences
+- Have high-quality, high-coverage data
+- Interested in subtle accessibility changes
+
+## scBasset
+
+**Purpose**: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.
+
+**Key Features**:
+- Convolutional neural network (CNN) architecture for sequence-based analysis
+- Models raw DNA sequences, not just peak counts
+- Enables motif discovery and transcription factor (TF) binding prediction
+- Provides interpretable feature importance
+- Performs batch correction
+
+**When to Use**:
+- Want to incorporate DNA sequence information
+- Interested in TF motif analysis
+- Need interpretable models (which sequences drive accessibility)
+- Analyzing regulatory elements and TF binding sites
+- Predicting accessibility from sequence alone
+
+**Data Requirements**:
+- Peak sequences (extracted from genome)
+- Peak accessibility matrix
+- Genome reference (for sequence extraction)
+
+**Basic Usage**:
+```python
+# scBasset needs the DNA sequence under each peak; adata.var must hold peak coordinates
+scvi.data.add_dna_sequence(
+    adata,
+    genome_name="hg38",
+    genome_dir="data",  # placeholder download directory
+    chr_var_key="chr",  # assumed column names in adata.var
+    start_var_key="start",
+    end_var_key="end",
+)
+
+# The model treats regions as observations: transpose, then binarize accessibility
+bdata = adata.transpose()
+bdata.layers["binary"] = (bdata.X.copy() > 0).astype(float)
+scvi.external.SCBASSET.setup_anndata(bdata, layer="binary", dna_code_key="dna_code")
+model = scvi.external.SCBASSET(bdata)
+model.train()
+
+# Cell embeddings, one row per cell of the original adata
+latent = model.get_latent_representation()
+```
+
+**Key Parameters**:
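+- `n_bottleneck_layer`: Size of the bottleneck layer, which sets the cell-embedding dimensionality (default: 32)
+- `l2_reg_cell_embedding`: L2 regularization strength on the cell embeddings (default: 0.0)
+- `dna_code_key` (in `setup_anndata`): Key under which the integer-encoded DNA sequences are stored
+- `**model_kwargs`: Passed through to the underlying network module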
+
+**Advanced Features**:
+- **In silico mutagenesis**: Predict how sequence changes affect accessibility
+- **Motif enrichment**: Identify enriched TF motifs in accessible regions
+- **Batch correction**: Similar to other scvi-tools models
+- **Transfer learning**: Fine-tune on new datasets
+
+**Interpretability Tools**:
+```python
+# Per-cell TF activity via motif insertion (the TF name and motif_dir here are placeholders)
+tf_activity = model.get_tf_activity(tf="PAX5", genome="hg38", motif_dir="data/motifs")
+
+# Per-cell bias term learned by the model (related to sequencing depth)
+cell_bias = model.get_cell_bias()
+```
+
+## Model Selection for ATAC-seq
+
+### PeakVI
+**Choose when**:
+- Standard scATAC-seq analysis workflow
+- Have peak count matrices (most common format)
+- Need fast, efficient batch correction
+- Want straightforward differential accessibility
+- Prioritize computational efficiency
+
+**Advantages**:
+- Fast training and inference
+- Proven track record for scATAC-seq
+- Easy integration with scanpy workflow
+- Robust batch correction
+
+### PoissonVI
+**Choose when**:
+- Have fragment-level count data
+- Need quantitative accessibility measures
+- Interested in subtle differences
+- Have high-coverage, high-quality data
+
+**Advantages**:
+- More detailed quantitative information
+- Better for gradient changes
+- Appropriate statistical model for counts
+
+### scBasset
+**Choose when**:
+- Want to incorporate DNA sequence
+- Need biological interpretation (motifs, TFs)
+- Interested in regulatory mechanisms
+- Have computational resources for CNN training
+- Want predictive power for new sequences
+
+**Advantages**:
+- Sequence-based, biologically interpretable
+- Motif and TF analysis built-in
+- Predictive modeling capabilities
+- In silico perturbation experiments
+
+## Workflow Example: Complete ATAC-seq Analysis
+
+```python
+import scvi
+import scanpy as sc
+
+# 1. Load and preprocess ATAC-seq data
+adata = sc.read_h5ad("atac_data.h5ad")
+
+# 2. Filter low-quality peaks
+sc.pp.filter_genes(adata, min_cells=10)
+
+# 3. Setup and train PeakVI
+scvi.model.PEAKVI.setup_anndata(
+    adata,
+    batch_key="sample"
+)
+
+model = scvi.model.PEAKVI(adata, n_latent=20)
+model.train(max_epochs=400)
+
+# 4. Extract latent representation
+latent = model.get_latent_representation()
+adata.obsm["X_PeakVI"] = latent
+
+# 5. Downstream analysis
+sc.pp.neighbors(adata, use_rep="X_PeakVI")
+sc.tl.umap(adata)
+sc.tl.leiden(adata, key_added="clusters")
+
+# 6. Differential accessibility
+da_results = model.differential_accessibility(
+    groupby="clusters",
+    group1="0",
+    group2="1"
+)
+
+# 7. Save model
+model.save("peakvi_model")
+```
+
+## Integration with Gene Expression (RNA+ATAC)
+
+For paired multimodal data (RNA+ATAC from same cells), use **MultiVI** instead:
+
+```python
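+# For 10x Multiome or similar paired data (features ordered genes first, then peaks)
+scvi.model.MULTIVI.setup_anndata(adata, batch_key="sample")
+
+# n_genes/n_regions are required; the "modality" column in adata.var is an assumed annotation
+model = scvi.model.MULTIVI(
+    adata,
+    n_genes=(adata.var["modality"] == "Gene Expression").sum(),
+    n_regions=(adata.var["modality"] == "Peaks").sum(),
+)
+model.train()
+
+latent = model.get_latent_representation()  # joint latent space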
+```
+
+See `models-multimodal.md` for more details on multimodal integration.
+
+## Best Practices for ATAC-seq Analysis
+
+1. **Quality Control**:
+   - Filter cells with very low or very high peak counts
+   - Remove peaks present in very few cells
+   - Filter mitochondrial and sex chromosome peaks if needed
+
+2. **Batch Correction**:
+   - Always include `batch_key` if integrating multiple samples
+   - Consider technical covariates (sequencing depth, TSS enrichment)
+
+3. **Feature Selection**:
+   - Unlike RNA-seq, all peaks are often used
+   - Consider filtering very rare peaks for efficiency
+
+4. **Latent Dimensions**:
+   - Start with `n_latent` between 10 and 30, depending on dataset complexity
+   - Larger values for more heterogeneous datasets
+
+5. **Downstream Analysis**:
+   - Use latent representations for clustering and visualization
+   - Link peaks to genes for regulatory analysis
+   - Perform motif enrichment on cluster-specific peaks
+
+6. **Computational Considerations**:
+   - ATAC-seq matrices are often very large (many peaks)
+   - Consider downsampling peaks for initial exploration
+   - Use GPU acceleration for large datasets
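
The quality-control bullets above (drop peaks seen in very few cells, drop cells with too few or too many peaks) can be sketched on a plain count matrix. This is a minimal illustration, not part of scvi-tools: the function name `filter_peaks_and_cells` and all thresholds are hypothetical and should be tuned per dataset.

```python
import numpy as np

def filter_peaks_and_cells(counts, min_cells=10, min_peaks=1, max_peaks=None):
    """Return boolean masks (keep_cells, keep_peaks) for a cells x peaks matrix."""
    detected = counts > 0                          # binarize: accessible or not
    keep_peaks = detected.sum(axis=0) >= min_cells  # peaks seen in enough cells
    peaks_per_cell = detected[:, keep_peaks].sum(axis=1)
    keep_cells = peaks_per_cell >= min_peaks        # drop near-empty cells
    if max_peaks is not None:
        keep_cells &= peaks_per_cell <= max_peaks   # drop suspiciously rich cells
    return keep_cells, keep_peaks

# Toy example: 4 cells x 3 peaks
toy = np.array([
    [1, 0, 2],
    [1, 0, 0],
    [3, 0, 1],
    [0, 1, 0],
])
keep_cells, keep_peaks = filter_peaks_and_cells(toy, min_cells=2, min_peaks=1)
filtered = toy[keep_cells][:, keep_peaks]
```

With an AnnData object the same masks could be applied as `adata[keep_cells, keep_peaks].copy()` before calling `setup_anndata`.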
--- a/skills/scvi-tools/references/models-multimodal.md
+++ b/skills/scvi-tools/references/models-multimodal.md
@@ -0,0 +1,367 @@
+# Multimodal and Multi-omics Integration Models
+
+This document covers models for joint analysis of multiple data modalities in scvi-tools.
+
+## totalVI (Total Variational Inference)
+
+**Purpose**: Joint analysis of CITE-seq data (simultaneous RNA and protein measurements from the same cells).
+
+**Key Features**:
+- Jointly models gene expression and protein abundance
+- Learns shared low-dimensional representations
+- Enables protein imputation from RNA data
+- Performs differential expression for both modalities
+- Handles batch effects in both RNA and protein layers
+
+**When to Use**:
+- Analyzing CITE-seq or REAP-seq data
+- Joint RNA + surface protein measurements
+- Imputing missing proteins
+- Integrating protein and RNA information
+- Multi-batch CITE-seq integration
+
+**Data Requirements**:
+- AnnData with gene expression in `.X` or a layer
+- Protein measurements in `.obsm["protein_expression"]`
+- Same cells measured for both modalities
+
+**Basic Usage**:
+```python
+import scvi
+
+# Setup data - specify both RNA and protein layers
+scvi.model.TOTALVI.setup_anndata(
+    adata,
+    layer="counts",  # RNA counts
+    protein_expression_obsm_key="protein_expression",  # Protein counts
+    batch_key="batch"
+)
+
+# Train model
+model = scvi.model.TOTALVI(adata)
+model.train()
+
+# Get joint latent representation
+latent = model.get_latent_representation()
+
+# Get normalized values for both modalities
+# (returns a tuple: RNA first, then protein)
+rna_normalized, protein_normalized = model.get_normalized_expression(
+    n_samples=25,
+    return_mean=True
+)
+
+# Differential expression: a single table covering genes and proteins
+de_results = model.differential_expression(groupby="cell_type")
+```
+
+**Key Parameters**:
+- `n_latent`: Latent space dimensionality (default: 20)
+- `n_layers_encoder`: Number of encoder layers (default: 2)
+- `n_layers_decoder`: Number of decoder layers (default: 1)
+- `protein_dispersion`: Protein dispersion handling ("protein", "protein-batch", or "protein-label")
+- `empirical_protein_background_prior`: Use empirical background for proteins
+
+**Advanced Features**:
+
+**Protein Imputation**:
+```python
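+# Foreground probability: how likely each protein count is real signal
+# rather than background
+protein_foreground = model.get_protein_foreground_probability()
+
+# Denoised protein estimates (second element of the returned tuple);
+# for cells without protein measurements these act as imputed values
+# (useful for mapping RNA-seq to a CITE-seq reference)
+_, imputed_proteins = model.get_normalized_expression(
+    n_samples=25,
+    return_mean=True
+)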
+```
+
+**Denoising**:
+```python
+# Get denoised values for both modalities in one call
+denoised_rna, denoised_protein = model.get_normalized_expression(
+    n_samples=25,
+    return_mean=True
+)
+```
+
+**Best Practices**:
+1. Use empirical protein background prior for datasets with ambient protein
+2. Consider protein-specific dispersion for heterogeneous protein data
+3. Use joint latent space for clustering (better than RNA alone)
+4. Validate protein imputation with known markers
+5. Check protein QC metrics before training
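
As a rough illustration of the last point, protein QC can start from two per-antibody summaries: total counts and the fraction of cells with any signal. The sketch below is an assumption-laden example, not a totalVI feature; `flag_low_quality_proteins` and both thresholds are hypothetical.

```python
import numpy as np

def flag_low_quality_proteins(protein_counts, min_total=100, min_frac_cells=0.05):
    """Flag proteins (columns) with very low total counts or rare detection."""
    total = protein_counts.sum(axis=0)             # total counts per protein
    frac = (protein_counts > 0).mean(axis=0)       # fraction of cells detected
    return (total < min_total) | (frac < min_frac_cells)

# Toy example: 4 cells x 3 proteins
toy = np.array([
    [50, 0, 3],
    [80, 0, 0],
    [40, 1, 2],
    [90, 0, 0],
])
flags = flag_low_quality_proteins(toy, min_total=5, min_frac_cells=0.5)
```

Flagged columns could then be dropped from `adata.obsm["protein_expression"]` before `setup_anndata` is called.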
+
+## MultiVI (Multi-modal Variational Inference)
+
+**Purpose**: Integration of paired and unpaired multi-omic data (e.g., RNA + ATAC, paired and unpaired cells).
+
+**Key Features**:
+- Handles paired data (same cells) and unpaired data (different cells)
+- Integrates multiple modalities: RNA, ATAC, proteins, etc.
+- Missing modality imputation
+- Learns shared representations across modalities
+- Flexible integration strategy
+
+**When to Use**:
+- 10x Multiome data (paired RNA + ATAC)
+- Integrating separate RNA-seq and ATAC-seq experiments
+- Some cells with both modalities, some with only one
+- Cross-modality imputation tasks
+
+**Data Requirements**:
+- AnnData with multiple modalities
+- Modality indicators (which measurements each cell has)
+- Can handle:
+  - All cells with both modalities (fully paired)
+  - Mix of paired and unpaired cells
+  - Completely unpaired datasets
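
To make the three cases concrete, the sketch below derives a per-cell modality label from a combined genes-plus-peaks matrix by checking which feature blocks contain any counts. It is only an illustration: the helper name is hypothetical, and the label strings ("paired", "expression", "accessibility") are an assumed convention.

```python
import numpy as np

def label_cell_modalities(counts, is_gene):
    """Label each cell by which feature blocks (genes / peaks) have counts."""
    has_rna = counts[:, is_gene].sum(axis=1) > 0
    has_atac = counts[:, ~is_gene].sum(axis=1) > 0
    labels = np.full(counts.shape[0], "empty", dtype=object)
    labels[has_rna & has_atac] = "paired"
    labels[has_rna & ~has_atac] = "expression"
    labels[~has_rna & has_atac] = "accessibility"
    return labels

# Two gene columns followed by two peak columns
is_gene = np.array([True, True, False, False])
toy = np.array([
    [5, 1, 2, 0],   # RNA and ATAC
    [3, 0, 0, 0],   # RNA only
    [0, 0, 1, 1],   # ATAC only
])
labels = label_cell_modalities(toy, is_gene)
```

Counting the labels is a quick way to see how balanced the paired and unpaired groups are before training.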
+
+**Basic Usage**:
+```python
+# Prepare data with modality information
+# adata.X should contain all features (genes first, then peaks)
+# adata.var["modality"] indicates "Gene Expression" or "Peaks"
+# adata.obs["modality"] indicates which modality each cell has
+
+scvi.model.MULTIVI.setup_anndata(
+    adata,
+    batch_key="modality",  # Column indicating cell modality
+    categorical_covariate_keys=["batch"]  # Experimental batch as a covariate
+)
+
+model = scvi.model.MULTIVI(
+    adata,
+    n_genes=(adata.var["modality"] == "Gene Expression").sum(),
+    n_regions=(adata.var["modality"] == "Peaks").sum()
+)
+model.train()
+
+# Get joint latent representation
+latent = model.get_latent_representation()
+
+# Impute missing modalities
+# E.g., predict ATAC for RNA-only cells
+imputed_accessibility = model.get_accessibility_estimates(
+    indices=rna_only_indices
+)
+
+# Get normalized expression/accessibility
+rna_normalized = model.get_normalized_expression()
+atac_normalized = model.get_accessibility_estimates()
+```
+
+**Key Parameters**:
+- `n_genes`: Number of gene features (required)
+- `n_regions`: Number of accessibility regions (required)
+- `n_latent`: Latent dimensionality (default: `None`, in which case it is derived from `n_hidden`)
+
+**Integration Scenarios**:
+
+**Scenario 1: Fully Paired (10x Multiome)**:
+```python
+# All cells have both RNA and ATAC
+# Single modality key: "paired"
+adata.obs["modality"] = "paired"
+```
+
+**Scenario 2: Partially Paired**:
+```python
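+# Some cells have both, some RNA-only, some ATAC-only
+# organize_multiome_anndatas labels cells "paired", "expression", or "accessibility"
+adata = scvi.data.organize_multiome_anndatas(adata_paired, adata_rna, adata_atac)
+```
+
+**Scenario 3: Completely Unpaired**:
+```python
+# Separate RNA and ATAC experiments (n_rna RNA cells followed by n_atac ATAC cells)
+adata.obs["modality"] = ["expression"] * n_rna + ["accessibility"] * n_atac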
+```
+
+**Advanced Use Cases**:
+
+**Cross-Modality Prediction**:
+```python
+# Predict peaks from gene expression
+accessibility_from_rna = model.get_accessibility_estimates(
+    indices=rna_only_cells
+)
+
+# Predict genes from accessibility
+expression_from_atac = model.get_normalized_expression(
+    indices=atac_only_cells
+)
+```
+
+**Modality-Specific Analysis**:
+```python
+# Separate analysis per modality
+# (assumes the "paired" / "expression" / "accessibility" labels)
+rna_subset = adata[adata.obs["modality"].isin(["paired", "expression"])]
+atac_subset = adata[adata.obs["modality"].isin(["paired", "accessibility"])]
+```
+
+## MrVI (Multi-resolution Variational Inference)
+
+**Purpose**: Multi-sample analysis accounting for sample-specific and shared variation.
+
+**Key Features**:
+- Simultaneously analyzes multiple samples/conditions
+- Decomposes variation into:
+  - Shared variation (common across samples)
+  - Sample-specific variation
+- Enables sample-level comparisons
+- Identifies sample-specific cell states
+
+**When to Use**:
+- Comparing multiple biological samples or conditions
+- Identifying sample-specific vs. shared cell states
+- Disease vs. healthy sample comparisons
+- Understanding inter-sample heterogeneity
+- Multi-donor studies
+
+**Basic Usage**:
+```python
+# MrVI is provided in scvi.external
+from scvi.external import MRVI
+
+MRVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch",
+    sample_key="sample"  # Critical: defines biological samples
+)
+
+model = MRVI(adata)
+model.train()
+
+# Get representations
+shared_latent = model.get_latent_representation()  # u: sample-unaware, shared across samples
+sample_aware = model.get_latent_representation(give_z=True)  # z: includes sample effects
+
+# Per-cell sample-by-sample distance matrices
+sample_distances = model.get_local_sample_distances()
+```
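
To give an intuition for what a sample distance matrix contains, the sketch below computes Euclidean distances between per-sample centroids of an embedding. This is a deliberately simplified stand-in and not MrVI's algorithm, which estimates distances per cell from its own model; the function name and the toy data are hypothetical.

```python
import numpy as np

def sample_distance_matrix(embedding, sample_ids):
    """Pairwise Euclidean distances between per-sample mean embeddings."""
    samples = sorted(set(sample_ids))
    sample_ids = np.asarray(sample_ids)
    centroids = np.stack(
        [embedding[sample_ids == s].mean(axis=0) for s in samples]
    )
    diff = centroids[:, None, :] - centroids[None, :, :]
    return samples, np.sqrt((diff ** 2).sum(axis=-1))

# Four cells from two donors in a 2-D embedding
embedding = np.array([[0.0, 0.0], [0.0, 2.0], [3.0, 5.0], [3.0, 5.0]])
sample_ids = ["donor_a", "donor_a", "donor_b", "donor_b"]
samples, dist = sample_distance_matrix(embedding, sample_ids)
```

The result is symmetric with zeros on the diagonal, which is also the shape to expect when inspecting distances between samples for a group of cells.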
+
+**Key Parameters**:
+- `n_latent`: Dimensionality of the sample-aware latent space `z`
+- `n_latent_u`: Dimensionality of the sample-unaware (shared) latent space `u`
+- `sample_key`: Column defining biological samples (passed to `setup_anndata`)
+
+**Analysis Workflow**:
+```python
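+# 1. Identify shared cell types across samples
+adata.obsm["X_MrVI_shared"] = model.get_latent_representation()
+sc.pp.neighbors(adata, use_rep="X_MrVI_shared")
+sc.tl.umap(adata)
+sc.tl.leiden(adata, key_added="shared_clusters")
+
+# 2. Analyze sample-specific variation
+sample_repr = model.get_latent_representation(give_z=True)
+
+# 3. Compare samples (distances averaged within each shared cluster)
+distances = model.get_local_sample_distances(
+    keep_cell=False,
+    groupby="shared_clusters"
+)
+
+# 4. Find genes associated with a sample-level covariate
+# ("disease_status" is a placeholder for a column in adata.obs)
+de_results = model.differential_expression(
+    sample_cov_keys=["disease_status"],
+    store_lfc=True
+)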
+```
+
+**Use Cases**:
+- **Multi-donor studies**: Separate donor effects from cell type variation
+- **Disease studies**: Identify disease-specific vs. shared biology
+- **Time series**: Separate temporal from stable variation
+- **Batch + biology**: Disentangle technical and biological variation
+
+## totalVI vs. MultiVI vs. MrVI: When to Use Which?
+
+### totalVI
+**Use for**: CITE-seq (RNA + protein, same cells)
+- Paired measurements
+- Single modality type per feature
+- Focus: protein imputation, joint analysis
+
+### MultiVI
+**Use for**: Multiple modalities (RNA + ATAC, etc.)
+- Paired, unpaired, or mixed
+- Different feature types
+- Focus: cross-modality integration and imputation
+
+### MrVI
+**Use for**: Multi-sample RNA-seq
+- Single modality (RNA)
+- Multiple biological samples
+- Focus: sample-level variation decomposition
+
+## Integration Best Practices
+
+### For CITE-seq (totalVI)
+1. **Quality control proteins**: Remove low-quality antibodies
+2. **Background subtraction**: Use empirical background prior
+3. **Joint clustering**: Use joint latent space, not RNA alone
+4. **Validation**: Check known markers in both modalities
+
+### For Multiome/Multi-modal (MultiVI)
+1. **Feature filtering**: Filter genes and peaks independently
+2. **Balance modalities**: Ensure reasonable representation of each
+3. **Modality weights**: Consider if one modality dominates
+4. **Imputation validation**: Validate imputed values carefully
+
+### For Multi-sample (MrVI)
+1. **Sample definition**: Carefully define biological samples
+2. **Sample size**: Need sufficient cells per sample
+3. **Covariate handling**: Properly account for batch vs. sample
+4. **Interpretation**: Distinguish technical from biological variation
+
+## Complete Example: CITE-seq Analysis with totalVI
+
+```python
+import scvi
+import scanpy as sc
+
+# 1. Load CITE-seq data
+adata = sc.read_h5ad("cite_seq.h5ad")
+
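+# 2. QC and filtering
+sc.pp.filter_genes(adata, min_cells=3)
+sc.pp.highly_variable_genes(
+    adata,
+    n_top_genes=4000,
+    flavor="seurat_v3",  # operates on raw counts; requires scikit-misc
+    layer="counts",
+    subset=True
+)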
+
+# Protein QC
+protein_counts = adata.obsm["protein_expression"]
+# Remove low-quality proteins
+
+# 3. Setup totalVI
+scvi.model.TOTALVI.setup_anndata(
+    adata,
+    layer="counts",
+    protein_expression_obsm_key="protein_expression",
+    batch_key="batch"
+)
+
+# 4. Train
+model = scvi.model.TOTALVI(adata, n_latent=20)
+model.train(max_epochs=400)
+
+# 5. Extract joint representation
+latent = model.get_latent_representation()
+adata.obsm["X_totalVI"] = latent
+
+# 6. Clustering on joint space
+sc.pp.neighbors(adata, use_rep="X_totalVI")
+sc.tl.umap(adata)
+sc.tl.leiden(adata, resolution=0.5)
+
+# 7. Differential expression (one table covering genes and proteins)
+de_results = model.differential_expression(
+    groupby="leiden",
+    group1="0",
+    group2="1"
+)
+
+# 8. Save model
+model.save("totalvi_model")
+```
--- a/skills/scvi-tools/references/models-scrna-seq.md
+++ b/skills/scvi-tools/references/models-scrna-seq.md
@@ -0,0 +1,330 @@
+# Single-Cell RNA-seq Models
+
+This document covers core models for analyzing single-cell RNA sequencing data in scvi-tools.
+
+## scVI (Single-Cell Variational Inference)
+
+**Purpose**: Unsupervised analysis, dimensionality reduction, and batch correction for scRNA-seq data.
+
+**Key Features**:
+- Deep generative model based on variational autoencoders (VAE)
+- Learns low-dimensional latent representations that capture biological variation
+- Automatically corrects for batch effects and technical covariates
+- Enables normalized gene expression estimation
+- Supports differential expression analysis
+
+**When to Use**:
+- Initial exploration and dimensionality reduction of scRNA-seq datasets
+- Integrating multiple batches or studies
+- Generating batch-corrected expression matrices
+- Performing probabilistic differential expression analysis
+
+**Basic Usage**:
+```python
+import scvi
+
+# Setup data
+scvi.model.SCVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch"
+)
+
+# Train model
+model = scvi.model.SCVI(adata, n_latent=30)
+model.train()
+
+# Extract results
+latent = model.get_latent_representation()
+normalized = model.get_normalized_expression()
+```
+
+**Key Parameters**:
+- `n_latent`: Dimensionality of latent space (default: 10)
+- `n_layers`: Number of hidden layers (default: 1)
+- `n_hidden`: Number of nodes per hidden layer (default: 128)
+- `dropout_rate`: Dropout rate for neural networks (default: 0.1)
+- `dispersion`: How the dispersion parameter is shared ("gene", "gene-batch", "gene-label", or "gene-cell")
+- `gene_likelihood`: Distribution for data ("zinb", "nb", "poisson")
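
The practical difference between the "nb" and "zinb" likelihoods is how much probability they place on observing a zero. The worked example below uses the standard mean/inverse-dispersion parameterization of the negative binomial; the function names and the numbers are illustrative only.

```python
import numpy as np

def nb_zero_prob(mu, theta):
    """P(count = 0) under a negative binomial with mean mu, inverse dispersion theta."""
    return (theta / (theta + mu)) ** theta

def zinb_zero_prob(mu, theta, pi):
    """P(count = 0) under a zero-inflated NB with extra dropout probability pi."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, theta)

mu, theta = 2.0, 1.0
p_nb = nb_zero_prob(mu, theta)             # (1 / 3) ** 1
p_zinb = zinb_zero_prob(mu, theta, 0.25)   # 0.25 + 0.75 * (1 / 3)
```

With these values the zero-inflated model assigns a probability of 0.5 to a zero, compared with one third under the plain negative binomial. That extra mass on zeros is what "zinb" adds for sparse data.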
+
+**Outputs**:
+- `get_latent_representation()`: Batch-corrected low-dimensional embeddings
+- `get_normalized_expression()`: Denoised, normalized expression values
+- `differential_expression()`: Probabilistic DE testing between groups
+- `get_feature_correlation_matrix()`: Gene-gene correlation estimates
+
+## scANVI (Single-Cell ANnotation using Variational Inference)
+
+**Purpose**: Semi-supervised cell type annotation and integration using labeled and unlabeled cells.
+
+**Key Features**:
+- Extends scVI with cell type labels
+- Leverages partially labeled datasets for annotation transfer
+- Performs simultaneous batch correction and cell type prediction
+- Enables query-to-reference mapping
+
+**When to Use**:
+- Annotating new datasets using reference labels
+- Transfer learning from well-annotated to unlabeled datasets
+- Joint analysis of labeled and unlabeled cells
+- Building cell type classifiers with uncertainty quantification
+
+**Basic Usage**:
+```python
+# Option 1: Train from scratch
+scvi.model.SCANVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch",
+    labels_key="cell_type",
+    unlabeled_category="Unknown"
+)
+model = scvi.model.SCANVI(adata)
+model.train()
+
+# Option 2: Initialize from pretrained scVI
+scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
+scvi_model = scvi.model.SCVI(adata)
+scvi_model.train()
+scanvi_model = scvi.model.SCANVI.from_scvi_model(
+    scvi_model,
+    labels_key="cell_type",
+    unlabeled_category="Unknown"
+)
+scanvi_model.train()
+
+# Predict cell types
+predictions = scanvi_model.predict()
+```
+
+**Key Parameters**:
+- `labels_key`: Column in `adata.obs` containing cell type labels
+- `unlabeled_category`: Label for cells without annotations
+- All scVI parameters are also available
+
+**Outputs**:
+- `predict()`: Cell type predictions for all cells
+- `predict(soft=True)`: Prediction probabilities per cell type
+- `get_latent_representation()`: Cell type-aware latent space
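
Prediction probabilities are most useful when low-confidence calls are kept separate from confident ones. The following sketch shows one simple way to do that on a probability matrix. The helper `call_labels` and the 0.7 cutoff are hypothetical choices, not scANVI defaults.

```python
import numpy as np

def call_labels(probs, class_names, min_confidence=0.7, unknown="Unknown"):
    """Pick the most probable class, or `unknown` when confidence is too low."""
    probs = np.asarray(probs)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= min_confidence
    return [class_names[i] if ok else unknown for i, ok in zip(best, confident)]

# Rows are cells, columns are classes
probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
labels = call_labels(probs, ["T cell", "B cell"])
```

The second cell is left as "Unknown" because neither class reaches the cutoff; such cells are candidates for manual review.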
+
+## AUTOZI
+
+**Purpose**: Automatic identification and modeling of zero-inflated genes in scRNA-seq data.
+
+**Key Features**:
+- Distinguishes biological zeros from technical dropout
+- Learns which genes exhibit zero-inflation
+- Provides gene-specific zero-inflation probabilities
+- Improves downstream analysis by accounting for dropout
+
+**When to Use**:
+- Detecting which genes are affected by technical dropout
+- Improving imputation and normalization for sparse datasets
+- Understanding the extent of zero-inflation in your data
+
+**Basic Usage**:
+```python
+scvi.model.AUTOZI.setup_anndata(adata, layer="counts")
+model = scvi.model.AUTOZI(adata)
+model.train()
+
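+# Get the posterior Beta parameters of each gene's zero-inflation variable
+outputs = model.get_alphas_betas()
+alpha_posterior = outputs["alpha_posterior"]
+beta_posterior = outputs["beta_posterior"]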
+```
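
The Beta parameters still have to be turned into a per-gene call. A minimal sketch follows, assuming the convention that posterior mass below 0.5 indicates zero inflation. It estimates that mass by Monte Carlo with NumPy only; `scipy.stats.beta.cdf` would give the exact value. The function name and toy parameters are hypothetical.

```python
import numpy as np

def zero_inflation_probability(alpha, beta, threshold=0.5, n_draws=20000, seed=0):
    """Estimate P(delta < threshold) for delta ~ Beta(alpha, beta), per gene."""
    rng = np.random.default_rng(seed)
    draws = rng.beta(alpha, beta, size=(n_draws, len(alpha)))
    return (draws < threshold).mean(axis=0)

# Two toy genes: the first has its posterior mass near 0, the second near 1
alpha = np.array([1.0, 20.0])
beta = np.array([20.0, 1.0])
zi_prob = zero_inflation_probability(alpha, beta)
is_zero_inflated = zi_prob > 0.5
```

Under this assumed convention the first gene would be called zero-inflated and the second would not.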
+
+## VeloVI
+
+**Purpose**: RNA velocity analysis using variational inference.
+
+**Key Features**:
+- Joint modeling of spliced and unspliced RNA counts
+- Probabilistic estimation of RNA velocity
+- Accounts for technical noise and batch effects
+- Provides uncertainty quantification for velocity estimates
+
+**When to Use**:
+- Inferring cellular dynamics and differentiation trajectories
+- Analyzing spliced/unspliced count data
+- RNA velocity analysis with batch correction
+
+**Basic Usage**:
+```python
+import scvelo as scv
+from scvi.external import VELOVI
+
+# Prepare velocity data
+scv.pp.filter_and_normalize(adata)
+scv.pp.moments(adata)
+
+# Train VeloVI
+VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu")
+model = VELOVI(adata)
+model.train()
+
+# Get velocity estimates
+latent_time = model.get_latent_time()
+velocities = model.get_velocity()
+```
+
+## contrastiveVI
+
+**Purpose**: Isolating perturbation-specific variations from background biological variation.
+
+**Key Features**:
+- Separates shared variation (common across conditions) from target-specific variation
+- Useful for perturbation studies (drug treatments, genetic perturbations)
+- Identifies condition-specific gene programs
+- Enables discovery of treatment-specific effects
+
+**When to Use**:
+- Analyzing perturbation experiments (drug screens, CRISPR, etc.)
+- Identifying genes responding specifically to treatments
+- Separating treatment effects from background variation
+- Comparing control vs. perturbed conditions
+
+**Basic Usage**:
+```python
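+import numpy as np
+from scvi.external import ContrastiveVI
+
+ContrastiveVI.setup_anndata(
+    adata,
+    layer="counts",
+    batch_key="batch"
+)
+
+model = ContrastiveVI(
+    adata,
+    n_background_latent=10,  # Shared (background) variation
+    n_salient_latent=5       # Target-specific (salient) variation
+)
+
+# Background = control cells, target = treated cells
+background_indices = np.where(adata.obs["condition"] == "control")[0]
+target_indices = np.where(adata.obs["condition"] == "treated")[0]
+model.train(
+    background_indices=background_indices,
+    target_indices=target_indices
+)
+
+# Extract representations
+shared = model.get_latent_representation(representation_kind="background")
+target_specific = model.get_latent_representation(representation_kind="salient")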
+```
+
+## CellAssign
+
+**Purpose**: Marker-based cell type annotation using known marker genes.
+
+**Key Features**:
+- Uses prior knowledge of marker genes for cell types
+- Probabilistic assignment of cells to types
+- Handles marker gene overlap and ambiguity
+- Provides soft assignments with uncertainty
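
Marker knowledge usually starts as a list of genes per cell type, which then has to be turned into a binary genes-by-cell-types matrix. The sketch below shows that conversion with NumPy only; `build_marker_matrix` is a hypothetical helper, and overlapping markers (here CD3D) simply get a 1 in several columns.

```python
import numpy as np

def build_marker_matrix(markers):
    """Turn {cell type: [marker genes]} into a binary genes x cell types matrix."""
    cell_types = list(markers)
    genes = sorted({g for gs in markers.values() for g in gs})
    matrix = np.zeros((len(genes), len(cell_types)), dtype=int)
    for j, ct in enumerate(cell_types):
        for g in markers[ct]:
            matrix[genes.index(g), j] = 1
    return genes, cell_types, matrix

markers = {
    "CD4 T cells": ["CD3D", "CD4"],
    "CD8 T cells": ["CD3D", "CD8A"],
    "B cells": ["CD19"],
}
genes, cell_types, matrix = build_marker_matrix(markers)
```

Wrapping the result as `pd.DataFrame(matrix, index=genes, columns=cell_types)` gives a table with genes as rows and cell types as columns.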
+
+**When to Use**:
+- Annotating cells using known marker genes
+- Leveraging existing biological knowledge for classification
+- Cases where marker gene lists are available but reference datasets are not
+
+**Basic Usage**:
+```python
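+import numpy as np
+import pandas as pd
+from scvi.external import CellAssign
+
+# Create binary marker matrix (genes x cell types)
+marker_gene_mat = pd.DataFrame({
+    "CD4 T cells": [1, 1, 0, 0],  # CD3D, CD4, CD8A, CD19
+    "CD8 T cells": [1, 0, 1, 0],
+    "B cells": [0, 0, 0, 1]
+}, index=["CD3D", "CD4", "CD8A", "CD19"])
+
+# CellAssign needs per-cell size factors and only the marker genes
+lib_size = np.asarray(adata.layers["counts"].sum(axis=1)).ravel()
+adata.obs["size_factor"] = lib_size / lib_size.mean()
+bdata = adata[:, marker_gene_mat.index].copy()
+
+CellAssign.setup_anndata(bdata, size_factor_key="size_factor", layer="counts")
+model = CellAssign(bdata, marker_gene_mat)
+model.train()
+
+# Soft assignments: one probability column per cell type
+predictions = model.predict()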
+```
+
+## Solo (Doublet Detection)
+
+**Purpose**: Identifying doublets (droplets that captured two or more cells) in scRNA-seq data.
+
+**Key Features**:
+- Semi-supervised doublet detection using scVI embeddings
+- Simulates artificial doublets for training
+- Provides doublet probability scores
+- Can be applied to any scVI model
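
The idea of simulating artificial doublets can be illustrated by summing the count vectors of randomly chosen cell pairs. This is a simplified sketch of the concept under stated assumptions, not Solo's actual implementation; the function name is hypothetical, and a pair may occasionally contain the same cell twice.

```python
import numpy as np

def simulate_doublets(counts, n_doublets, seed=0):
    """Create artificial doublets by adding the counts of two random cells."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    first = rng.integers(0, n_cells, size=n_doublets)
    second = rng.integers(0, n_cells, size=n_doublets)
    return counts[first] + counts[second], first, second

# Toy matrix: 3 cells x 3 genes
counts = np.array([[1, 0, 2], [0, 3, 0], [4, 4, 1]])
doublets, first, second = simulate_doublets(counts, n_doublets=5)
```

A classifier trained to separate such simulated profiles from the observed cells can then score how doublet-like each real cell looks.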
+
+**When to Use**:
+- Quality control of scRNA-seq datasets
+- Removing doublets before downstream analysis
+- Assessing doublet rates in your data
+
+**Basic Usage**:
+```python
+# Train scVI model first
+scvi.model.SCVI.setup_anndata(adata, layer="counts")
+scvi_model = scvi.model.SCVI(adata)
+scvi_model.train()
+
+# Train Solo for doublet detection
+solo_model = scvi.external.SOLO.from_scvi_model(scvi_model)
+solo_model.train()
+
+# Predict doublets
+predictions = solo_model.predict()
+doublet_scores = predictions["doublet"]
+adata.obs["doublet_score"] = doublet_scores
+```
+
+## Amortized LDA (Topic Modeling)
+
+**Purpose**: Topic modeling for gene expression using Latent Dirichlet Allocation.
+
+**Key Features**:
+- Discovers gene expression programs (topics)
+- Amortized variational inference for scalability
+- Each cell is a mixture of topics
+- Each topic is a distribution over genes
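
The last two points combine into a simple matrix product: a cell's expected gene frequencies are its topic proportions times the topic-by-gene distributions. The toy numbers below are made up purely to show the arithmetic.

```python
import numpy as np

# Cells x topics: each row sums to 1
topic_proportions = np.array([
    [0.8, 0.2],
    [0.1, 0.9],
])

# Topics x genes: each row is a distribution over genes
topic_gene = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])

# Cells x genes: expected gene frequencies per cell
expected_gene_freq = topic_proportions @ topic_gene
```

Because both inputs are row-stochastic, each row of `expected_gene_freq` is again a distribution over genes.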
+
+**When to Use**:
+- Discovering gene programs or expression modules
+- Understanding compositional structure of expression
+- Alternative dimensionality reduction approach
+- Interpretable decomposition of expression patterns
+
+**Basic Usage**:
+```python
+scvi.model.AmortizedLDA.setup_anndata(adata, layer="counts")
+model = scvi.model.AmortizedLDA(adata, n_topics=10)
+model.train()
+
+# Get topic compositions per cell
+topic_proportions = model.get_latent_representation()
+
+# Get gene loadings per topic
+topic_gene_loadings = model.get_feature_by_topic()
+```
+
+## Model Selection Guidelines
+
+**Choose scVI when**:
+- Starting with unsupervised analysis
+- Need batch correction and integration
+- Want normalized expression and DE analysis
+
+**Choose scANVI when**:
+- Have some labeled cells for training
+- Need cell type annotation
+- Want to transfer labels from reference to query
+
+**Choose AUTOZI when**:
+- Concerned about technical dropout
+- Need to identify zero-inflated genes
+- Working with very sparse datasets
+
+**Choose VeloVI when**:
+- Have spliced/unspliced count data
+- Interested in cellular dynamics
+- Need RNA velocity with batch correction
+
+**Choose contrastiveVI when**:
+- Analyzing perturbation experiments
+- Need to separate treatment effects
+- Want to identify condition-specific programs
+
+**Choose CellAssign when**:
+- Have marker gene lists available
+- Want probabilistic marker-based annotation
+- No reference dataset available
+
+**Choose Solo when**:
+- Need doublet detection
+- Already using scVI for analysis
+- Want probabilistic doublet scores
--- a/skills/scvi-tools/references/models-spatial.md
+++ b/skills/scvi-tools/references/models-spatial.md
@@ -0,0 +1,438 @@
+# Spatial Transcriptomics Models
+
+This document covers models for analyzing spatially-resolved transcriptomics data in scvi-tools.
+
+## DestVI (Deconvolution of Spatial Transcriptomics using Variational Inference)
+
+**Purpose**: Multi-resolution deconvolution of spatial transcriptomics using single-cell reference data.
+
+**Key Features**:
+- Estimates cell type proportions at each spatial location
+- Uses single-cell RNA-seq reference for deconvolution
+- Multi-resolution approach (global and local patterns)
+- Accounts for spatial correlation
+- Provides uncertainty quantification
+
+**When to Use**:
+- Deconvolving Visium or similar spatial transcriptomics
+- Have scRNA-seq reference data with cell type labels
+- Want to map cell types to spatial locations
+- Interested in spatial organization of cell types
+- Need probabilistic estimates of cell type abundance
+
+**Data Requirements**:
+- **Spatial data**: Visium or similar spot-based measurements (target data)
+- **Single-cell reference**: scRNA-seq with cell type annotations
+- Both datasets should share genes
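
The shared-gene requirement amounts to an ordered intersection of the two gene lists. A minimal sketch with plain Python lists follows; the helper name and the gene symbols are illustrative.

```python
def shared_genes(reference_genes, spatial_genes):
    """Genes present in both lists, in the order of the reference."""
    spatial_set = set(spatial_genes)
    return [g for g in reference_genes if g in spatial_set]

reference_genes = ["CD3D", "CD4", "CD8A", "MS4A1", "LYZ"]
spatial_genes = ["LYZ", "CD3D", "EPCAM", "CD8A"]
common = shared_genes(reference_genes, spatial_genes)
```

Both AnnData objects can then be subset with the same list, for example `sc_adata[:, common].copy()` and `spatial_adata[:, common].copy()`, so that the genes line up column by column.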
+
+**Basic Usage**:
+```python
+import scvi
+
+# Step 1: Train CondSCVI (not plain scVI) on the single-cell reference
+scvi.model.CondSCVI.setup_anndata(
+    sc_adata,
+    layer="counts",
+    labels_key="cell_type"  # Cell type labels in reference
+)
+sc_model = scvi.model.CondSCVI(sc_adata, weight_obs=False)
+sc_model.train()
+
+# Step 2: Setup spatial data
+scvi.model.DestVI.setup_anndata(
+    spatial_adata,
+    layer="counts"
+)
+
+# Step 3: Train DestVI using reference
+model = scvi.model.DestVI.from_rna_model(
+    spatial_adata,
+    sc_model
+)
+model.train(max_epochs=2500)
+
+# Step 4: Get cell type proportions
+proportions = model.get_proportions()
+spatial_adata.obsm["proportions"] = proportions
+
+# Step 5: Get cell type-specific expression
+# Expression of genes specific to each cell type at each spot
+ct_expression = model.get_scale_for_ct("T cells")
+```
+
+**Key Parameters**:
+- `amortization`: Amortization strategy ("none", "latent", "proportion", or "both")
+- `n_latent`: Latent dimensionality (inherited from the CondSCVI reference model)
+
+**Outputs**:
+- `get_proportions()`: Cell type proportions at each spot
+- `get_scale_for_ct(cell_type)`: Cell type-specific expression patterns
+- `get_gamma()`: Cell type-specific latent variables (within-cell-type variation) at each spot
+
+**Visualization**:
+```python
+import scanpy as sc
+
+# Copy proportions from .obsm into .obs so they can be plotted by name
+proportions = spatial_adata.obsm["proportions"]
+for ct in proportions.columns:
+    spatial_adata.obs[ct] = proportions[ct]
+
+# Visualize a specific cell type's proportions spatially
+sc.pl.spatial(
+    spatial_adata,
+    color="T cells",
+    spot_size=150
+)
+
+# Or plot every cell type
+for ct in proportions.columns:
+    sc.pl.spatial(
+        spatial_adata,
+        color=ct,
+        title=f"{ct} proportions"
+    )
+```
+
+## Stereoscope
+
+**Purpose**: Cell type deconvolution for spatial transcriptomics using probabilistic modeling.
+
+**Key Features**:
+- Reference-based deconvolution
+- Probabilistic framework for cell type proportions
+- Works with various spatial technologies
+- Handles gene selection and normalization
+
+**When to Use**:
+- Want a simpler approach than DestVI
+- Deconvolving spatial data with reference
+- Faster alternative for basic deconvolution
+
+**Basic Usage**:
+```python
+from scvi.external import RNAStereoscope, SpatialStereoscope
+
+RNAStereoscope.setup_anndata(
+    sc_adata,
+    labels_key="cell_type",
+    layer="counts"
+)
+
+# Train on reference
+ref_model = RNAStereoscope(sc_adata)
+ref_model.train()
+
+# Setup spatial data
+SpatialStereoscope.setup_anndata(spatial_adata, layer="counts")
+
+# Transfer to spatial
+spatial_model = SpatialStereoscope.from_rna_model(
+    spatial_adata,
+    ref_model
+)
+spatial_model.train()
+
+# Get proportions
+proportions = spatial_model.get_proportions()
+```
+
+## Tangram
+
+**Purpose**: Spatial mapping and integration of single-cell data to spatial locations.
+
+**Key Features**:
+- Maps single cells to spatial coordinates
+- Learns optimal transport between single-cell and spatial data
+- Gene imputation at spatial locations
+- Cell type mapping
+
+**When to Use**:
+- Mapping cells from scRNA-seq to spatial locations
+- Imputing unmeasured genes in spatial data
+- Understanding spatial organization at single-cell resolution
+- Integrating scRNA-seq and spatial transcriptomics
+
+**Data Requirements**:
+- Single-cell RNA-seq data with annotations
+- Spatial transcriptomics data
+- Shared genes between modalities
+
+**Basic Usage**:
+```python
+import tangram as tg
+
+# Restrict both datasets to shared training genes
+tg.pp_adatas(sc_adata, spatial_adata)
+
+# Map cells to spatial locations
+ad_map = tg.map_cells_to_space(
+    adata_sc=sc_adata,
+    adata_sp=spatial_adata,
+    mode="cells",  # or "clusters" for cell type mapping
+    density_prior="rna_count_based"
+)
+
+# Get mapping matrix (cells × spots)
+mapping = ad_map.X
+
+# Project cell annotations to space
+# (results are stored in spatial_adata.obsm["tangram_ct_pred"])
+tg.project_cell_annotations(
+    ad_map,
+    spatial_adata,
+    annotation="cell_type"
+)
+
+# Impute genes in spatial data: returns a new AnnData (spots × genes)
+ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=sc_adata)
+genes_to_impute = ["cd3d", "cd8a", "cd4"]  # pp_adatas lowercases gene names
+imputed = ad_ge[:, genes_to_impute].X
+```
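
Conceptually, both projections are matrix products with the cells-by-spots mapping matrix. The sketch below shows the arithmetic on toy data. It is an assumption-based illustration rather than Tangram's exact code, which may apply additional normalization.

```python
import numpy as np

# Mapping matrix: 3 cells x 2 spots (probability that a cell sits in a spot)
mapping = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])

# Project cell type annotations: one-hot encode, then sum per spot
cell_types = ["T", "B", "T"]
type_names = sorted(set(cell_types))                      # ["B", "T"]
one_hot = np.array(
    [[ct == name for name in type_names] for ct in cell_types], dtype=float
)
spot_type_scores = mapping.T @ one_hot                    # spots x cell types

# Project gene expression: cells x genes -> spots x genes
sc_expr = np.array([[10.0, 0.0], [0.0, 6.0], [4.0, 2.0]])
spot_expr = mapping.T @ sc_expr
```

Each spot's score for a cell type is the summed mapping probability of the cells carrying that label, and projected expression is the mapping-weighted sum of single-cell profiles.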
+
+**Visualization**:
+```python
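+# Visualize cell type mapping
+# project_cell_annotations stores per-spot scores in .obsm["tangram_ct_pred"]
+ct_pred = spatial_adata.obsm["tangram_ct_pred"]
+spatial_adata.obs["T cells (projected)"] = ct_pred["T cells"]
+sc.pl.spatial(
+    spatial_adata,
+    color="T cells (projected)",
+    spot_size=100
+)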
+```
+
+## gimVI
+
+**Purpose**: Cross-modality imputation between spatial and single-cell data.
+
+**Key Features**:
+- Joint model of spatial and single-cell data
+- Imputes missing genes in spatial data
+- Enables cross-dataset queries
+- Learns shared representations
+
+**When to Use**:
+- Imputing genes not measured in spatial data
+- Joint analysis of spatial and single-cell datasets
+- Mapping between modalities
+
+**Basic Usage**:
+```python
+from scvi.external import GIMVI
+
+# gimVI takes the two datasets separately; the spatial genes must be
+# a subset of the single-cell genes
+GIMVI.setup_anndata(sc_adata, layer="counts")
+GIMVI.setup_anndata(spatial_adata, layer="counts")
+
+model = GIMVI(sc_adata, spatial_adata)
+model.train()
+
+# Impute genes: returns one array per dataset (single-cell, spatial)
+_, imputed = model.get_imputed_values(normalized=True)
+```
+
+## scVIVA (Variation in Variational Autoencoders for Spatial)
+
+**Purpose**: Analyzing cell-environment relationships in spatial data.
+
+**Key Features**:
+- Models cellular neighborhoods and environments
+- Identifies environment-associated gene expression
+- Accounts for spatial correlation structure
+- Cell-cell interaction analysis
+
+**When to Use**:
+- Understanding how spatial context affects cells
+- Identifying niche-specific gene programs
+- Cell-cell interaction studies
+- Microenvironment analysis
+
+**Data Requirements**:
+- Spatial transcriptomics with coordinates
+- Cell type annotations (optional)
+
+**Basic Usage**:
+```python
+scvi.model.SCVIVA.setup_anndata(
+    spatial_adata,
+    layer="counts",
+    spatial_key="spatial"  # Coordinates in .obsm
+)
+
+model = scvi.model.SCVIVA(spatial_adata)
+model.train()
+
+# Get environment representations
+env_latent = model.get_environment_representation()
+
+# Identify environment-associated genes
+env_genes = model.get_environment_specific_genes()
+```
+
+## ResolVI
+
+**Purpose**: Addressing spatial transcriptomics noise through resolution-aware modeling.
+
+**Key Features**:
+- Accounts for spatial resolution effects
+- Denoises spatial data
+- Multi-scale analysis
+- Improves downstream analysis quality
+
+**When to Use**:
+- Noisy spatial data
+- Multiple spatial resolutions
+- Need denoising before analysis
+- Improving data quality
+
+**Basic Usage**:
+```python
+scvi.model.RESOLVI.setup_anndata(
+    spatial_adata,
+    layer="counts",
+    spatial_key="spatial"
+)
+
+model = scvi.model.RESOLVI(spatial_adata)
+model.train()
+
+# Get denoised expression
+denoised = model.get_denoised_expression()
+```
+
+## Model Selection for Spatial Transcriptomics
+
+### DestVI
+**Choose when**:
+- Need detailed deconvolution with reference
+- Have high-quality scRNA-seq reference
+- Want multi-resolution analysis
+- Need uncertainty quantification
+
+**Best for**: Visium, spot-based technologies
+
+### Stereoscope
+**Choose when**:
+- Need simpler, faster deconvolution
+- Basic cell type proportion estimates
+- Limited computational resources
+
+**Best for**: Quick deconvolution tasks
+
+### Tangram
+**Choose when**:
+- Want single-cell resolution mapping
+- Need to impute many genes
+- Interested in cell positioning
+- Optimal transport approach preferred
+
+**Best for**: Detailed spatial mapping
+
+### gimVI
+**Choose when**:
+- Need bidirectional imputation
+- Joint modeling of spatial and single-cell
+- Cross-dataset queries
+
+**Best for**: Integration and imputation
+
+### scVIVA
+**Choose when**:
+- Interested in cellular environments
+- Cell-cell interaction analysis
+- Neighborhood effects
+
+**Best for**: Microenvironment studies
+
+### ResolVI
+**Choose when**:
+- Data quality is a concern
+- Need denoising
+- Multi-scale analysis
+
+**Best for**: Noisy data preprocessing
+
+## Complete Workflow: Spatial Deconvolution with DestVI
+
+```python
+import scvi
+import scanpy as sc
+import squidpy as sq
+
+# ===== Part 1: Load data and align genes =====
+sc_adata = sc.read_h5ad("reference_scrna.h5ad")  # raw counts in layers["counts"]
+spatial_adata = sc.read_visium("path/to/visium")
+spatial_adata.var_names_make_unique()
+spatial_adata.layers["counts"] = spatial_adata.X.copy()
+
+# QC and filtering
+sc.pp.filter_genes(sc_adata, min_cells=10)
+sc.pp.filter_genes(spatial_adata, min_cells=10)
+
+# Keep genes measured in both datasets, then select variable genes
+shared = sc_adata.var_names.intersection(spatial_adata.var_names)
+sc_adata = sc_adata[:, shared].copy()
+sc.pp.highly_variable_genes(
+    sc_adata,
+    n_top_genes=4000,
+    flavor="seurat_v3",  # operates on raw counts; requires scikit-misc
+    layer="counts",
+    subset=True
+)
+spatial_adata = spatial_adata[:, sc_adata.var_names].copy()
+
+# ===== Part 2: Train CondSCVI on the reference =====
+# DestVI expects a CondSCVI reference model rather than plain scVI
+scvi.model.CondSCVI.setup_anndata(
+    sc_adata,
+    layer="counts",
+    labels_key="cell_type"
+)
+
+sc_model = scvi.model.CondSCVI(sc_adata, weight_obs=False)
+sc_model.train(max_epochs=400)
+
+# ===== Part 3: Run DestVI =====
+scvi.model.DestVI.setup_anndata(
+    spatial_adata,
+    layer="counts"
+)
+
+destvi_model = scvi.model.DestVI.from_rna_model(
+    spatial_adata,
+    sc_model
+)
+
+destvi_model.train(max_epochs=2500)
+
+# ===== Part 4: Extract results =====
+# Get proportions (DataFrame: spots x cell types)
+proportions = destvi_model.get_proportions()
+spatial_adata.obsm["proportions"] = proportions
+
+# Add proportions to .obs for easy plotting
+for ct in proportions.columns:
+    spatial_adata.obs[f"prop_{ct}"] = proportions[ct]
+
+# ===== Part 5: Visualization =====
+# Plot specific cell types
+cell_types = ["T cells", "B cells", "Macrophages"]
+
+for ct in cell_types:
+    sc.pl.spatial(
+        spatial_adata,
+        color=f"prop_{ct}",
+        title=f"{ct} proportions",
+        spot_size=150,
+        cmap="viridis"
+    )
+
+# ===== Part 6: Spatial analysis =====
+# Compute spatial neighbors
+sq.gr.spatial_neighbors(spatial_adata)
+
+# Spatial autocorrelation of cell types
+for ct in cell_types:
+    sq.gr.spatial_autocorr(
+        spatial_adata,
+        attr="obs",
+        mode="moran",
+        genes=[f"prop_{ct}"]
+    )
+
+# ===== Part 7: Save results =====
+destvi_model.save("destvi_model")
+spatial_adata.write("spatial_deconvolved.h5ad")
+```
+
+## Best Practices for Spatial Analysis
+
+1. **Reference quality**: Use high-quality, well-annotated scRNA-seq reference
+2. **Gene overlap**: Ensure sufficient shared genes between reference and spatial
+3. **Spatial coordinates**: Properly register spatial coordinates in `.obsm["spatial"]`
+4. **Validation**: Use known marker genes to validate deconvolution
+5. **Visualization**: Always visualize results spatially to check biological plausibility
+6. **Cell type granularity**: Consider appropriate cell type resolution
+7. **Computational resources**: Spatial models can be memory-intensive
+8. **Quality control**: Filter low-quality spots before analysis
--- a/skills/scvi-tools/references/models-specialized.md
+++ b/skills/scvi-tools/references/models-specialized.md
@@ -0,0 +1,408 @@
+# Specialized Modality Models
+
+This document covers models for specialized single-cell data modalities in scvi-tools.
+
+## MethylVI / MethylANVI (Methylation Analysis)
+
+**Purpose**: Analysis of single-cell bisulfite sequencing (scBS-seq) data for DNA methylation.
+
+**Key Features**:
+- Models methylation patterns at single-cell resolution
+- Handles sparsity in methylation data
+- Batch correction for methylation experiments
+- Label transfer (MethylANVI) for cell type annotation
+
+**When to Use**:
+- Analyzing scBS-seq or similar methylation data
+- Studying DNA methylation patterns across cell types
+- Integrating methylation data across batches
+- Cell type annotation based on methylation profiles
+
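+**Data Requirements**:
+- Per-cell methylated-cytosine counts and total coverage for each genomic feature (e.g., gene bodies or fixed-size bins)
+- Format: MuData with one modality per methylation context (e.g., mCG, mCH), each holding the two count matrices as layers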
+
+### MethylVI (Unsupervised)
+
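+**Basic Usage**:
+```python
+from scvi.external import METHYLVI
+
+# MethylVI is set up from a MuData object with one modality per
+# methylation context; layer and context names below are dataset-specific
+METHYLVI.setup_mudata(
+    mdata,
+    mc_layer="mc",    # Methylated cytosine counts
+    cov_layer="cov",  # Total coverage
+    methylation_contexts=["mCG", "mCH"],
+    batch_key="batch",
+    modalities={"batch_key": "mCG"}
+)
+
+model = METHYLVI(mdata)
+model.train()
+
+# Get latent representation
+latent = model.get_latent_representation()
+
+# Get normalized methylation values
+normalized_meth = model.get_normalized_methylation()
+```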
+
+### MethylANVI (Semi-supervised with cell types)
+
+**Basic Usage**:
+```python
+from scvi.external import METHYLANVI
+
+# Setup with cell type labels (layer and context names are dataset-specific)
+METHYLANVI.setup_mudata(
+    mdata,
+    mc_layer="mc",
+    cov_layer="cov",
+    labels_key="cell_type",
+    unlabeled_category="Unknown",
+    methylation_contexts=["mCG", "mCH"],
+    batch_key="batch",
+    modalities={"batch_key": "mCG", "labels_key": "mCG"}
+)
+
+model = METHYLANVI(mdata)
+model.train()
+
+# Predict cell types
+predictions = model.predict()
+```
+
+**Key Parameters**:
+- `n_latent`: Latent dimensionality
+- `likelihood`: Count likelihood for methylated reads given coverage (beta-binomial or binomial)
+
+**Use Cases**:
+- Epigenetic heterogeneity analysis
+- Cell type identification via methylation
+- Integration with gene expression data (separate analysis)
+- Differential methylation analysis
+
+## CytoVI (Flow and Mass Cytometry)
+
+**Purpose**: Batch correction and integration of flow cytometry and mass cytometry (CyTOF) data.
+
+**Key Features**:
+- Handles antibody-based protein measurements
+- Corrects batch effects in cytometry data
+- Enables integration across experiments
+- Designed for high-dimensional protein panels
+
+**When to Use**:
+- Analyzing flow cytometry or CyTOF data
+- Integrating cytometry experiments across batches
+- Batch correction for protein panels
+- Cross-study cytometry integration
+
+**Data Requirements**:
+- Protein expression matrix (cells × proteins)
+- Flow cytometry or CyTOF measurements
+- Batch/experiment annotations
+
+**Basic Usage**:
+```python
+from scvi.external import CYTOVI
+
+# Assumes transformed and scaled marker intensities in adata.layers["scaled"]
+CYTOVI.setup_anndata(
+    adata,
+    layer="scaled",
+    batch_key="batch"
+)
+
+model = CYTOVI(adata)
+model.train()
+
+# Get batch-corrected representation
+latent = model.get_latent_representation()
+
+# Get normalized protein values
+normalized = model.get_normalized_expression()
+```
+
+**Key Parameters**:
+- `n_latent`: Latent space dimensionality
+- `n_layers`: Network depth
+
+**Typical Workflow**:
+```python
+import scanpy as sc
+from scvi.external import CYTOVI
+
+# 1. Load cytometry data
+adata = sc.read_h5ad("cytof_data.h5ad")
+
+# 2. Train CytoVI (assumes scaled intensities in adata.layers["scaled"])
+CYTOVI.setup_anndata(
+    adata,
+    layer="scaled",
+    batch_key="experiment"
+)
+model = CYTOVI(adata)
+model.train()
+
+# 3. Get batch-corrected latent representation
+latent = model.get_latent_representation()
+adata.obsm["X_CytoVI"] = latent
+
+# 4. Downstream analysis
+sc.pp.neighbors(adata, use_rep="X_CytoVI")
+sc.tl.umap(adata)
+sc.tl.leiden(adata)
+
+# 5. Visualize batch correction (color by the registered batch column)
+sc.pl.umap(adata, color=["experiment", "leiden"])
+```
+
+## SysVI (Systems-level Integration)
+
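+**Purpose**: Integration of datasets with substantial batch effects (e.g., across species, organoid vs. primary tissue, or different sequencing protocols), with emphasis on preserving biological variation.
+
+**Key Features**:
+- Conditional VAE with a VampPrior and a latent cycle-consistency loss
+- Preserves biological signals while removing technical effects
+- Designed for "systems" that differ more strongly than ordinary batches
+
+**When to Use**:
+- Standard batch correction leaves the systems separated
+- Need to preserve subtle biological variation
+- Cross-species, cross-protocol, or in vitro vs. in vivo integration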
+
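+**Basic Usage**:
+```python
+from scvi.external import SysVI
+
+# SysVI expects normalized, log-transformed expression in .X (not raw counts)
+SysVI.setup_anndata(
+    adata,
+    batch_key="system",                   # Strong effect, e.g. species or protocol
+    categorical_covariate_keys=["batch"]  # Batches within each system
+)
+
+model = SysVI(adata)
+model.train()
+
+latent = model.get_latent_representation()
+```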
+
+## Decipher (Trajectory Inference)
+
+**Purpose**: Trajectory inference and pseudotime analysis for single-cell data.
+
+**Key Features**:
+- Learns cellular trajectories and differentiation paths
+- Pseudotime estimation
+- Accounts for uncertainty in trajectory structure
+- Compatible with scVI embeddings
+
+**When to Use**:
+- Studying cellular differentiation
+- Time-course or developmental datasets
+- Understanding cell state transitions
+- Identifying branching points in development
+
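+**Basic Usage**:
+```python
+import scanpy as sc
+from scvi.external import Decipher
+
+# Decipher is trained directly on raw counts
+Decipher.setup_anndata(adata, layer="counts")
+decipher_model = Decipher(adata)
+decipher_model.train()
+
+# Store the Decipher latent representation
+adata.obsm["X_decipher"] = decipher_model.get_latent_representation()
+
+# Derive a pseudotime from the latent space, here with scanpy's diffusion
+# pseudotime; the root cell must be chosen from prior knowledge
+sc.pp.neighbors(adata, use_rep="X_decipher")
+sc.tl.diffmap(adata)
+adata.uns["iroot"] = 0  # Placeholder: index of a known progenitor cell
+sc.tl.dpt(adata)
+adata.obs["pseudotime"] = adata.obs["dpt_pseudotime"]
+```
+
+**Visualization**:
+```python
+import scanpy as sc
+
+# Plot pseudotime on UMAP (computed from the neighbors graph above)
+sc.tl.umap(adata)
+sc.pl.umap(adata, color="pseudotime", cmap="viridis")
+
+# Gene expression along pseudotime
+sc.pl.scatter(adata, x="pseudotime", y="gene_of_interest")
+```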
+
+## peRegLM (Peak Regulatory Linear Model)
+
+**Purpose**: Linking chromatin accessibility to gene expression for regulatory analysis.
+
+**Key Features**:
+- Links ATAC-seq peaks to gene expression
+- Identifies regulatory relationships
+- Works with paired multiome data
+
+**When to Use**:
+- Multiome data (RNA + ATAC from same cells)
+- Understanding gene regulation
+- Linking peaks to target genes
+- Regulatory network construction
+
+**Basic Usage**:
+```python
+# Requires paired RNA + ATAC data
+scvi.model.PEREGLM.setup_anndata(
+    multiome_adata,
+    rna_layer="counts",
+    atac_layer="atac_counts"
+)
+
+model = scvi.model.PEREGLM(multiome_adata)
+model.train()
+
+# Get peak-gene links
+peak_gene_links = model.get_regulatory_links()
+```
+
+## Model-Specific Best Practices
+
+### MethylVI/MethylANVI
+1. **Sparsity**: Methylation data is inherently sparse; model accounts for this
+2. **CpG selection**: Filter CpGs with very low coverage
+3. **Biological interpretation**: Consider genomic context (promoters, enhancers)
+4. **Integration**: For multi-omics, analyze separately then integrate results
+
+### CytoVI
+1. **Protein QC**: Remove low-quality or uninformative proteins
+2. **Compensation**: Ensure proper spectral compensation before analysis
+3. **Batch design**: Include biological and technical replicates
+4. **Controls**: Use control samples to validate batch correction
+
+### SysVI
+1. **Sample size**: Designed for large-scale integration
+2. **Batch definition**: Carefully define batch structure
+3. **Biological validation**: Verify biological signals preserved
+
+### Decipher
+1. **Start point**: Define trajectory start cells if known
+2. **Branching**: Specify expected number of branches
+3. **Validation**: Use known markers to validate pseudotime
+4. **Integration**: Works well with scVI embeddings
+
+## Integration with Other Models
+
+Many specialized models work well in combination:
+
+**Methylation + Expression**:
+```python
+# Analyze separately, then integrate
+methylvi_model = scvi.external.METHYLVI(meth_mdata)  # MuData input
+scvi_model = scvi.model.SCVI(rna_adata)
+
+# Integrate results at analysis level
+# E.g., correlate methylation and expression patterns
+```
+
+**Cytometry + CITE-seq**:
+```python
+# CytoVI for flow/CyTOF
+cyto_model = scvi.external.CYTOVI(cyto_adata)
+
+# totalVI for CITE-seq
+cite_model = scvi.model.TOTALVI(cite_adata)
+
+# Compare protein measurements across platforms
+```
+
+**ATAC + RNA (Multiome)**:
+```python
+# MultiVI for joint analysis
+multivi_model = scvi.model.MULTIVI(multiome_adata)
+
+# peRegLM for regulatory links
+pereglm_model = scvi.model.PEREGLM(multiome_adata)
+```
+
+## Choosing Specialized Models
+
+### Decision Tree
+
+1. **What data modality?**
+   - Methylation → MethylVI/MethylANVI
+   - Flow/CyTOF → CytoVI
+   - Trajectory → Decipher
+   - Multi-batch integration → SysVI
+   - Regulatory links → peRegLM
+
+2. **Do you have labels?**
+   - Yes → MethylANVI (methylation)
+   - No → MethylVI (methylation)
+
+3. **What's your main goal?**
+   - Batch correction → CytoVI, SysVI
+   - Trajectory/pseudotime → Decipher
+   - Peak-gene links → peRegLM
+   - Methylation patterns → MethylVI/ANVI
+
+## Example: Complete Methylation Analysis
+
+```python
+import mudata
+import scanpy as sc
+from scvi.external import METHYLVI
+
+# 1. Load methylation data (MuData with one modality per methylation context)
+mdata = mudata.read_h5mu("methylation_data.h5mu")
+
+# 2. QC: filter low-coverage features before building the MuData object
+
+# 3. Setup MethylVI (layer and context names are dataset-specific)
+METHYLVI.setup_mudata(
+    mdata,
+    mc_layer="mc",
+    cov_layer="cov",
+    methylation_contexts=["mCG", "mCH"],
+    batch_key="batch",
+    modalities={"batch_key": "mCG"}
+)
+
+# 4. Train model
+model = METHYLVI(mdata, n_latent=15)
+model.train(max_epochs=400)
+
+# 5. Get latent representation
+latent = model.get_latent_representation()
+mdata.obsm["X_MethylVI"] = latent
+
+# 6. Clustering
+sc.pp.neighbors(mdata, use_rep="X_MethylVI")
+sc.tl.umap(mdata)
+sc.tl.leiden(mdata)
+
+# 7. Differential methylation
+dm_results = model.differential_methylation(
+    groupby="leiden",
+    group1="0",
+    group2="1"
+)
+
+# 8. Save
+model.save("methylvi_model")
+mdata.write("methylation_analyzed.h5mu")
+```
+
+## External Tools Integration
+
+Some specialized models are available through the `scvi.external` module:
+
+**SOLO** (doublet detection):
+```python
+from scvi.external import SOLO
+
+solo = SOLO.from_scvi_model(scvi_model)
+solo.train()
+doublets = solo.predict()
+```
+
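+**scArches** (reference mapping):
+```python
+import scvi
+
+# scArches-style transfer learning is built into the core models rather than
+# exposed as a separate class: use prepare_query_anndata and load_query_data
+scvi.model.SCVI.prepare_query_anndata(query_adata, "reference_model")
+query_model = scvi.model.SCVI.load_query_data(query_adata, "reference_model")
+```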
+
+These external tools extend scvi-tools functionality for specific use cases.
+
+## Summary Table
+
+| Model | Data Type | Primary Use | Supervised? |
+|-------|-----------|-------------|-------------|
+| MethylVI | Methylation | Unsupervised analysis | No |
+| MethylANVI | Methylation | Cell type annotation | Semi |
+| CytoVI | Cytometry | Batch correction | No |
+| SysVI | scRNA-seq | Large-scale integration | No |
+| Decipher | scRNA-seq | Trajectory inference | No |
+| peRegLM | Multiome | Peak-gene links | No |
+| SOLO | scRNA-seq | Doublet detection | Semi |
--- a/skills/scvi-tools/references/theoretical-foundations.md
+++ b/skills/scvi-tools/references/theoretical-foundations.md
@@ -0,0 +1,438 @@
+# Theoretical Foundations of scvi-tools
+
+This document explains the mathematical and statistical principles underlying scvi-tools.
+
+## Core Concepts
+
+### Variational Inference
+
+**What is it?**
+Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x) - the probability of latent variables z given observed data x.
+
+**Why use it?**
+- Exact inference is computationally intractable for complex models
+- Scales to large datasets (millions of cells)
+- Provides uncertainty quantification
+- Enables Bayesian reasoning about cell states
+
+**How does it work?**
+1. Define a simpler approximate distribution q(z|x) with learnable parameters
+2. Minimize the KL divergence between q(z|x) and true posterior p(z|x)
+3. Equivalent to maximizing the Evidence Lower Bound (ELBO)
+
+**ELBO Objective**:
+```
+ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
+       ↑                    ↑
+  Reconstruction          Regularization
+```
+
+- **Reconstruction term**: Model should generate data similar to observed
+- **Regularization term**: Latent representation should match prior
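+
+A minimal NumPy sketch of the two ELBO terms, assuming a diagonal Gaussian q(z|x) and a standard normal prior so that the KL term has a closed form. The function names are illustrative and not part of scvi-tools, and the reconstruction log-likelihood is passed in as a plain number:
+
+```python
+import numpy as np
+
+
+def gaussian_kl_to_standard_normal(mu, log_var):
+    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
+    mu = np.asarray(mu, dtype=float)
+    log_var = np.asarray(log_var, dtype=float)
+    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)
+
+
+def elbo(log_likelihood, mu, log_var):
+    """Single-sample ELBO estimate: reconstruction term minus KL regularizer."""
+    return log_likelihood - gaussian_kl_to_standard_normal(mu, log_var)
+
+
+# Hypothetical encoder outputs for one cell with a 2-dimensional latent space
+mu = [0.5, -1.0]
+log_var = [0.0, 0.0]
+kl = gaussian_kl_to_standard_normal(mu, log_var)
+print(kl, elbo(-120.0, mu, log_var))
+```
+
+The KL term is zero only when q(z|x) equals the prior, so any information the encoder stores about a cell has to pay for itself through a better reconstruction term.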
+
+### Variational Autoencoders (VAEs)
+
+**Architecture**:
+```
+x (observed data)
+    ↓
+[Encoder Neural Network]
+    ↓
+z (latent representation)
+    ↓
+[Decoder Neural Network]
+    ↓
+x̂ (reconstructed data)
+```
+
+**Encoder**: Maps cells (x) to latent space (z)
+- Learns q(z|x), the approximate posterior
+- Parameterized by neural network with learnable weights
+- Outputs mean and variance of latent distribution
+
+**Decoder**: Maps latent space (z) back to gene space
+- Learns p(x|z), the likelihood
+- Generates gene expression from latent representation
+- Models count distributions (Negative Binomial, Zero-Inflated NB)
+
+**Reparameterization Trick**:
+- Allows backpropagation through stochastic sampling
+- Sample z = μ + σ ⊙ ε, where ε ~ N(0, I)
+- Enables end-to-end training with gradient descent
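+
+The reparameterization step can be sketched in NumPy as follows. This is an illustration with made-up values for μ and σ², not the scvi-tools implementation (which uses PyTorch distributions):
+
+```python
+import numpy as np
+
+
+def reparameterize(mu, log_var, rng):
+    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
+    eps = rng.standard_normal(mu.shape)
+    return mu + np.exp(0.5 * log_var) * eps
+
+
+rng = np.random.default_rng(0)
+n_draws = 100_000
+
+# Hypothetical encoder outputs, repeated so the sample moments can be checked
+mu = np.tile([1.0, -2.0], (n_draws, 1))
+log_var = np.tile(np.log([0.25, 4.0]), (n_draws, 1))
+
+z = reparameterize(mu, log_var, rng)
+print(z.mean(axis=0), z.std(axis=0))
+```
+
+Because the randomness lives entirely in ε, z is a deterministic, differentiable function of μ and log σ², which is what lets gradients reach the encoder.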
+
+### Amortized Inference
+
+**Concept**: Share encoder parameters across all cells.
+
+**Traditional inference**: Learn separate variational parameters for each cell
+- On the order of n_cells × n_latent parameters (a mean and a variance per latent dimension)
+- Doesn't scale to large datasets
+
+**Amortized inference**: Learn single encoder for all cells
+- Fixed number of parameters regardless of cell count
+- Enables fast inference on new cells
+- Transfers learned patterns across dataset
+
+**Benefits**:
+- Scalable to millions of cells
+- Fast inference on query data
+- Leverages shared structure across cells
+- Enables few-shot learning
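+
+To make the scaling argument concrete, the sketch below counts parameters for both approaches. It assumes a single-hidden-layer encoder with separate mean and log-variance heads; the layer sizes are illustrative and the helper names are hypothetical:
+
+```python
+def per_cell_variational_params(n_cells, n_latent):
+    """Non-amortized: a mean and a variance per latent dimension for every cell."""
+    return 2 * n_cells * n_latent
+
+
+def encoder_params(n_genes, n_hidden, n_latent):
+    """Amortized: one hidden layer plus mean and log-variance output heads."""
+    hidden = n_genes * n_hidden + n_hidden
+    heads = 2 * (n_hidden * n_latent + n_latent)
+    return hidden + heads
+
+
+n_genes, n_hidden, n_latent = 2000, 128, 10
+for n_cells in (10_000, 1_000_000):
+    print(
+        n_cells,
+        per_cell_variational_params(n_cells, n_latent),
+        encoder_params(n_genes, n_hidden, n_latent),
+    )
+```
+
+The encoder's parameter count does not depend on `n_cells`, while the per-cell count grows linearly with it.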
+
+## Statistical Modeling
+
+### Count Data Distributions
+
+Single-cell data are counts (integer-valued), requiring appropriate distributions.
+
+#### Negative Binomial (NB)
+```
+x ~ NB(μ, θ)
+```
+- **μ (mean)**: Expected expression level
+- **θ (dispersion)**: Controls variance
+- **Variance**: Var(x) = μ + μ²/θ
+
+**When to use**: Gene expression without zero-inflation
+- More flexible than Poisson (allows overdispersion)
+- Models technical and biological variation
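+
+The mean/dispersion parameterization can be checked by simulation. The sketch below maps (μ, θ) onto NumPy's (n, p) parameterization, with n = θ and p = θ / (θ + μ), and compares the sample variance with μ + μ²/θ; the specific values are arbitrary:
+
+```python
+import numpy as np
+
+
+def sample_nb(mu, theta, size, rng):
+    """Sample NB counts with mean mu and inverse dispersion theta."""
+    p = theta / (theta + mu)  # NumPy's success probability
+    return rng.negative_binomial(n=theta, p=p, size=size)
+
+
+rng = np.random.default_rng(0)
+mu, theta = 5.0, 2.0
+x = sample_nb(mu, theta, size=200_000, rng=rng)
+
+expected_var = mu + mu**2 / theta
+print(x.mean(), x.var(), expected_var)
+```
+
+As θ grows, the μ²/θ term vanishes and the distribution approaches a Poisson with variance μ.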
+
+#### Zero-Inflated Negative Binomial (ZINB)
+```
+x ~ π·δ₀ + (1-π)·NB(μ, θ)
+```
+- **π (dropout rate)**: Probability of technical zero
+- **δ₀**: Point mass at zero
+- **NB(μ, θ)**: Expression when not dropped out
+
+**When to use**: Sparse scRNA-seq data
+- Models technical dropout separately from biological zeros
+- Better fit for highly sparse data (e.g., 10x data)
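+
+A short simulation shows how the mixture inflates the zero fraction. This is a sketch with arbitrary parameter values: the probability of a zero is π plus the zeros the NB component produces on its own, (1 − π)·(θ / (θ + μ))^θ:
+
+```python
+import numpy as np
+
+
+def zinb_zero_probability(mu, theta, pi):
+    """P(x = 0) under ZINB: dropout mass plus the NB component's own zeros."""
+    nb_zero = (theta / (theta + mu)) ** theta
+    return pi + (1.0 - pi) * nb_zero
+
+
+def sample_zinb(mu, theta, pi, size, rng):
+    """Sample NB counts, then zero out each one with probability pi."""
+    counts = rng.negative_binomial(n=theta, p=theta / (theta + mu), size=size)
+    dropped = rng.random(size) < pi
+    return np.where(dropped, 0, counts)
+
+
+rng = np.random.default_rng(0)
+mu, theta, pi = 5.0, 2.0, 0.3
+x = sample_zinb(mu, theta, pi, size=200_000, rng=rng)
+p0 = zinb_zero_probability(mu, theta, pi)
+print(p0, (x == 0).mean())
+```
+
+Comparing the observed zero fraction of a gene with the NB-only term is one informal way to judge whether the extra π component is needed.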
+
+#### Poisson
+```
+x ~ Poisson(μ)
+```
+- Simplest count distribution
+- Mean equals variance: Var(x) = μ
+
+**When to use**: Less common; ATAC-seq fragment counts
+- More restrictive than NB
+- Faster computation
+
+### Batch Correction Framework
+
+**Problem**: Technical variation confounds biological signal
+- Different sequencing runs, protocols, labs
+- Must remove technical effects while preserving biology
+
+**scvi-tools approach**:
+1. Encode batch as categorical variable s
+2. Include s in generative model
+3. Latent space z is batch-invariant
+4. Decoder conditions on s for batch-specific effects
+
+**Mathematical formulation**:
+```
+Encoder: q(z|x, s)  - batch-aware encoding
+Latent: z           - batch-corrected representation
+Decoder: p(x|z, s)  - batch-specific decoding
+```
+
+**Key insight**: Batch info flows through decoder, not latent space
+- z captures biological variation
+- s explains technical variation
+- Separable biology and batch effects
+
+### Deep Generative Modeling
+
+**Generative model**: Learns p(x), the data distribution
+
+**Process**:
+1. Sample latent variable: z ~ p(z) = N(0, I)
+2. Generate expression: x ~ p(x|z)
+3. Joint distribution: p(x, z) = p(x|z)p(z)
+
+**Benefits**:
+- Generate synthetic cells
+- Impute missing values
+- Quantify uncertainty
+- Perform counterfactual predictions
+
+**Inference network**: Inverts generative process
+- Given x, infer z
+- q(z|x) approximates true posterior p(z|x)
+
+## Model Architecture Details
+
+### scVI Architecture
+
+**Input**: Gene expression counts x ∈ ℕ^G (G genes)
+
+**Encoder**:
+```
+h = ReLU(W₁·x + b₁)
+μ_z = W₂·h + b₂
+log σ²_z = W₃·h + b₃
+z ~ N(μ_z, σ²_z)
+```
+
+**Latent space**: z ∈ ℝ^d (typically d=10-30)
+
+**Decoder**:
+```
+h = ReLU(W₄·z + b₄)
+μ = softmax(W₅·h + b₅) · library_size
+θ = exp(φ)               # free per-gene parameter by default; network output W₆·h + b₆ only with dispersion="gene-cell"
+π = sigmoid(W₇·h + b₇)  # for ZINB
+x ~ ZINB(μ, θ, π)
+```
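+
+The decoder equations can be traced with a small NumPy forward pass. This is a sketch under stated assumptions: randomly initialized weights stand in for trained parameters, θ is treated as a free per-gene parameter, and the layer sizes are arbitrary:
+
+```python
+import numpy as np
+
+
+def softmax(a):
+    """Row-wise softmax with max-subtraction for numerical stability."""
+    e = np.exp(a - a.max(axis=-1, keepdims=True))
+    return e / e.sum(axis=-1, keepdims=True)
+
+
+rng = np.random.default_rng(0)
+n_cells, n_latent, n_hidden, n_genes = 4, 10, 16, 50
+
+z = rng.standard_normal((n_cells, n_latent))
+library_size = np.array([1000.0, 2500.0, 800.0, 4000.0])
+
+# Random weights stand in for trained decoder parameters
+W4, b4 = rng.standard_normal((n_latent, n_hidden)), np.zeros(n_hidden)
+W5, b5 = rng.standard_normal((n_hidden, n_genes)), np.zeros(n_genes)
+W7, b7 = rng.standard_normal((n_hidden, n_genes)), np.zeros(n_genes)
+log_theta = rng.standard_normal(n_genes)
+
+h = np.maximum(0.0, z @ W4 + b4)           # ReLU hidden layer
+rho = softmax(h @ W5 + b5)                 # gene proportions, rows sum to 1
+mu = rho * library_size[:, None]           # NB mean scaled by library size
+theta = np.exp(log_theta)                  # positive dispersion
+pi = 1.0 / (1.0 + np.exp(-(h @ W7 + b7)))  # dropout probability
+
+print(mu.sum(axis=1))
+```
+
+The softmax makes each cell's expected counts sum to its library size, which is how sequencing depth is separated from relative expression.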
+
+**Objective (ELBO, maximized; the training loss is its negative)**:
+```
+L = E_q[log p(x|z)] - KL(q(z|x) || N(0,I))
+```
+
+### Handling Covariates
+
+**Categorical covariates** (batch, donor, etc.):
+- One-hot encoded: s ∈ {0,1}^K
+- Concatenate with latent: [z, s]
+- Or use conditional layers
+
+**Continuous covariates** (library size, percent_mito):
+- Standardize to zero mean, unit variance
+- Include in encoder and/or decoder
+
+**Covariate injection strategies**:
+- **Concatenation**: [z, s] fed to decoder
+- **Deep injection**: s added at multiple layers
+- **Conditional batch norm**: Batch-specific normalization
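+
+The concatenation strategy can be sketched in NumPy. The batch labels and covariate values below are made up for illustration; in scvi-tools the equivalent step happens inside the model's PyTorch modules:
+
+```python
+import numpy as np
+
+n_cells, n_latent, n_batches = 5, 3, 2
+rng = np.random.default_rng(0)
+
+z = rng.standard_normal((n_cells, n_latent))
+batch_index = np.array([0, 1, 1, 0, 1])
+percent_mito = np.array([2.0, 8.0, 5.0, 3.0, 12.0])
+
+# Categorical covariate: one-hot encode, shape (n_cells, n_batches)
+s = np.eye(n_batches)[batch_index]
+
+# Continuous covariate: standardize to zero mean and unit variance
+c = (percent_mito - percent_mito.mean()) / percent_mito.std()
+
+# Decoder input: latent code followed by the covariates
+decoder_input = np.concatenate([z, s, c[:, None]], axis=1)
+print(decoder_input.shape)
+```
+
+Deep injection repeats the same concatenation at every hidden layer instead of only at the decoder input.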
+
+## Advanced Theoretical Concepts
+
+### Transfer Learning (scArches)
+
+**Concept**: Use pretrained model as initialization for new data
+
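+**Process**:
+1. Train reference model on large dataset
+2. Freeze the trained reference weights
+3. Add new input weights for the query's batch categories and train only those on the query data
+4. Optionally unfreeze more of the network and fine-tune with a lower learning rate
+
+**Why it works**:
+- Reference weights retain general cellular representations
+- The small set of new weights absorbs query-specific batch effects
+- Freezing keeps reference embeddings unchanged and prevents catastrophic forgetting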
+
+**Applications**:
+- Query-to-reference mapping
+- Few-shot learning for rare cell types
+- Rapid analysis of new datasets
+
+### Multi-Resolution Modeling (MrVI)
+
+**Idea**: Separate shared and sample-specific variation
+
+**Latent space decomposition**:
+```
+z = z_shared + z_sample
+```
+- **z_shared**: Common across samples
+- **z_sample**: Sample-specific effects
+
+**Hierarchical structure**:
+```
+Sample level: ρ_s ~ N(0, I)
+Cell level: z_i ~ N(ρ_{s(i)}, σ²)
+```
+
+**Benefits**:
+- Disentangle biological sources of variation
+- Compare samples at different resolutions
+- Identify sample-specific cell states
+
+### Counterfactual Prediction
+
+**Goal**: Predict outcome under different conditions
+
+**Example**: "What would this cell look like if from different batch?"
+
+**Method**:
+1. Encode cell to latent: z = Encoder(x, s_original)
+2. Decode with new condition: x_new = Decoder(z, s_new)
+3. x_new is counterfactual prediction
+
+**Applications**:
+- Batch effect assessment
+- Predicting treatment response
+- In silico perturbation studies
+
+### Posterior Predictive Distribution
+
+**Definition**: Distribution of new data given observed data
+
+```
+p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz
+```
+
+**Estimation**: Sample z from q(z|x), generate x_new from p(x_new|z)
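+
+The estimation step can be sketched as a two-stage Monte Carlo draw. The Gaussian posterior parameters and the Poisson decoder with rate exp(z) are toy assumptions chosen so that the predictive mean has a known closed form:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+n_samples = 200_000
+
+# Hypothetical approximate posterior for one cell: q(z|x) = N(m, s^2)
+m, s = 1.0, 0.5
+z = rng.normal(m, s, size=n_samples)
+
+# Toy decoder (an assumption for illustration): x_new | z ~ Poisson(exp(z))
+x_new = rng.poisson(np.exp(z))
+
+predictive_mean = x_new.mean()
+analytic_mean = np.exp(m + s**2 / 2)  # E[exp(z)] for Gaussian z
+lower, upper = np.quantile(x_new, [0.025, 0.975])
+print(predictive_mean, analytic_mean, (lower, upper))
+```
+
+The quantiles of `x_new` give a predictive interval that reflects both uncertainty in z and count noise from the likelihood.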
+
+**Uses**:
+- Uncertainty quantification
+- Robust predictions
+- Outlier detection
+
+## Differential Expression Framework
+
+### Bayesian Approach
+
+**Traditional methods**: Compare point estimates
+- Wilcoxon, t-test, etc.
+- Ignore uncertainty
+- Require pseudocounts
+
+**scvi-tools approach**: Compare distributions
+- Sample from posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
+- Compute log fold-change: LFC = log(μ_B) - log(μ_A)
+- Posterior distribution of LFC quantifies uncertainty
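+
+A minimal sketch of this comparison, using simulated log-normal draws as stand-ins for posterior samples of normalized expression. In practice the samples come from the trained model; the helper name and the threshold δ = 0.25 are illustrative:
+
+```python
+import numpy as np
+
+
+def prob_de(mu_a, mu_b, delta=0.25):
+    """Posterior probability that |LFC| exceeds delta, from paired samples."""
+    lfc = np.log(mu_b) - np.log(mu_a)
+    return np.mean(np.abs(lfc) > delta), lfc
+
+
+rng = np.random.default_rng(0)
+n = 10_000
+
+# Simulated posterior samples for a changed gene and a stable gene
+changed_a = np.exp(rng.normal(0.0, 0.1, n))
+changed_b = np.exp(rng.normal(1.5, 0.1, n))
+stable_a = np.exp(rng.normal(0.0, 0.05, n))
+stable_b = np.exp(rng.normal(0.0, 0.05, n))
+
+p_changed, lfc_changed = prob_de(changed_a, changed_b)
+p_stable, _ = prob_de(stable_a, stable_b)
+print(p_changed, p_stable, lfc_changed.mean())
+```
+
+The fraction of samples beyond δ is itself the posterior probability of differential expression, so no pseudocount or separate test statistic is needed.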
+
+### Bayes Factor
+
+**Definition**: Ratio of posterior odds to prior odds
+
+```
+BF = P(H₁|data) / P(H₀|data)
+     ─────────────────────────
+     P(H₁) / P(H₀)
+```
+
+**Interpretation**:
+- BF > 3: Moderate evidence for H₁
+- BF > 10: Strong evidence
+- BF > 100: Decisive evidence
+
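+**In scvi-tools**: Used to rank genes by evidence for DE; `differential_expression` reports `bayes_factor` as the natural log of the posterior odds, log(p / (1 − p)), so the thresholds above apply after exponentiating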
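+
+The definition translates directly into code. The sketch below assumes the posterior probability of H₁ is already available and defaults to equal prior odds; the function name is illustrative:
+
+```python
+import math
+
+
+def bayes_factor(posterior_h1, prior_h1=0.5):
+    """Posterior odds divided by prior odds for H1 versus H0."""
+    posterior_odds = posterior_h1 / (1.0 - posterior_h1)
+    prior_odds = prior_h1 / (1.0 - prior_h1)
+    return posterior_odds / prior_odds
+
+
+for p in (0.75, 0.95, 0.995):
+    bf = bayes_factor(p)
+    print(p, round(bf, 2), round(math.log(bf), 2))
+```
+
+With equal prior odds, a posterior probability of 0.75 corresponds to a Bayes factor of 3, the lower edge of the "moderate evidence" band.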
+
+### False Discovery Proportion (FDP)
+
+**Goal**: Control expected false discovery rate
+
+**Procedure**:
+1. For each gene, compute posterior probability of DE
+2. Rank genes by evidence (Bayes factor)
+3. Select top k genes such that E[FDP] ≤ α
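+
+The procedure above can be sketched as follows, assuming per-gene posterior DE probabilities are already computed. The expected FDP of a top-k set is the average probability of not being DE among the selected genes; the function name and example values are illustrative:
+
+```python
+import numpy as np
+
+
+def select_by_expected_fdp(prob_de, alpha=0.05):
+    """Return indices of the largest top-k gene set with expected FDP <= alpha."""
+    prob_de = np.asarray(prob_de, dtype=float)
+    order = np.argsort(-prob_de)       # most likely DE first
+    false_prob = 1.0 - prob_de[order]  # P(gene is not DE)
+    expected_fdp = np.cumsum(false_prob) / np.arange(1, len(order) + 1)
+    passing = np.nonzero(expected_fdp <= alpha)[0]
+    k = passing[-1] + 1 if passing.size else 0
+    return order[:k]
+
+
+probs = [0.2, 0.99, 0.6, 0.95, 0.9]
+selected = select_by_expected_fdp(probs, alpha=0.1)
+print(selected)
+```
+
+Because genes are sorted by decreasing probability, the expected FDP never decreases as k grows, so the last passing k is the largest admissible set.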
+
+**Advantage over p-values**:
+- Fully Bayesian
+- Natural for posterior inference
+- No separate multiple-testing correction; the effect-size threshold and target α remain explicit user choices
+
+## Implementation Details
+
+### Optimization
+
+**Optimizer**: Adam (adaptive learning rates)
+- Default lr = 0.001
+- Momentum parameters: β₁=0.9, β₂=0.999
+
+**Training loop**:
+1. Sample mini-batch of cells
+2. Compute ELBO loss
+3. Backpropagate gradients
+4. Update parameters with Adam
+5. Repeat until convergence
+
+**Convergence criteria**:
+- ELBO plateaus on validation set
+- Early stopping prevents overfitting
+- Typically 200-500 epochs
+
+### Regularization
+
+**KL annealing**: Gradually increase KL weight
+- Prevents posterior collapse
+- Starts at 0, increases to 1 over epochs
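+
+A linear warm-up schedule is one common way to implement this. The sketch below is illustrative: the function names are hypothetical, and the 400-epoch default is an assumption rather than a statement about any particular scvi-tools version:
+
+```python
+def kl_weight(epoch, n_epochs_warmup=400, min_weight=0.0, max_weight=1.0):
+    """Linear KL warm-up from min_weight to max_weight over n_epochs_warmup epochs."""
+    if n_epochs_warmup <= 0:
+        return max_weight
+    fraction = min(epoch / n_epochs_warmup, 1.0)
+    return min_weight + (max_weight - min_weight) * fraction
+
+
+def annealed_loss(reconstruction_loss, kl_divergence, epoch):
+    """Negative ELBO with the KL term scaled by the current warm-up weight."""
+    return reconstruction_loss + kl_weight(epoch) * kl_divergence
+
+
+for epoch in (0, 100, 400, 500):
+    print(epoch, kl_weight(epoch), annealed_loss(1200.0, 40.0, epoch))
+```
+
+Early in training the model is free to use the latent space for reconstruction; the full KL penalty only applies once the warm-up is complete.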
+
+**Dropout**: Random neuron dropping during training
+- Default: 0.1 dropout rate
+- Prevents overfitting
+- Improves generalization
+
+**Weight decay**: L2 regularization on weights
+- Prevents large weights
+- Improves stability
+
+### Scalability
+
+**Mini-batch training**:
+- Process subset of cells per iteration
+- Batch size: 64-256 cells
+- Enables scaling to millions of cells
+
+**Stochastic optimization**:
+- Estimates ELBO on mini-batches
+- Unbiased gradient estimates
+- Converges to a local optimum of the ELBO under standard step-size conditions
+
+**GPU acceleration**:
+- Neural networks naturally parallelize
+- Order of magnitude speedup
+- Essential for large datasets
+
+## Connections to Other Methods
+
+### vs. PCA
+- **PCA**: Linear, deterministic
+- **scVI**: Nonlinear, probabilistic
+- **Advantage**: scVI captures complex structure, handles counts
+
+### vs. t-SNE/UMAP
+- **t-SNE/UMAP**: Visualization-focused
+- **scVI**: Full generative model
+- **Advantage**: scVI enables downstream tasks (DE, imputation)
+
+### vs. Seurat Integration
+- **Seurat**: Anchor-based alignment
+- **scVI**: Probabilistic modeling
+- **Advantage**: scVI provides uncertainty estimates and models many batches jointly in a single model
+
+### vs. Harmony
+- **Harmony**: PCA + batch correction
+- **scVI**: VAE-based
+- **Advantage**: scVI handles counts natively, more flexible
+
+## Mathematical Notation
+
+**Common symbols**:
+- x: Observed gene expression (counts)
+- z: Latent representation
+- θ: Model parameters
+- q(z|x): Approximate posterior (encoder)
+- p(x|z): Likelihood (decoder)
+- p(z): Prior on latent variables
+- μ, σ²: Mean and variance
+- π: Dropout probability (ZINB)
+- θ (in NB): Dispersion parameter
+- s: Batch/covariate indicator
+
+## Further Reading
+
+**Key Papers**:
+1. Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
+2. Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models"
+3. Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"
+
+**Concepts to explore**:
+- Variational inference in machine learning
+- Bayesian deep learning
+- Information theory (KL divergence, mutual information)
+- Generative models (GANs, normalizing flows, diffusion models)
+- Probabilistic programming (Pyro, NumPyro)
+
+**Mathematical background**:
+- Probability theory and statistics
+- Linear algebra and calculus
+- Optimization theory
+- Information theory
--- a/skills/scvi-tools/references/workflows.md
+++ b/skills/scvi-tools/references/workflows.md
@@ -0,0 +1,546 @@
+# Common Workflows and Best Practices
+
+This document covers common workflows, best practices, and advanced usage patterns for scvi-tools.
+
+## Standard Analysis Workflow
+
+### 1. Data Loading and Preparation
+
+```python
+import scvi
+import scanpy as sc
+import numpy as np
+
+# Load data (AnnData format required)
+adata = sc.read_h5ad("data.h5ad")
+# Or load from other formats
+# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
+# adata = sc.read_csv("counts.csv")
+
+# Basic QC metrics (flag mitochondrial genes first so pct_counts_mt is computed)
+adata.var['mt'] = adata.var_names.str.startswith('MT-')
+sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
+```
+
+### 2. Quality Control
+
+```python
+# Filter cells
+sc.pp.filter_cells(adata, min_genes=200)
+sc.pp.filter_cells(adata, max_genes=5000)
+
+# Filter genes
+sc.pp.filter_genes(adata, min_cells=3)
+
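+# Filter by mitochondrial content (copy to avoid working on a view)
+adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()
+
+# Remove doublets (optional, before training; run on raw counts)
+sc.pp.scrublet(adata)  # scanpy >= 1.10; older versions: sc.external.pp.scrublet
+adata = adata[~adata.obs['predicted_doublet'], :].copy()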
+```
+
+### 3. Preprocessing for scvi-tools
+
+```python
+# IMPORTANT: scvi-tools needs RAW counts
+# If you've already normalized, use the raw layer or reload data
+
+# Save raw counts if not already available
+if 'counts' not in adata.layers:
+    adata.layers['counts'] = adata.X.copy()
+
+# Feature selection (optional but recommended)
+sc.pp.highly_variable_genes(
+    adata,
+    n_top_genes=4000,
+    flavor="seurat_v3",  # Operates on raw counts; requires scikit-misc
+    layer="counts",
+    subset=False,  # Keep all genes, just mark HVGs
+    batch_key="batch"  # If multiple batches
+)
+
+# Filter to HVGs (optional)
+# adata = adata[:, adata.var['highly_variable']].copy()
+```
+
+### 4. Register Data with scvi-tools
+
+```python
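+# Setup AnnData for scvi-tools
+scvi.model.SCVI.setup_anndata(
+    adata,
+    layer="counts",  # Use raw counts
+    batch_key="batch",  # Technical batches
+    # Only list nuisance factors here: variation explained by a covariate
+    # (e.g., a biological condition) is removed from the latent space
+    categorical_covariate_keys=["donor"],
+    continuous_covariate_keys=["pct_counts_mt"]  # Column from calculate_qc_metrics
+)
+
+# Check registration once a model exists: model.view_anndata_setup()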
+```
+
+### 5. Model Training
+
+```python
+# Create model
+model = scvi.model.SCVI(
+    adata,
+    n_latent=30,  # Latent dimensions
+    n_layers=2,   # Network depth
+    n_hidden=128, # Hidden layer size
+    dropout_rate=0.1,
+    gene_likelihood="zinb"  # zero-inflated negative binomial
+)
+
+# Train model
+model.train(
+    max_epochs=400,
+    batch_size=128,
+    train_size=0.9,
+    early_stopping=True,
+    check_val_every_n_epoch=10
+)
+
+# View training history
+train_history = model.history["elbo_train"]
+val_history = model.history["elbo_validation"]
+```
+
+### 6. Extract Results
+
+```python
+# Get latent representation
+latent = model.get_latent_representation()
+adata.obsm["X_scVI"] = latent
+
+# Get normalized expression
+normalized = model.get_normalized_expression(
+    adata,
+    library_size=1e4,
+    n_samples=25,  # Monte Carlo samples
+    return_numpy=True  # Array instead of DataFrame, for storage in .layers
+)
+adata.layers["scvi_normalized"] = normalized
+```
+
+### 7. Downstream Analysis
+
+```python
+# Clustering on scVI latent space
+sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=15)
+sc.tl.umap(adata, min_dist=0.3)
+sc.tl.leiden(adata, resolution=0.8, key_added="leiden")
+
+# Visualization
+sc.pl.umap(adata, color=["leiden", "batch", "cell_type"])
+
+# Differential expression
+de_results = model.differential_expression(
+    groupby="leiden",
+    group1="0",
+    group2="1",
+    mode="change",
+    delta=0.25
+)
+```
+
+### 8. Model Persistence
+
+```python
+# Save model
+model_dir = "./scvi_model/"
+model.save(model_dir, overwrite=True)
+
+# Save AnnData with results
+adata.write("analyzed_data.h5ad")
+
+# Load model later
+model = scvi.model.SCVI.load(model_dir, adata=adata)
+```
+
+## Hyperparameter Tuning
+
+### Key Hyperparameters
+
+**Architecture**:
+- `n_latent`: Latent space dimensionality (10-50)
+  - Larger for complex, heterogeneous datasets
+  - Smaller for simple datasets or to prevent overfitting
+- `n_layers`: Number of hidden layers (1-3)
+  - More layers for complex data, but diminishing returns
+- `n_hidden`: Nodes per hidden layer (64-256)
+  - Scale with dataset size and complexity
+
+**Training**:
+- `max_epochs`: Maximum number of training epochs (200-500)
+  - Use early stopping to prevent overfitting
+- `batch_size`: Samples per batch (64-256)
+  - Larger for big datasets, smaller for limited memory
+- `lr`: Learning rate (0.001 default, usually good)
+
+**Model-specific**:
+- `gene_likelihood`: Distribution ("zinb", "nb", "poisson")
+  - "zinb" for sparse data with zero-inflation
+  - "nb" for less sparse data
+- `dispersion`: Gene or gene-batch specific
+  - "gene" for simple, "gene-batch" for complex batch effects
+
+### Hyperparameter Search Example
+
+```python
+from scvi.model import SCVI
+
+# Define search space
+latent_dims = [10, 20, 30]
+n_layers_options = [1, 2]
+
+# scvi-tools logs the negative ELBO, so lower is better
+best_score = float("inf")
+best_params = None
+
+for n_latent in latent_dims:
+    for n_layers in n_layers_options:
+        model = SCVI(
+            adata,
+            n_latent=n_latent,
+            n_layers=n_layers
+        )
+        # Validation metrics are only recorded when validation checks are enabled
+        model.train(max_epochs=200, check_val_every_n_epoch=1)
+
+        # Evaluate on validation set (history entries are DataFrames)
+        val_elbo = float(model.history["elbo_validation"].iloc[-1, 0])
+
+        if val_elbo < best_score:
+            best_score = val_elbo
+            best_params = {"n_latent": n_latent, "n_layers": n_layers}
+
+print(f"Best params: {best_params}")
+```
+
+### Using Optuna for Hyperparameter Optimization
+
+```python
+import optuna
+import scvi
+
+def objective(trial):
+    n_latent = trial.suggest_int("n_latent", 10, 50)
+    n_layers = trial.suggest_int("n_layers", 1, 3)
+    n_hidden = trial.suggest_categorical("n_hidden", [64, 128, 256])
+
+    model = scvi.model.SCVI(
+        adata,
+        n_latent=n_latent,
+        n_layers=n_layers,
+        n_hidden=n_hidden
+    )
+
+    model.train(max_epochs=200, early_stopping=True)
+    # History entries are DataFrames; take the last recorded validation value
+    return float(model.history["elbo_validation"].iloc[-1, 0])
+
+# The logged validation ELBO is a loss (negative ELBO), so minimize it
+study = optuna.create_study(direction="minimize")
+study.optimize(objective, n_trials=20)
+
+print(f"Best parameters: {study.best_params}")
+```
+
+## GPU Acceleration
+
+### Enable GPU Training
+
+```python
+# Automatic GPU detection
+model = scvi.model.SCVI(adata)
+model.train(accelerator="auto")  # Uses GPU if available
+
+# Force GPU
+model.train(accelerator="gpu")
+
+# Multi-GPU
+model.train(accelerator="gpu", devices=2)
+
+# Check if GPU is being used
+import torch
+print(f"CUDA available: {torch.cuda.is_available()}")
+print(f"GPU count: {torch.cuda.device_count()}")
+```
+
+### GPU Memory Management
+
+```python
+# Reduce batch size if OOM
+model.train(batch_size=64)  # Instead of default 128
+
+# Mixed precision training (saves memory)
+model.train(precision=16)
+
+# Clear cache between runs
+import torch
+torch.cuda.empty_cache()
+```
+
+## Batch Integration Strategies
+
+### Strategy 1: Simple Batch Key
+
+```python
+# For standard batch correction
+scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
+model = scvi.model.SCVI(adata)
+```
+
+### Strategy 2: Multiple Covariates
+
+```python
+# Correct for multiple technical factors
+scvi.model.SCVI.setup_anndata(
+    adata,
+    batch_key="sequencing_batch",
+    categorical_covariate_keys=["donor", "tissue"],
+    continuous_covariate_keys=["percent_mito"]
+)
+```
+
+### Strategy 3: Hierarchical Batches
+
+```python
+# When batches have hierarchical structure
+# E.g., samples within studies
+adata.obs["batch_hierarchy"] = (
+    adata.obs["study"].astype(str) + "_" +
+    adata.obs["sample"].astype(str)
+)
+
+scvi.model.SCVI.setup_anndata(adata, batch_key="batch_hierarchy")
+```
+
+## Reference Mapping (scArches)
+
+### Training Reference Model
+
+```python
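+# Train on reference dataset
+scvi.model.SCVI.setup_anndata(ref_adata, batch_key="batch")
+# Settings commonly recommended when the model will later be used for query mapping
+ref_model = scvi.model.SCVI(
+    ref_adata,
+    use_layer_norm="both",
+    use_batch_norm="none",
+    encode_covariates=True
+)
+ref_model.train()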
+
+# Save reference
+ref_model.save("reference_model")
+```
+
+### Mapping Query to Reference
+
+```python
+# Load reference
+ref_model = scvi.model.SCVI.load("reference_model", adata=ref_adata)
+
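+# Align query genes with the reference (reorders genes, pads missing ones with zeros)
+scvi.model.SCVI.prepare_query_anndata(query_adata, "reference_model")
+
+# Transfer learning: load_query_data registers the query with the reference setup,
+# so query_adata.obs needs the same "batch" column that the reference used
+query_model = scvi.model.SCVI.load_query_data(
+    query_adata,
+    "reference_model"
+)
+
+# Train on the query; most reference weights stay frozen by default and
+# only the parameters tied to the new batch categories are updated
+query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})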
+
+# Get query embeddings
+query_latent = query_model.get_latent_representation()
+
+# Transfer labels using KNN
+from sklearn.neighbors import KNeighborsClassifier
+
+knn = KNeighborsClassifier(n_neighbors=15)
+knn.fit(ref_model.get_latent_representation(), ref_adata.obs["cell_type"])
+query_adata.obs["predicted_cell_type"] = knn.predict(query_latent)
+```
+
+## Model Minification
+
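+Minification shrinks the AnnData attached to a trained model by replacing the count matrix with latent posterior parameters, which reduces disk and memory use and speeds up inference: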
+
+```python
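+# Train full model
+scvi.model.SCVI.setup_anndata(adata)
+model = scvi.model.SCVI(adata)
+model.train()
+
+# Store the latent posterior parameters that minification relies on
+qzm, qzv = model.get_latent_representation(give_mean=False, return_dist=True)
+model.adata.obsm["X_latent_qzm"] = qzm
+model.adata.obsm["X_latent_qzv"] = qzv
+
+# Minify: model.adata becomes a minified AnnData without the full count matrix
+model.minify_adata(
+    use_latent_qzm_key="X_latent_qzm",
+    use_latent_qzv_key="X_latent_qzv"
+)
+
+# Save the model together with its minified AnnData
+model.save("minified_model", save_anndata=True)
+
+# Load and use; the saved AnnData is read from the model directory
+mini_model = scvi.model.SCVI.load("minified_model")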
+```
+
+## Memory-Efficient Data Loading
+
+### Using AnnDataLoader
+
+```python
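+from scvi.dataloaders import AnnDataLoader
+
+# AnnDataLoader wraps an AnnDataManager, not a bare AnnData.
+# Here the manager comes from a model already set up on `adata`.
+dataloader = AnnDataLoader(
+    model.adata_manager,
+    batch_size=128,
+    shuffle=True,
+    drop_last=False
+)
+
+# Custom training loop (advanced)
+for batch in dataloader:
+    # Each batch is a dict of tensors keyed by registry name (e.g. "X")
+    pass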
+```
+
+### Using Backed AnnData
+
+```python
+# For data too large for memory
+adata = sc.read_h5ad("huge_dataset.h5ad", backed='r')
+
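+# Backed-mode support in scvi-tools is limited and version-dependent;
+# verify that setup and training work on your installation before relying on it
+scvi.model.SCVI.setup_anndata(adata)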
+model = scvi.model.SCVI(adata)
+model.train()
+```
+
+## Model Interpretation
+
+### Feature Importance with SHAP
+
+```python
+import shap
+
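+# Get SHAP values for interpretability.
+# Note: model.module's forward pass takes a dict of tensors and returns several
+# outputs, so DeepExplainer typically needs a wrapper torch.nn.Module that maps
+# an input tensor to a single output tensor in place of model.module.
+# `background_data` and `test_data` are placeholder tensors of expression values.
+explainer = shap.DeepExplainer(model.module, background_data)
+shap_values = explainer.shap_values(test_data)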
+
+# Visualize
+shap.summary_plot(shap_values, feature_names=adata.var_names)
+```
+
+### Gene Correlation Analysis
+
+```python
+# Get gene-gene correlation matrix (returned as a pandas DataFrame)
+correlation = model.get_feature_correlation_matrix(
+    adata,
+    transform_batch="batch1"
+)
+
+# Visualize correlations among the first 50 genes
+import seaborn as sns
+sns.heatmap(correlation.iloc[:50, :50], cmap="coolwarm")
+```
+
+## Troubleshooting Common Issues
+
+### Issue: NaN Loss During Training
+
+**Causes**:
+- Learning rate too high
+- Normalized or log-transformed input (scvi-tools expects raw counts)
+- Data quality issues
+
+**Solutions**:
+```python
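+# Reduce learning rate (set through the training plan, not directly on train())
+model.train(plan_kwargs={"lr": 1e-4})
+
+# Check data (works for dense and scipy sparse matrices)
+import numpy as np
+from scipy import sparse
+
+values = adata.X.data if sparse.issparse(adata.X) else np.asarray(adata.X)
+assert values.min() >= 0  # No negative values
+assert not np.isnan(values).any()  # No NaNs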
+
+# Use more stable likelihood
+model = scvi.model.SCVI(adata, gene_likelihood="nb")
+```
+
+### Issue: Poor Batch Correction
+
+**Solutions**:
+```python
+# Increase batch effect modeling
+model = scvi.model.SCVI(
+    adata,
+    encode_covariates=True,  # Encode batch in encoder
+    deeply_inject_covariates=False
+)
+
+# Or try opposite
+model = scvi.model.SCVI(adata, deeply_inject_covariates=True)
+
+# Use more latent dimensions
+model = scvi.model.SCVI(adata, n_latent=50)
+```
+
+### Issue: Model Not Training (ELBO Not Decreasing)
+
+**Solutions**:
+```python
+# Increase learning rate (set through the training plan)
+model.train(plan_kwargs={"lr": 0.005})
+
+# Increase network capacity
+model = scvi.model.SCVI(adata, n_hidden=256, n_layers=2)
+
+# Train longer
+model.train(max_epochs=500)
+```
+
+### Issue: Out of Memory (OOM)
+
+**Solutions**:
+```python
+# Reduce batch size
+model.train(batch_size=64)
+
+# Use mixed precision
+model.train(precision=16)
+
+# Reduce model size
+model = scvi.model.SCVI(adata, n_latent=10, n_hidden=64)
+
+# Use backed AnnData
+adata = sc.read_h5ad("data.h5ad", backed='r')
+```
+
+## Performance Benchmarking
+
+```python
+import time
+
+# Time training
+start = time.time()
+model.train(max_epochs=400)
+training_time = time.time() - start
+print(f"Training time: {training_time:.2f}s")
+
+# Time inference
+start = time.time()
+latent = model.get_latent_representation()
+inference_time = time.time() - start
+print(f"Inference time: {inference_time:.2f}s")
+
+# Memory usage
+import psutil
+import os
+process = psutil.Process(os.getpid())
+memory_gb = process.memory_info().rss / 1024**3
+print(f"Memory usage: {memory_gb:.2f} GB")
+```
+
+## Best Practices Summary
+
+1. **Always use raw counts**: Pass unnormalized counts to scvi-tools; keep normalized or log-transformed values in a separate layer
+2. **Feature selection**: Use highly variable genes for efficiency
+3. **Batch correction**: Register all known technical covariates
+4. **Early stopping**: Use validation set to prevent overfitting
+5. **Model saving**: Always save trained models
+6. **GPU usage**: Use GPU for large datasets (>10k cells)
+7. **Hyperparameter tuning**: Start with defaults, tune if needed
+8. **Validation**: Check batch correction visually (UMAP colored by batch)
+9. **Documentation**: Keep track of preprocessing steps
+10. **Reproducibility**: Set random seeds (`scvi.settings.seed = 0`)