Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/scvi-tools/SKILL.md
+++ b/skills/scvi-tools/SKILL.md
@@ -0,0 +1,184 @@
+---
+name: scvi-tools
+description: This skill should be used when working with single-cell omics data analysis using scvi-tools, including scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics, and other single-cell modalities. Use this skill for probabilistic modeling, batch correction, dimensionality reduction, differential expression, cell type annotation, multimodal integration, and spatial analysis tasks.
+---
+
+# scvi-tools
+
+## Overview
+
+scvi-tools is a Python framework for probabilistic modeling of single-cell omics data. Built on PyTorch and PyTorch Lightning, it provides deep generative models, trained with variational inference, for a range of single-cell data modalities.
+
+## When to Use This Skill
+
+Use this skill when:
+- Analyzing single-cell RNA-seq data (dimensionality reduction, batch correction, integration)
+- Working with single-cell ATAC-seq or chromatin accessibility data
+- Integrating multimodal data (CITE-seq, multiome, paired/unpaired datasets)
+- Analyzing spatial transcriptomics data (deconvolution, spatial mapping)
+- Performing differential expression analysis on single-cell data
+- Conducting cell type annotation or transfer learning tasks
+- Working with specialized single-cell modalities (methylation, cytometry, RNA velocity)
+- Building custom probabilistic models for single-cell analysis
+
+## Core Capabilities
+
+scvi-tools provides models organized by data modality:
+
+### 1. Single-Cell RNA-seq Analysis
+Core models for expression analysis, batch correction, and integration. See `references/models-scrna-seq.md` for:
+- **scVI**: Unsupervised dimensionality reduction and batch correction
+- **scANVI**: Semi-supervised cell type annotation and integration
+- **AUTOZI**: Zero-inflation detection and modeling
+- **VeloVI**: RNA velocity analysis
+- **contrastiveVI**: Perturbation effect isolation
+
+### 2. Chromatin Accessibility (ATAC-seq)
+Models for analyzing single-cell chromatin data. See `references/models-atac-seq.md` for:
+- **PeakVI**: Peak-based ATAC-seq analysis and integration
+- **PoissonVI**: Quantitative fragment count modeling
+- **scBasset**: Deep learning approach with motif analysis
+
+### 3. Multimodal & Multi-omics Integration
+Joint analysis of multiple data types. See `references/models-multimodal.md` for:
+- **totalVI**: CITE-seq protein and RNA joint modeling
+- **MultiVI**: Paired and unpaired multi-omic integration
+- **MrVI**: Multi-resolution cross-sample analysis
+
+### 4. Spatial Transcriptomics
+Spatially-resolved transcriptomics analysis. See `references/models-spatial.md` for:
+- **DestVI**: Multi-resolution spatial deconvolution
+- **Stereoscope**: Cell type deconvolution
+- **Tangram**: Spatial mapping and integration
+- **scVIVA**: Cell-environment relationship analysis
+
+### 5. Specialized Modalities
+Additional specialized analysis tools. See `references/models-specialized.md` for:
+- **MethylVI/MethylANVI**: Single-cell methylation analysis
+- **CytoVI**: Flow/mass cytometry batch correction
+- **Solo**: Doublet detection
+- **CellAssign**: Marker-based cell type annotation
+
+## Typical Workflow
+
+All scvi-tools models follow a consistent API pattern:
+
+```python
+# 1. Load and preprocess data (AnnData format)
+import scvi
+import scanpy as sc
+
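+adata = scvi.data.heart_cell_atlas_subsampled()
+sc.pp.filter_genes(adata, min_counts=3)
+adata.layers["counts"] = adata.X.copy()  # Keep raw counts in a layer for the model
+# flavor="seurat_v3" selects genes from raw counts (needs the scikit-misc package)
+sc.pp.highly_variable_genes(
+    adata, n_top_genes=1200, subset=True, layer="counts", flavor="seurat_v3"
+)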
+
+# 2. Register data with model (specify layers, covariates)
+scvi.model.SCVI.setup_anndata(
+    adata,
+    layer="counts",  # Use raw counts, not log-normalized
+    batch_key="cell_source",  # obs column that identifies the batch in this dataset
+    categorical_covariate_keys=["donor"],
+    continuous_covariate_keys=["percent_mito"]
+)
+
+# 3. Create and train model
+model = scvi.model.SCVI(adata)
+model.train()
+
+# 4. Extract latent representations and normalized values
+latent = model.get_latent_representation()
+normalized = model.get_normalized_expression(library_size=1e4)
+
+# 5. Store in AnnData for downstream analysis
+adata.obsm["X_scVI"] = latent
+adata.layers["scvi_normalized"] = normalized
+
+# 6. Downstream analysis with scanpy
+sc.pp.neighbors(adata, use_rep="X_scVI")
+sc.tl.umap(adata)
+sc.tl.leiden(adata)
+```
+
+**Key Design Principles:**
+- **Raw counts required**: Models expect unnormalized counts because their count likelihoods (e.g., negative binomial) are defined on integer data, not log-normalized values
+- **Unified API**: Consistent interface across all models (setup → train → extract)
+- **AnnData-centric**: Seamless integration with the scanpy ecosystem
+- **GPU acceleration**: Training uses an available GPU by default (`accelerator="auto"`)
+- **Batch correction**: Handle technical variation through covariate registration
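+
+As a rough pre-flight check for the raw-counts principle above, the sketch below tests whether a matrix holds only non-negative integers. The helper name `looks_like_raw_counts` and its `n_check` parameter are hypothetical and not part of scvi-tools; the check is a heuristic, so a pass does not prove the data are raw counts.
+
+```python
+import numpy as np
+
+
+def looks_like_raw_counts(X, n_check=1000):
+    """Heuristic: do the first `n_check` rows contain only non-negative integers?"""
+    block = X[:n_check]
+    # Assumption: SciPy sparse matrices expose `nnz` and keep stored values in `.data`;
+    # anything else is treated as a dense array.
+    values = block.data if hasattr(block, "nnz") else block
+    values = np.asarray(values, dtype=float)
+    if values.size == 0:
+        return True
+    return bool(np.all(values >= 0) and np.all(values == np.round(values)))
+
+
+raw = np.array([[0, 3, 1], [5, 0, 2]])
+print(looks_like_raw_counts(raw))            # integer counts
+print(looks_like_raw_counts(np.log1p(raw)))  # log-transformed values
+```
+
+In a real analysis the same call could be applied to `adata.layers["counts"]` before `setup_anndata`.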
+
+## Common Analysis Tasks
+
+### Differential Expression
+Probabilistic DE analysis using the learned generative models:
+
+```python
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="TypeA",
+    group2="TypeB",
+    mode="change",  # Composite hypothesis test: is |log fold change| > delta?
+    delta=0.25      # Minimum absolute log fold change counted as DE
+)
+```
+
+See `references/differential-expression.md` for detailed methodology and interpretation.
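+
+To help read the output, the sketch below shows how a posterior probability of differential expression maps to a log Bayes factor. It assumes the result table reports a `proba_de` column and a natural-log `bayes_factor`; the helper `bayes_factor` here is a hypothetical stand-alone function and not the library's implementation.
+
+```python
+import math
+
+
+def bayes_factor(proba_de, eps=1e-8):
+    """Natural-log Bayes factor of 'DE' versus 'not DE' (eps avoids log(0))."""
+    return math.log(proba_de + eps) - math.log(1.0 - proba_de + eps)
+
+
+for p in (0.5, 0.9, 0.99):
+    print(f"proba_de={p:.2f} -> log Bayes factor={bayes_factor(p):.2f}")
+```
+
+A probability of 0.5 gives a log Bayes factor near zero (no evidence either way), and the value grows quickly as the probability approaches 1.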
+
+### Model Persistence
+Save and load trained models:
+
+```python
+# Save model
+model.save("./model_directory", overwrite=True)
+
+# Load model
+model = scvi.model.SCVI.load("./model_directory", adata=adata)
+```
+
+### Batch Correction and Integration
+Integrate datasets across batches or studies:
+
+```python
+# Register batch information
+scvi.model.SCVI.setup_anndata(adata, batch_key="study")
+
+# Model automatically learns batch-corrected representations
+model = scvi.model.SCVI(adata)
+model.train()
+latent = model.get_latent_representation()  # Batch-corrected
+```
+
+## Theoretical Foundations
+
+scvi-tools is built on:
+- **Variational inference**: Approximate posterior distributions for scalable Bayesian inference
+- **Deep generative models**: VAE architectures that learn complex data distributions
+- **Amortized inference**: Shared neural networks for efficient learning across cells
+- **Probabilistic modeling**: Principled uncertainty quantification and statistical testing
+
+See `references/theoretical-foundations.md` for detailed background on the mathematical framework.
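+
+For orientation, the objective that variational autoencoders maximize is the evidence lower bound (ELBO). The generic form is written below; the symbols ($x$ for a cell's observed counts, $z$ for its latent representation, $\theta$ and $\phi$ for decoder and encoder parameters) are notation assumed here, and individual scvi-tools models add further terms such as batch covariates.
+
+```latex
+\log p_\theta(x) \;\ge\;
+\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
+\;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
+```
+
+The first term rewards accurate reconstruction of the counts; the second keeps the approximate posterior close to the prior.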
+
+## Additional Resources
+
+- **Workflows**: `references/workflows.md` contains common workflows, best practices, hyperparameter tuning, and GPU optimization
+- **Model References**: Detailed documentation for each model category in the `references/` directory
+- **Official Documentation**: https://docs.scvi-tools.org/en/stable/
+- **Tutorials**: https://docs.scvi-tools.org/en/stable/tutorials/index.html
+- **API Reference**: https://docs.scvi-tools.org/en/stable/api/index.html
+
+## Installation
+
+```bash
+uv pip install scvi-tools
+# For GPU support (quote the extra so shells such as zsh do not expand the brackets)
+uv pip install "scvi-tools[cuda]"
+```
+
+## Best Practices
+
+1. **Use raw counts**: Always provide unnormalized count data to models
+2. **Filter genes**: Remove low-count genes before analysis (e.g., `min_counts=3`)
+3. **Register covariates**: Include known technical factors (batch, donor, etc.) in `setup_anndata`
+4. **Feature selection**: Use highly variable genes for improved performance
+5. **Model saving**: Always save trained models to avoid retraining
+6. **GPU usage**: Train large datasets on a GPU; one is selected automatically when available, or request it explicitly with `model.train(accelerator="gpu")`
+7. **Scanpy integration**: Store outputs in AnnData objects for downstream analysis