Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/scvi-tools/references/models-atac-seq.md
+++ b/skills/scvi-tools/references/models-atac-seq.md
@@ -0,0 +1,321 @@
+# ATAC-seq and Chromatin Accessibility Models
+
+This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.
+
+## PeakVI
+
+**Purpose**: Analysis and integration of single-cell ATAC-seq data using peak counts.
+
+**Key Features**:
+- Variational autoencoder specifically designed for scATAC-seq peak data
+- Learns low-dimensional representations of chromatin accessibility
+- Performs batch correction across samples
+- Enables differential accessibility testing
+- Integrates multiple ATAC-seq datasets
+
+**When to Use**:
+- Analyzing scATAC-seq peak count matrices
+- Integrating multiple ATAC-seq experiments
+- Batch correction of chromatin accessibility data
+- Dimensionality reduction for ATAC-seq
+- Differential accessibility analysis between cell types or conditions
+
+**Data Requirements**:
+- Peak count matrix (cells × peaks)
+- Binary or count data for peak accessibility
+- Batch/sample annotations (optional, for batch correction)
+
+**Basic Usage**:
+```python
+import scvi
+
+# Prepare data (peaks should be in adata.X)
+# Optional: filter peaks
+sc.pp.filter_genes(adata, min_cells=3)
+
+# Setup data
+scvi.model.PEAKVI.setup_anndata(
+    adata,
+    batch_key="batch"
+)
+
+# Train model
+model = scvi.model.PEAKVI(adata)
+model.train()
+
+# Get latent representation (batch-corrected)
+latent = model.get_latent_representation()
+adata.obsm["X_PeakVI"] = latent
+
+# Differential accessibility
+da_results = model.differential_accessibility(
+    groupby="cell_type",
+    group1="TypeA",
+    group2="TypeB"
+)
+```
+
+**Key Parameters**:
+- `n_latent`: Dimensionality of latent space (default: 10)
+- `n_hidden`: Number of nodes per hidden layer (default: 128)
+- `n_layers`: Number of hidden layers (default: 1)
+- `region_factors`: Whether to learn region-specific factors (default: True)
+- `latent_distribution`: Distribution for latent space ("normal" or "ln")
+
+**Outputs**:
+- `get_latent_representation()`: Low-dimensional embeddings for cells
+- `get_accessibility_estimates()`: Normalized accessibility values
+- `differential_accessibility()`: Statistical testing for differential peaks
+- `get_region_factors()`: Peak-specific scaling factors
+
+**Best Practices**:
+1. Filter out low-quality peaks (present in very few cells)
+2. Include batch information if integrating multiple samples
+3. Use latent representations for clustering and UMAP visualization
+4. Consider using `region_factors=True` for datasets with high technical variation
+5. Store latent embeddings in `adata.obsm` for downstream analysis with scanpy
+
+## PoissonVI
+
+**Purpose**: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts).
+
+**Key Features**:
+- Models fragment counts directly (not just peak presence/absence)
+- Poisson distribution for count data
+- Captures quantitative differences in accessibility
+- Enables fine-grained analysis of chromatin state
+
+**When to Use**:
+- Analyzing fragment-level ATAC-seq data
+- Need quantitative accessibility measurements
+- Higher resolution analysis than binary peak calls
+- Investigating gradual changes in chromatin accessibility
+
+**Data Requirements**:
+- Fragment count matrix (cells × genomic regions)
+- Count data (not binary)
+
+**Basic Usage**:
+```python
+scvi.model.POISSONVI.setup_anndata(
+    adata,
+    batch_key="batch"
+)
+
+model = scvi.model.POISSONVI(adata)
+model.train()
+
+# Get results
+latent = model.get_latent_representation()
+accessibility = model.get_accessibility_estimates()
+```
+
+**Key Differences from PeakVI**:
+- **PeakVI**: Best for standard peak count matrices, faster
+- **PoissonVI**: Best for quantitative fragment counts, more detailed
+
+**When to Choose PoissonVI over PeakVI**:
+- Working with fragment counts rather than called peaks
+- Need to capture quantitative differences
+- Have high-quality, high-coverage data
+- Interested in subtle accessibility changes
+
+## scBasset
+
+**Purpose**: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.
+
+**Key Features**:
+- Convolutional neural network (CNN) architecture for sequence-based analysis
+- Models raw DNA sequences, not just peak counts
+- Enables motif discovery and transcription factor (TF) binding prediction
+- Provides interpretable feature importance
+- Performs batch correction
+
+**When to Use**:
+- Want to incorporate DNA sequence information
+- Interested in TF motif analysis
+- Need interpretable models (which sequences drive accessibility)
+- Analyzing regulatory elements and TF binding sites
+- Predicting accessibility from sequence alone
+
+**Data Requirements**:
+- Peak sequences (extracted from genome)
+- Peak accessibility matrix
+- Genome reference (for sequence extraction)
+
+**Basic Usage**:
+```python
+# scBasset requires sequence information
+# First, extract sequences for peaks
+from scbasset import utils
+sequences = utils.fetch_sequences(adata, genome="hg38")
+
+# Setup and train
+scvi.model.SCBASSET.setup_anndata(
+    adata,
+    batch_key="batch"
+)
+
+model = scvi.model.SCBASSET(adata, sequences=sequences)
+model.train()
+
+# Get latent representation
+latent = model.get_latent_representation()
+
+# Interpret model: which sequences/motifs are important
+importance_scores = model.get_feature_importance()
+```
+
+**Key Parameters**:
+- `n_latent`: Latent space dimensionality
+- `conv_layers`: Number of convolutional layers
+- `n_filters`: Number of filters per conv layer
+- `filter_size`: Size of convolutional filters
+
+**Advanced Features**:
+- **In silico mutagenesis**: Predict how sequence changes affect accessibility
+- **Motif enrichment**: Identify enriched TF motifs in accessible regions
+- **Batch correction**: Similar to other scvi-tools models
+- **Transfer learning**: Fine-tune on new datasets
+
+**Interpretability Tools**:
+```python
+# Get importance scores for sequences
+importance = model.get_sequence_importance(region_indices=[0, 1, 2])
+
+# Predict accessibility for new sequences
+predictions = model.predict_accessibility(new_sequences)
+```
+
+## Model Selection for ATAC-seq
+
+### PeakVI
+**Choose when**:
+- Standard scATAC-seq analysis workflow
+- Have peak count matrices (most common format)
+- Need fast, efficient batch correction
+- Want straightforward differential accessibility
+- Prioritize computational efficiency
+
+**Advantages**:
+- Fast training and inference
+- Proven track record for scATAC-seq
+- Easy integration with scanpy workflow
+- Robust batch correction
+
+### PoissonVI
+**Choose when**:
+- Have fragment-level count data
+- Need quantitative accessibility measures
+- Interested in subtle differences
+- Have high-coverage, high-quality data
+
+**Advantages**:
+- More detailed quantitative information
+- Better for gradient changes
+- Appropriate statistical model for counts
+
+### scBasset
+**Choose when**:
+- Want to incorporate DNA sequence
+- Need biological interpretation (motifs, TFs)
+- Interested in regulatory mechanisms
+- Have computational resources for CNN training
+- Want predictive power for new sequences
+
+**Advantages**:
+- Sequence-based, biologically interpretable
+- Motif and TF analysis built-in
+- Predictive modeling capabilities
+- In silico perturbation experiments
+
+## Workflow Example: Complete ATAC-seq Analysis
+
+```python
+import scvi
+import scanpy as sc
+
+# 1. Load and preprocess ATAC-seq data
+adata = sc.read_h5ad("atac_data.h5ad")
+
+# 2. Filter low-quality peaks
+sc.pp.filter_genes(adata, min_cells=10)
+
+# 3. Setup and train PeakVI
+scvi.model.PEAKVI.setup_anndata(
+    adata,
+    batch_key="sample"
+)
+
+model = scvi.model.PEAKVI(adata, n_latent=20)
+model.train(max_epochs=400)
+
+# 4. Extract latent representation
+latent = model.get_latent_representation()
+adata.obsm["X_PeakVI"] = latent
+
+# 5. Downstream analysis
+sc.pp.neighbors(adata, use_rep="X_PeakVI")
+sc.tl.umap(adata)
+sc.tl.leiden(adata, key_added="clusters")
+
+# 6. Differential accessibility
+da_results = model.differential_accessibility(
+    groupby="clusters",
+    group1="0",
+    group2="1"
+)
+
+# 7. Save model
+model.save("peakvi_model")
+```
+
+## Integration with Gene Expression (RNA+ATAC)
+
+For paired multimodal data (RNA+ATAC from same cells), use **MultiVI** instead:
+
+```python
+# For 10x Multiome or similar paired data
+scvi.model.MULTIVI.setup_anndata(
+    adata,
+    batch_key="sample",
+    modality_key="modality"  # "RNA" or "ATAC"
+)
+
+model = scvi.model.MULTIVI(adata)
+model.train()
+
+# Get joint latent space
+latent = model.get_latent_representation()
+```
+
+See `models-multimodal.md` for more details on multimodal integration.
+
+## Best Practices for ATAC-seq Analysis
+
+1. **Quality Control**:
+   - Filter cells with very low or very high peak counts
+   - Remove peaks present in very few cells
+   - Filter mitochondrial and sex chromosome peaks if needed
+
+2. **Batch Correction**:
+   - Always include `batch_key` if integrating multiple samples
+   - Consider technical covariates (sequencing depth, TSS enrichment)
+
+3. **Feature Selection**:
+   - Unlike RNA-seq, all peaks are often used
+   - Consider filtering very rare peaks for efficiency
+
+4. **Latent Dimensions**:
+   - Start with `n_latent=10-30` depending on dataset complexity
+   - Larger values for more heterogeneous datasets
+
+5. **Downstream Analysis**:
+   - Use latent representations for clustering and visualization
+   - Link peaks to genes for regulatory analysis
+   - Perform motif enrichment on cluster-specific peaks
+
+6. **Computational Considerations**:
+   - ATAC-seq matrices are often very large (many peaks)
+   - Consider downsampling peaks for initial exploration
+   - Use GPU acceleration for large datasets