Files
gh-k-dense-ai-claude-scient…/skills/scvi-tools/references/models-atac-seq.md
2025-11-30 08:30:10 +08:00

322 lines
9.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ATAC-seq and Chromatin Accessibility Models
This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.
## PeakVI
**Purpose**: Analysis and integration of single-cell ATAC-seq data using peak counts.
**Key Features**:
- Variational autoencoder specifically designed for scATAC-seq peak data
- Learns low-dimensional representations of chromatin accessibility
- Performs batch correction across samples
- Enables differential accessibility testing
- Integrates multiple ATAC-seq datasets
**When to Use**:
- Analyzing scATAC-seq peak count matrices
- Integrating multiple ATAC-seq experiments
- Batch correction of chromatin accessibility data
- Dimensionality reduction for ATAC-seq
- Differential accessibility analysis between cell types or conditions
**Data Requirements**:
- Peak count matrix (cells × peaks)
- Binary or count data for peak accessibility
- Batch/sample annotations (optional, for batch correction)
**Basic Usage**:
```python
import scvi
# Prepare data (peaks should be in adata.X)
# Optional: filter peaks
sc.pp.filter_genes(adata, min_cells=3)
# Setup data
scvi.model.PEAKVI.setup_anndata(
adata,
batch_key="batch"
)
# Train model
model = scvi.model.PEAKVI(adata)
model.train()
# Get latent representation (batch-corrected)
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent
# Differential accessibility
da_results = model.differential_accessibility(
groupby="cell_type",
group1="TypeA",
group2="TypeB"
)
```
**Key Parameters**:
- `n_latent`: Dimensionality of latent space (default: 10)
- `n_hidden`: Number of nodes per hidden layer (default: 128)
- `n_layers`: Number of hidden layers (default: 1)
- `region_factors`: Whether to learn region-specific factors (default: True)
- `latent_distribution`: Distribution for latent space ("normal" or "ln")
**Outputs**:
- `get_latent_representation()`: Low-dimensional embeddings for cells
- `get_accessibility_estimates()`: Normalized accessibility values
- `differential_accessibility()`: Statistical testing for differential peaks
- `get_region_factors()`: Peak-specific scaling factors
**Best Practices**:
1. Filter out low-quality peaks (present in very few cells)
2. Include batch information if integrating multiple samples
3. Use latent representations for clustering and UMAP visualization
4. Consider using `region_factors=True` for datasets with high technical variation
5. Store latent embeddings in `adata.obsm` for downstream analysis with scanpy
## PoissonVI
**Purpose**: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts).
**Key Features**:
- Models fragment counts directly (not just peak presence/absence)
- Poisson distribution for count data
- Captures quantitative differences in accessibility
- Enables fine-grained analysis of chromatin state
**When to Use**:
- Analyzing fragment-level ATAC-seq data
- Need quantitative accessibility measurements
- Higher resolution analysis than binary peak calls
- Investigating gradual changes in chromatin accessibility
**Data Requirements**:
- Fragment count matrix (cells × genomic regions)
- Count data (not binary)
**Basic Usage**:
```python
scvi.model.POISSONVI.setup_anndata(
adata,
batch_key="batch"
)
model = scvi.model.POISSONVI(adata)
model.train()
# Get results
latent = model.get_latent_representation()
accessibility = model.get_accessibility_estimates()
```
**Key Differences from PeakVI**:
- **PeakVI**: Best for standard peak count matrices, faster
- **PoissonVI**: Best for quantitative fragment counts, more detailed
**When to Choose PoissonVI over PeakVI**:
- Working with fragment counts rather than called peaks
- Need to capture quantitative differences
- Have high-quality, high-coverage data
- Interested in subtle accessibility changes
## scBasset
**Purpose**: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.
**Key Features**:
- Convolutional neural network (CNN) architecture for sequence-based analysis
- Models raw DNA sequences, not just peak counts
- Enables motif discovery and transcription factor (TF) binding prediction
- Provides interpretable feature importance
- Performs batch correction
**When to Use**:
- Want to incorporate DNA sequence information
- Interested in TF motif analysis
- Need interpretable models (which sequences drive accessibility)
- Analyzing regulatory elements and TF binding sites
- Predicting accessibility from sequence alone
**Data Requirements**:
- Peak sequences (extracted from genome)
- Peak accessibility matrix
- Genome reference (for sequence extraction)
**Basic Usage**:
```python
# scBasset requires sequence information
# First, extract sequences for peaks
from scbasset import utils
sequences = utils.fetch_sequences(adata, genome="hg38")
# Setup and train
scvi.model.SCBASSET.setup_anndata(
adata,
batch_key="batch"
)
model = scvi.model.SCBASSET(adata, sequences=sequences)
model.train()
# Get latent representation
latent = model.get_latent_representation()
# Interpret model: which sequences/motifs are important
importance_scores = model.get_feature_importance()
```
**Key Parameters**:
- `n_latent`: Latent space dimensionality
- `conv_layers`: Number of convolutional layers
- `n_filters`: Number of filters per conv layer
- `filter_size`: Size of convolutional filters
**Advanced Features**:
- **In silico mutagenesis**: Predict how sequence changes affect accessibility
- **Motif enrichment**: Identify enriched TF motifs in accessible regions
- **Batch correction**: Similar to other scvi-tools models
- **Transfer learning**: Fine-tune on new datasets
**Interpretability Tools**:
```python
# Get importance scores for sequences
importance = model.get_sequence_importance(region_indices=[0, 1, 2])
# Predict accessibility for new sequences
predictions = model.predict_accessibility(new_sequences)
```
## Model Selection for ATAC-seq
### PeakVI
**Choose when**:
- Standard scATAC-seq analysis workflow
- Have peak count matrices (most common format)
- Need fast, efficient batch correction
- Want straightforward differential accessibility
- Prioritize computational efficiency
**Advantages**:
- Fast training and inference
- Proven track record for scATAC-seq
- Easy integration with scanpy workflow
- Robust batch correction
### PoissonVI
**Choose when**:
- Have fragment-level count data
- Need quantitative accessibility measures
- Interested in subtle differences
- Have high-coverage, high-quality data
**Advantages**:
- More detailed quantitative information
- Better for gradient changes
- Appropriate statistical model for counts
### scBasset
**Choose when**:
- Want to incorporate DNA sequence
- Need biological interpretation (motifs, TFs)
- Interested in regulatory mechanisms
- Have computational resources for CNN training
- Want predictive power for new sequences
**Advantages**:
- Sequence-based, biologically interpretable
- Motif and TF analysis built-in
- Predictive modeling capabilities
- In silico perturbation experiments
## Workflow Example: Complete ATAC-seq Analysis
```python
import scvi
import scanpy as sc
# 1. Load and preprocess ATAC-seq data
adata = sc.read_h5ad("atac_data.h5ad")
# 2. Filter low-quality peaks
sc.pp.filter_genes(adata, min_cells=10)
# 3. Setup and train PeakVI
scvi.model.PEAKVI.setup_anndata(
adata,
batch_key="sample"
)
model = scvi.model.PEAKVI(adata, n_latent=20)
model.train(max_epochs=400)
# 4. Extract latent representation
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent
# 5. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_PeakVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")
# 6. Differential accessibility
da_results = model.differential_accessibility(
groupby="clusters",
group1="0",
group2="1"
)
# 7. Save model
model.save("peakvi_model")
```
## Integration with Gene Expression (RNA+ATAC)
For paired multimodal data (RNA+ATAC from same cells), use **MultiVI** instead:
```python
# For 10x Multiome or similar paired data
scvi.model.MULTIVI.setup_anndata(
adata,
batch_key="sample",
modality_key="modality" # "RNA" or "ATAC"
)
model = scvi.model.MULTIVI(adata)
model.train()
# Get joint latent space
latent = model.get_latent_representation()
```
See `models-multimodal.md` for more details on multimodal integration.
## Best Practices for ATAC-seq Analysis
1. **Quality Control**:
- Filter cells with very low or very high peak counts
- Remove peaks present in very few cells
- Filter mitochondrial and sex chromosome peaks if needed
2. **Batch Correction**:
- Always include `batch_key` if integrating multiple samples
- Consider technical covariates (sequencing depth, TSS enrichment)
3. **Feature Selection**:
- Unlike RNA-seq, all peaks are often used
- Consider filtering very rare peaks for efficiency
4. **Latent Dimensions**:
- Start with `n_latent=10-30` depending on dataset complexity
- Larger values for more heterogeneous datasets
5. **Downstream Analysis**:
- Use latent representations for clustering and visualization
- Link peaks to genes for regulatory analysis
- Perform motif enrichment on cluster-specific peaks
6. **Computational Considerations**:
- ATAC-seq matrices are often very large (many peaks)
- Consider downsampling peaks for initial exploration
- Use GPU acceleration for large datasets