# ATAC-seq and Chromatin Accessibility Models

This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.

## PeakVI

**Purpose**: Analysis and integration of single-cell ATAC-seq data using peak counts.

**Key Features**:
- Variational autoencoder specifically designed for scATAC-seq peak data
- Learns low-dimensional representations of chromatin accessibility
- Performs batch correction across samples
- Enables differential accessibility testing
- Integrates multiple ATAC-seq datasets

**When to Use**:
- Analyzing scATAC-seq peak count matrices
- Integrating multiple ATAC-seq experiments
- Batch correction of chromatin accessibility data
- Dimensionality reduction for ATAC-seq
- Differential accessibility analysis between cell types or conditions

**Data Requirements**:
- Peak count matrix (cells × peaks)
- Binary or count data for peak accessibility (PeakVI models accessibility as binary, so any nonzero count is treated as "accessible")
- Batch/sample annotations (optional, for batch correction)
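PeakVI's reconstruction loss is understood to compare predictions against binarized data, so a count of 5 and a count of 1 carry the same signal. The dependency-free sketch below shows the equivalent binarization on a toy cells × peaks matrix. The helper name `binarize` and the matrix values are made up for illustration and are not part of scvi-tools.

```python
def binarize(counts):
    """Map every nonzero count to 1 and every zero to 0."""
    return [[1 if value > 0 else 0 for value in row] for row in counts]


# Toy cells x peaks matrix (3 cells, 4 peaks); values are illustrative only.
counts = [
    [0, 2, 1, 0],
    [4, 0, 0, 0],
    [0, 0, 3, 1],
]

binary = binarize(counts)
for row in binary:
    print(row)
```

If quantitative differences between nonzero counts matter for your question, that is a hint to look at PoissonVI instead.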

**Basic Usage**:
```python
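import scanpy as sc
import scvi

# Prepare data: adata is an AnnData object with peaks in adata.X (cells × peaks)
# Optional: drop rare peaks (filter_genes filters features, which are peaks here)
sc.pp.filter_genes(adata, min_cells=3)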

# Setup data
scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key="batch"
)

# Train model
model = scvi.model.PEAKVI(adata)
model.train()

# Get latent representation (batch-corrected)
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent

# Differential accessibility
da_results = model.differential_accessibility(
    groupby="cell_type",
    group1="TypeA",
    group2="TypeB"
)
```

**Key Parameters**:
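- `n_latent`: Dimensionality of latent space (default: `None`, which derives it as the square root of `n_hidden`)
- `n_hidden`: Number of nodes per hidden layer (default: `None`, which derives it as the square root of the number of regions)
- `n_layers_encoder` / `n_layers_decoder`: Number of hidden layers in the encoder and decoder (default: 2 each)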
- `region_factors`: Whether to learn region-specific factors (default: True)
- `latent_distribution`: Distribution for latent space ("normal" or "ln")
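When `n_hidden` and `n_latent` are left unset, PeakVI is understood to derive them from the number of regions with a square-root rule. The sketch below reproduces that arithmetic so the resulting layer sizes can be estimated before training. `peakvi_default_sizes` is a hypothetical helper, and the rule itself should be checked against the installed scvi-tools version.

```python
import math


def peakvi_default_sizes(n_regions):
    """Estimate (n_hidden, n_latent) under the assumed square-root rule."""
    n_hidden = math.isqrt(n_regions)  # assumed: floor(sqrt(number of regions))
    n_latent = math.isqrt(n_hidden)   # assumed: floor(sqrt(n_hidden))
    return n_hidden, n_latent


for n_regions in (10_000, 100_000, 250_000):
    n_hidden, n_latent = peakvi_default_sizes(n_regions)
    print(f"{n_regions} regions -> n_hidden={n_hidden}, n_latent={n_latent}")
```

Passing explicit values (for example `n_latent=20`) overrides the derived sizes.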

**Outputs**:
- `get_latent_representation()`: Low-dimensional embeddings for cells
- `get_accessibility_estimates()`: Normalized accessibility values
- `differential_accessibility()`: Statistical testing for differential peaks
- `get_region_factors()`: Peak-specific scaling factors

**Best Practices**:
1. Filter out low-quality peaks (present in very few cells)
2. Include batch information if integrating multiple samples
3. Use latent representations for clustering and UMAP visualization
4. Keep `region_factors=True` (the default) for datasets with high technical variation across peaks
5. Store latent embeddings in `adata.obsm` for downstream analysis with scanpy

## PoissonVI

**Purpose**: Quantitative analysis of scATAC-seq fragment counts, modeling how many fragments fall in each region rather than only whether a region is accessible.

**Key Features**:
- Models fragment counts directly (not just peak presence/absence)
- Poisson distribution for count data
- Captures quantitative differences in accessibility
- Enables fine-grained analysis of chromatin state

**When to Use**:
- Analyzing fragment-level ATAC-seq data
- Need quantitative accessibility measurements
- Higher resolution analysis than binary peak calls
- Investigating gradual changes in chromatin accessibility

**Data Requirements**:
- Fragment count matrix (cells × genomic regions)
- Count data (not binary)
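Many pipelines report read counts (two reads per paired-end fragment) rather than fragment counts. One commonly described approximation rounds odd read counts up to the next even number and halves them, which is `ceil(reads / 2)`. The sketch below applies that rule to a toy matrix. `approx_fragments_from_reads` is a hypothetical name; scvi-tools is understood to provide a comparable helper (`scvi.data.reads_to_fragments`), which should be verified in your installed version.

```python
def approx_fragments_from_reads(read_counts):
    """Approximate fragment counts as ceil(reads / 2) using integer arithmetic."""
    return [[(reads + 1) // 2 for reads in row] for row in read_counts]


# Toy cells x regions read-count matrix; values are illustrative only.
reads = [
    [0, 1, 2, 3],
    [4, 5, 0, 6],
]

fragments = approx_fragments_from_reads(reads)
for row in fragments:
    print(row)
```

If your matrix already holds fragment counts, skip this step.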

**Basic Usage**:
```python
import scvi

# PoissonVI is exposed under scvi.external (not scvi.model) in scvi-tools
scvi.external.POISSONVI.setup_anndata(
    adata,
    batch_key="batch"
)

model = scvi.external.POISSONVI(adata)
model.train()

# Get results
latent = model.get_latent_representation()
accessibility = model.get_accessibility_estimates()
```

**Key Differences from PeakVI**:
- **PeakVI**: Best for standard peak count matrices; treats each peak as accessible or not
- **PoissonVI**: Best for quantitative fragment counts; a Poisson likelihood retains how many fragments were observed
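One way to see how the two views relate: if fragment counts in a region follow a Poisson distribution with rate λ, the probability of observing at least one fragment is 1 - exp(-λ). A binarized model only sees that event, while a count model also distinguishes 1 fragment from 3. The sketch below is plain arithmetic under that Poisson assumption, not code taken from either model.

```python
import math


def poisson_log_pmf(x, lam):
    """Log-probability of observing x fragments at Poisson rate lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def prob_accessible(lam):
    """Probability of at least one fragment under the same Poisson rate."""
    return 1.0 - math.exp(-lam)


for lam in (0.1, 1.0, 3.0):
    p_open = prob_accessible(lam)
    p_one = math.exp(poisson_log_pmf(1, lam))
    p_three = math.exp(poisson_log_pmf(3, lam))
    print(f"rate={lam:.1f}  P(count>0)={p_open:.3f}  P(1)={p_one:.3f}  P(3)={p_three:.3f}")
```

At low rates nearly all nonzero counts are 1, so binarization loses little; at higher rates the count distribution carries information that the binary view discards.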

**When to Choose PoissonVI over PeakVI**:
- Working with fragment counts rather than called peaks
- Need to capture quantitative differences
- Have high-quality, high-coverage data
- Interested in subtle accessibility changes

## scBasset

**Purpose**: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.

**Key Features**:
- Convolutional neural network (CNN) architecture for sequence-based analysis
- Models raw DNA sequences, not just peak counts
- Enables motif discovery and transcription factor (TF) binding prediction
- Provides interpretable feature importance
- Performs batch correction

**When to Use**:
- Want to incorporate DNA sequence information
- Interested in TF motif analysis
- Need interpretable models (which sequences drive accessibility)
- Analyzing regulatory elements and TF binding sites
- Predicting accessibility from sequence alone

**Data Requirements**:
- Peak sequences (extracted from genome)
- Peak accessibility matrix
- Genome reference (for sequence extraction)
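Sequence-based CNNs such as scBasset consume each peak's DNA sequence in a numeric encoding, typically one channel per nucleotide. The sketch below shows a minimal one-hot encoder. The A, C, G, T channel order and the all-zero handling of ambiguous bases are assumptions made for this example, not a statement about scBasset's internal encoding.

```python
BASES = "ACGT"  # channel order is an assumption for this sketch


def one_hot_encode(sequence):
    """Return a len(sequence) x 4 matrix; unknown bases (e.g. N) become all zeros."""
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        encoded.append(row)
    return encoded


peak_sequence = "ACGTN"
for base, row in zip(peak_sequence, one_hot_encode(peak_sequence)):
    print(base, row)
```

In practice the library handles sequence extraction and encoding, so a hand-written encoder like this is mainly useful for understanding the input format.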

**Basic Usage**:
```python
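import scvi

# Sketch of the scvi-tools scBasset workflow; verify names against your installed version.
# 1. Attach the DNA sequence of each peak (expects chr/start/end columns in adata.var)
scvi.data.add_dna_sequence(
    adata,
    genome_name="hg38",
    genome_dir="data",
    chr_var_key="chr",
    start_var_key="start",
    end_var_key="end",
)

# 2. scBasset treats regions as observations, so transpose to regions × cells
bdata = adata.transpose()
bdata.layers["binary"] = (bdata.X.copy() > 0).astype(float)

# 3. Setup and train (after transposing, per-cell annotations such as "batch" live in bdata.var)
scvi.external.SCBASSET.setup_anndata(
    bdata,
    layer="binary",
    dna_code_key="dna_code",
    batch_key="batch",
)
model = scvi.external.SCBASSET(bdata)
model.train()

# 4. Get latent representation (one row per cell)
latent = model.get_latent_representation()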
```

**Key Parameters**:
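- `n_bottleneck_layer`: Size of the bottleneck layer, which sets the cell embedding dimensionality (default: 32)
- `l2_reg_cell_embedding`: L2 regularization strength on the cell embeddings (default: 0.0)
- Convolutional tower settings (number of filters, number of repeated blocks): passed through to the underlying module as extra keyword arguments; check the API reference for exact names and defaults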

**Advanced Features**:
- **In silico mutagenesis**: Predict how sequence changes affect accessibility
- **Motif enrichment**: Identify enriched TF motifs in accessible regions
- **Batch correction**: Similar to other scvi-tools models
- **Transfer learning**: Fine-tune on new datasets
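In silico mutagenesis can be sketched independently of any particular model: substitute every base at every position, re-score the sequence, and record the change. In the example below, `toy_score` is a stand-in for a trained sequence-to-accessibility model and simply counts a made-up motif. Both it and `in_silico_mutagenesis` are hypothetical helpers, not scvi-tools functions.

```python
BASES = "ACGT"


def toy_score(sequence):
    """Stand-in for a trained model: counts occurrences of an illustrative motif."""
    return sequence.count("GATA")


def in_silico_mutagenesis(sequence, predict):
    """Return {(position, alt_base): score change} for every single-base substitution."""
    reference_score = predict(sequence)
    effects = {}
    for position, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue
            mutated = sequence[:position] + alt_base + sequence[position + 1:]
            effects[(position, alt_base)] = predict(mutated) - reference_score
    return effects


effects = in_silico_mutagenesis("CCGATACC", toy_score)
# Positions where at least one substitution lowers the score.
disruptive_positions = sorted({pos for (pos, _), delta in effects.items() if delta < 0})
print(disruptive_positions)
```

With a real model, `predict` would wrap a forward pass on the encoded sequence, and the positions with the largest score changes would point to candidate regulatory bases.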

**Interpretability Tools**:
```python
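# Method names follow the scvi-tools SCBASSET API; verify them in your installed version.
# Per-cell TF activity estimated by inserting the TF's motif into background sequences
tf_activity = model.get_tf_activity(tf="SPI1", genome="hg38", motif_dir="data/motifs")

# Per-cell bias term learned by the model (tracks sequencing depth)
cell_bias = model.get_cell_bias()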
```

## Model Selection for ATAC-seq

### PeakVI
**Choose when**:
- Standard scATAC-seq analysis workflow
- Have peak count matrices (most common format)
- Need fast, efficient batch correction
- Want straightforward differential accessibility
- Prioritize computational efficiency

**Advantages**:
- Fast training and inference
- Proven track record for scATAC-seq
- Easy integration with scanpy workflow
- Robust batch correction

### PoissonVI
**Choose when**:
- Have fragment-level count data
- Need quantitative accessibility measures
- Interested in subtle differences
- Have high-coverage, high-quality data

**Advantages**:
- More detailed quantitative information
- Better for gradient changes
- Appropriate statistical model for counts

### scBasset
**Choose when**:
- Want to incorporate DNA sequence
- Need biological interpretation (motifs, TFs)
- Interested in regulatory mechanisms
- Have computational resources for CNN training
- Want predictive power for new sequences

**Advantages**:
- Sequence-based, biologically interpretable
- Motif and TF analysis built-in
- Predictive modeling capabilities
- In silico perturbation experiments
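The selection criteria above can be condensed into a small decision helper. This is only a sketch of the guidance in this section; `choose_atac_model` and its two flags are hypothetical, and real projects will weigh additional factors such as compute budget and data quality.

```python
def choose_atac_model(needs_sequence_interpretation, has_fragment_counts):
    """Return a suggested model name based on the criteria in this section."""
    if needs_sequence_interpretation:
        # Motif / TF questions call for a sequence-based model.
        return "scBasset"
    if has_fragment_counts:
        # Quantitative counts are available, so a count likelihood is appropriate.
        return "PoissonVI"
    # Default: standard peak matrix workflow.
    return "PeakVI"


print(choose_atac_model(needs_sequence_interpretation=False, has_fragment_counts=False))
print(choose_atac_model(needs_sequence_interpretation=False, has_fragment_counts=True))
print(choose_atac_model(needs_sequence_interpretation=True, has_fragment_counts=True))
```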

## Workflow Example: Complete ATAC-seq Analysis

```python
import scvi
import scanpy as sc

# 1. Load and preprocess ATAC-seq data
adata = sc.read_h5ad("atac_data.h5ad")

# 2. Filter low-quality peaks (filter_genes filters features, which are peaks here)
sc.pp.filter_genes(adata, min_cells=10)

# 3. Setup and train PeakVI
scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key="sample"
)

model = scvi.model.PEAKVI(adata, n_latent=20)
model.train(max_epochs=400)

# 4. Extract latent representation
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent

# 5. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_PeakVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")

# 6. Differential accessibility
da_results = model.differential_accessibility(
    groupby="clusters",
    group1="0",
    group2="1"
)

# 7. Save model
model.save("peakvi_model")
```

## Integration with Gene Expression (RNA+ATAC)

For paired multimodal data (RNA and ATAC measured in the same cells), use **MultiVI** instead:

```python
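# For 10x Multiome or similar paired data.
# Features are expected to be ordered genes first, then peaks. adata.var["modality"]
# is assumed here to label each feature as "Gene Expression" or "Peaks".
scvi.model.MULTIVI.setup_anndata(
    adata,
    batch_key="sample"
)

model = scvi.model.MULTIVI(
    adata,
    n_genes=(adata.var["modality"] == "Gene Expression").sum(),
    n_regions=(adata.var["modality"] == "Peaks").sum()
)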
model.train()

# Get joint latent space
latent = model.get_latent_representation()
```

See `models-multimodal.md` for more details on multimodal integration.

## Best Practices for ATAC-seq Analysis

1. **Quality Control**:
   - Filter cells with very low or very high peak counts
   - Remove peaks present in very few cells
   - Filter mitochondrial and sex chromosome peaks if needed

2. **Batch Correction**:
   - Always include `batch_key` if integrating multiple samples
   - Consider registering technical covariates (sequencing depth, TSS enrichment) through `continuous_covariate_keys` in `setup_anndata`

3. **Feature Selection**:
   - Unlike RNA-seq, all peaks are often used
   - Consider filtering very rare peaks for efficiency

4. **Latent Dimensions**:
   - Start with `n_latent` between 10 and 30, depending on dataset complexity
   - Larger values for more heterogeneous datasets

5. **Downstream Analysis**:
   - Use latent representations for clustering and visualization
   - Link peaks to genes for regulatory analysis
   - Perform motif enrichment on cluster-specific peaks

6. **Computational Considerations**:
   - ATAC-seq matrices are often very large (many peaks)
   - Consider downsampling peaks for initial exploration
   - Use GPU acceleration for large datasets
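The quality-control step can be illustrated without any library: drop peaks detected in too few cells, then drop cells whose number of detected peaks falls outside a chosen range. The sketch below uses a toy matrix and arbitrary thresholds. `filter_peaks_and_cells` is a hypothetical helper, and on real data the scanpy filtering functions are the practical choice.

```python
def filter_peaks_and_cells(counts, min_cells_per_peak, min_peaks_per_cell, max_peaks_per_cell):
    """Drop rare peaks first, then cells whose detected-peak count is out of range."""
    n_peaks = len(counts[0])
    # Number of cells in which each peak is detected (count > 0).
    cells_per_peak = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    kept_peaks = [j for j in range(n_peaks) if cells_per_peak[j] >= min_cells_per_peak]

    kept_cells = []
    for i, row in enumerate(counts):
        detected = sum(1 for j in kept_peaks if row[j] > 0)
        if min_peaks_per_cell <= detected <= max_peaks_per_cell:
            kept_cells.append(i)

    filtered = [[counts[i][j] for j in kept_peaks] for i in kept_cells]
    return filtered, kept_cells, kept_peaks


# Toy cells x peaks matrix (4 cells, 4 peaks); values and thresholds are illustrative.
counts = [
    [1, 0, 2, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [3, 1, 1, 1],
]

filtered, kept_cells, kept_peaks = filter_peaks_and_cells(
    counts, min_cells_per_peak=2, min_peaks_per_cell=1, max_peaks_per_cell=3
)
print("kept cells:", kept_cells)
print("kept peaks:", kept_peaks)
```

Filtering peaks before cells means the per-cell thresholds are evaluated on the peaks that will actually be modeled.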