582 lines
14 KiB
Markdown
582 lines
14 KiB
Markdown
# Differential Expression Analysis in scvi-tools
|
||
|
||
This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework.
|
||
|
||
## Overview
|
||
|
||
scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods:
|
||
|
||
- **Batch correction**: DE testing on batch-corrected representations
|
||
- **Uncertainty quantification**: Probabilistic estimates of effect sizes
|
||
- **Zero-inflation handling**: Proper modeling of dropout and zeros
|
||
- **Flexible comparisons**: Between any groups or cell types
|
||
- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI)
|
||
|
||
## Core Statistical Framework
|
||
|
||
### Problem Definition
|
||
|
||
The goal is to estimate the log fold-change in expression between two conditions:
|
||
|
||
```
|
||
log fold-change = log(μ_B) - log(μ_A)
|
||
```
|
||
|
||
Where μ_A and μ_B are the mean expression levels in conditions A and B.
|
||
|
||
### Three-Stage Process
|
||
|
||
**Stage 1: Estimating Expression Levels**
|
||
- Sample from posterior distribution of cellular states
|
||
- Generate expression values from the learned generative model
|
||
- Aggregate across cells to get population-level estimates
|
||
|
||
**Stage 2: Detecting Relevant Features (Hypothesis Testing)**
|
||
- Test for differential expression using Bayesian framework
|
||
- Two testing modes available:
|
||
- **"vanilla" mode**: Point null hypothesis (β = 0)
|
||
- **"change" mode**: Composite hypothesis (|β| ≤ δ)
|
||
|
||
**Stage 3: Controlling False Discovery**
|
||
- Posterior expected False Discovery Proportion (FDP) control
|
||
- Selects maximum number of discoveries ensuring E[FDP] ≤ α
|
||
|
||
## Basic Usage
|
||
|
||
### Simple Two-Group Comparison
|
||
|
||
```python
|
||
import scvi
|
||
|
||
# After training a model
|
||
model = scvi.model.SCVI(adata)
|
||
model.train()
|
||
|
||
# Compare two cell types
|
||
de_results = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells"
|
||
)
|
||
|
||
# View top DE genes
|
||
top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20)
|
||
print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]])
|
||
```
|
||
|
||
### One vs. Rest Comparison
|
||
|
||
```python
|
||
# Compare one group against all others
|
||
de_results = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells" # No group2 = compare to rest
|
||
)
|
||
```
|
||
|
||
### All Pairwise Comparisons
|
||
|
||
```python
|
||
# Compare all cell types pairwise
|
||
all_comparisons = {}
|
||
|
||
cell_types = adata.obs["cell_type"].unique()
|
||
|
||
for ct1 in cell_types:
|
||
for ct2 in cell_types:
|
||
if ct1 != ct2:
|
||
key = f"{ct1}_vs_{ct2}"
|
||
all_comparisons[key] = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1=ct1,
|
||
group2=ct2
|
||
)
|
||
```
|
||
|
||
## Key Parameters
|
||
|
||
### `groupby` (required)
|
||
Column in `adata.obs` defining groups to compare.
|
||
|
||
```python
|
||
# Must be a categorical variable
|
||
de_results = model.differential_expression(groupby="cell_type")
|
||
```
|
||
|
||
### `group1` and `group2`
|
||
Groups to compare. If `group2` is None, compares `group1` to all others.
|
||
|
||
```python
|
||
# Specific comparison
|
||
de = model.differential_expression(groupby="condition", group1="treated", group2="control")
|
||
|
||
# One vs rest
|
||
de = model.differential_expression(groupby="cell_type", group1="T cells")
|
||
```
|
||
|
||
### `mode` (Hypothesis Testing Mode)
|
||
|
||
**"vanilla" mode** (default): Point null hypothesis
|
||
- Tests if β = 0 exactly
|
||
- More sensitive, but may find trivially small effects
|
||
|
||
**"change" mode**: Composite null hypothesis
|
||
- Tests if |β| ≤ δ
|
||
- Requires biologically meaningful change
|
||
- Reduces false discoveries of tiny effects
|
||
|
||
```python
|
||
# Change mode with minimum effect size
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells",
|
||
mode="change",
|
||
delta=0.25 # Minimum log fold-change
|
||
)
|
||
```
|
||
|
||
### `delta`
|
||
Minimum effect size threshold for "change" mode.
|
||
- Typical values: 0.25, 0.5, 0.7 (log scale)
|
||
- log2(1.5) ≈ 0.58 (1.5-fold change)
|
||
- log2(2) = 1.0 (2-fold change)
|
||
|
||
```python
|
||
# Require at least 1.5-fold change
|
||
de = model.differential_expression(
|
||
groupby="condition",
|
||
group1="disease",
|
||
group2="healthy",
|
||
mode="change",
|
||
delta=0.58 # log2(1.5)
|
||
)
|
||
```
|
||
|
||
### `fdr_target`
|
||
False discovery rate threshold (default: 0.05)
|
||
|
||
```python
|
||
# More stringent FDR control
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
fdr_target=0.01
|
||
)
|
||
```
|
||
|
||
### `batch_correction`
|
||
Whether to perform batch correction during DE testing (default: True)
|
||
|
||
```python
|
||
# Test within a specific batch
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells",
|
||
batch_correction=False
|
||
)
|
||
```
|
||
|
||
### `n_samples`
|
||
Number of posterior samples for estimation (default: 5000)
|
||
- More samples = more accurate but slower
|
||
- Reduce for speed, increase for precision
|
||
|
||
```python
|
||
# High precision analysis
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
n_samples=10000
|
||
)
|
||
```
|
||
|
||
## Interpreting Results
|
||
|
||
### Output Columns
|
||
|
||
The results DataFrame contains several important columns:
|
||
|
||
**Effect Size Estimates**:
|
||
- `lfc_mean`: Mean log fold-change
|
||
- `lfc_median`: Median log fold-change
|
||
- `lfc_std`: Standard deviation of log fold-change
|
||
- `lfc_min`: Lower bound of effect size
|
||
- `lfc_max`: Upper bound of effect size
|
||
|
||
**Statistical Significance**:
|
||
- `bayes_factor`: Bayes factor for differential expression
|
||
- Higher values = stronger evidence
|
||
- >3 often considered meaningful
|
||
- `is_de_fdr_0.05`: Boolean indicating if gene is DE at FDR 0.05
|
||
- `is_de_fdr_0.1`: Boolean indicating if gene is DE at FDR 0.1
|
||
|
||
**Expression Levels**:
|
||
- `mean1`: Mean expression in group 1
|
||
- `mean2`: Mean expression in group 2
|
||
- `non_zeros_proportion1`: Proportion of non-zero cells in group 1
|
||
- `non_zeros_proportion2`: Proportion of non-zero cells in group 2
|
||
|
||
### Example Interpretation
|
||
|
||
```python
|
||
de_results = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells"
|
||
)
|
||
|
||
# Find significantly upregulated genes in T cells
|
||
upreg_tcells = de_results[
|
||
(de_results["is_de_fdr_0.05"]) &
|
||
(de_results["lfc_mean"] > 0)
|
||
].sort_values("lfc_mean", ascending=False)
|
||
|
||
print(f"Upregulated genes in T cells: {len(upreg_tcells)}")
|
||
print(upreg_tcells.head(10))
|
||
|
||
# Find genes with large effect sizes
|
||
large_effect = de_results[
|
||
(de_results["is_de_fdr_0.05"]) &
|
||
(abs(de_results["lfc_mean"]) > 1) # 2-fold change
|
||
]
|
||
```
|
||
|
||
## Advanced Usage
|
||
|
||
### DE Within Specific Cells
|
||
|
||
```python
|
||
# Test DE only within a subset of cells
|
||
subset_indices = adata.obs["tissue"] == "lung"
|
||
|
||
de = model.differential_expression(
|
||
idx1=adata.obs["cell_type"] == "T cells" & subset_indices,
|
||
idx2=adata.obs["cell_type"] == "B cells" & subset_indices
|
||
)
|
||
```
|
||
|
||
### Batch-Specific DE
|
||
|
||
```python
|
||
# Test DE within each batch separately
|
||
batches = adata.obs["batch"].unique()
|
||
|
||
batch_de_results = {}
|
||
for batch in batches:
|
||
batch_idx = adata.obs["batch"] == batch
|
||
batch_de_results[batch] = model.differential_expression(
|
||
idx1=(adata.obs["condition"] == "treated") & batch_idx,
|
||
idx2=(adata.obs["condition"] == "control") & batch_idx
|
||
)
|
||
```
|
||
|
||
### Pseudo-bulk DE
|
||
|
||
```python
|
||
# Aggregate cells before DE testing
|
||
# Useful for low cell counts per group
|
||
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="rare_cell_type",
|
||
group2="common_cell_type",
|
||
n_samples=10000, # More samples for stability
|
||
batch_correction=True
|
||
)
|
||
```
|
||
|
||
## Visualization
|
||
|
||
### Volcano Plot
|
||
|
||
```python
|
||
import matplotlib.pyplot as plt
|
||
import numpy as np
|
||
|
||
de = model.differential_expression(
|
||
groupby="condition",
|
||
group1="treated",
|
||
group2="control"
|
||
)
|
||
|
||
# Volcano plot
|
||
plt.figure(figsize=(10, 6))
|
||
plt.scatter(
|
||
de["lfc_mean"],
|
||
-np.log10(1 / (de["bayes_factor"] + 1)),
|
||
c=de["is_de_fdr_0.05"],
|
||
cmap="coolwarm",
|
||
alpha=0.5
|
||
)
|
||
plt.xlabel("Log Fold Change")
|
||
plt.ylabel("-log10(1/Bayes Factor)")
|
||
plt.title("Volcano Plot: Treated vs Control")
|
||
plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
|
||
plt.show()
|
||
```
|
||
|
||
### Heatmap of Top DE Genes
|
||
|
||
```python
|
||
import seaborn as sns
|
||
|
||
# Get top DE genes
|
||
top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index
|
||
|
||
# Get normalized expression
|
||
norm_expr = model.get_normalized_expression(
|
||
adata,
|
||
indices=adata.obs["condition"].isin(["treated", "control"]),
|
||
gene_list=top_genes
|
||
)
|
||
|
||
# Plot heatmap
|
||
plt.figure(figsize=(12, 10))
|
||
sns.heatmap(
|
||
norm_expr.T,
|
||
cmap="viridis",
|
||
xticklabels=False,
|
||
yticklabels=top_genes
|
||
)
|
||
plt.title("Top 50 DE Genes")
|
||
plt.show()
|
||
```
|
||
|
||
### Ranked Gene Plot
|
||
|
||
```python
|
||
# Plot genes ranked by effect size
|
||
de_sorted = de.sort_values("lfc_mean", ascending=False)
|
||
|
||
plt.figure(figsize=(12, 6))
|
||
plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values)
|
||
plt.axhline(y=0, color='r', linestyle='--')
|
||
plt.xlabel("Gene Rank")
|
||
plt.ylabel("Log Fold Change")
|
||
plt.title("Genes Ranked by Effect Size")
|
||
plt.show()
|
||
```
|
||
|
||
## Comparison with Traditional Methods
|
||
|
||
### scvi-tools vs. Wilcoxon Test
|
||
|
||
```python
|
||
import scanpy as sc
|
||
|
||
# Traditional Wilcoxon test
|
||
sc.tl.rank_genes_groups(
|
||
adata,
|
||
groupby="cell_type",
|
||
method="wilcoxon",
|
||
key_added="wilcoxon"
|
||
)
|
||
|
||
# scvi-tools DE
|
||
de_scvi = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells"
|
||
)
|
||
|
||
# Compare results
|
||
wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon")
|
||
```
|
||
|
||
**Advantages of scvi-tools**:
|
||
- Accounts for batch effects automatically
|
||
- Handles zero-inflation properly
|
||
- Provides uncertainty quantification
|
||
- No arbitrary pseudocount needed
|
||
- Better statistical properties
|
||
|
||
**When to use Wilcoxon**:
|
||
- Very quick exploratory analysis
|
||
- Comparison with published results using Wilcoxon
|
||
|
||
## Multi-Modal DE
|
||
|
||
### Protein DE (totalVI)
|
||
|
||
```python
|
||
# Train totalVI on CITE-seq data
|
||
totalvi_model = scvi.model.TOTALVI(adata)
|
||
totalvi_model.train()
|
||
|
||
# RNA differential expression
|
||
rna_de = totalvi_model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells",
|
||
protein_expression=False # Default
|
||
)
|
||
|
||
# Protein differential expression
|
||
protein_de = totalvi_model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells",
|
||
protein_expression=True
|
||
)
|
||
|
||
print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}")
|
||
print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}")
|
||
```
|
||
|
||
### Differential Accessibility (PeakVI)
|
||
|
||
```python
|
||
# Train PeakVI on ATAC-seq data
|
||
peakvi_model = scvi.model.PEAKVI(atac_adata)
|
||
peakvi_model.train()
|
||
|
||
# Differential accessibility
|
||
da = peakvi_model.differential_accessibility(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
group2="B cells"
|
||
)
|
||
|
||
# Same interpretation as DE
|
||
```
|
||
|
||
## Handling Special Cases
|
||
|
||
### Low Cell Count Groups
|
||
|
||
```python
|
||
# Increase posterior samples for stability
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="rare_type", # e.g., 50 cells
|
||
group2="common_type", # e.g., 5000 cells
|
||
n_samples=10000
|
||
)
|
||
```
|
||
|
||
### Imbalanced Comparisons
|
||
|
||
```python
|
||
# When groups have very different sizes
|
||
# Use change mode to avoid tiny effects
|
||
|
||
de = model.differential_expression(
|
||
groupby="condition",
|
||
group1="rare_condition",
|
||
group2="common_condition",
|
||
mode="change",
|
||
delta=0.5
|
||
)
|
||
```
|
||
|
||
### Multiple Testing Correction
|
||
|
||
```python
|
||
# Already included via FDP control
|
||
# But can apply additional corrections
|
||
|
||
from statsmodels.stats.multitest import multipletests
|
||
|
||
# Bonferroni correction (very conservative)
|
||
_, pvals_corrected, _, _ = multipletests(
|
||
1 / (de["bayes_factor"] + 1),
|
||
method="bonferroni"
|
||
)
|
||
```
|
||
|
||
## Performance Considerations
|
||
|
||
### Speed Optimization
|
||
|
||
```python
|
||
# Faster DE testing for large datasets
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="T cells",
|
||
n_samples=1000, # Reduce samples
|
||
batch_size=512 # Increase batch size
|
||
)
|
||
```
|
||
|
||
### Memory Management
|
||
|
||
```python
|
||
# For very large datasets
|
||
# Test one comparison at a time rather than all pairwise
|
||
|
||
cell_types = adata.obs["cell_type"].unique()
|
||
for ct in cell_types:
|
||
de = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1=ct
|
||
)
|
||
# Save results
|
||
de.to_csv(f"de_results_{ct}.csv")
|
||
```
|
||
|
||
## Best Practices
|
||
|
||
1. **Use "change" mode**: For biologically interpretable results
|
||
2. **Set appropriate delta**: Based on biological significance
|
||
3. **Check expression levels**: Filter lowly expressed genes
|
||
4. **Validate findings**: Check marker genes for sanity
|
||
5. **Visualize results**: Always plot top DE genes
|
||
6. **Report parameters**: Document mode, delta, FDR used
|
||
7. **Consider batch effects**: Use batch_correction=True
|
||
8. **Multiple comparisons**: Be aware of testing many groups
|
||
9. **Sample size**: Ensure sufficient cells per group (>50 recommended)
|
||
10. **Biological validation**: Follow up with functional experiments
|
||
|
||
## Example: Complete DE Analysis Workflow
|
||
|
||
```python
|
||
import scvi
|
||
import scanpy as sc
|
||
import matplotlib.pyplot as plt
|
||
|
||
# 1. Train model
|
||
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
|
||
model = scvi.model.SCVI(adata)
|
||
model.train()
|
||
|
||
# 2. Perform DE analysis
|
||
de_results = model.differential_expression(
|
||
groupby="cell_type",
|
||
group1="Disease_T_cells",
|
||
group2="Healthy_T_cells",
|
||
mode="change",
|
||
delta=0.5,
|
||
fdr_target=0.05
|
||
)
|
||
|
||
# 3. Filter and analyze
|
||
sig_genes = de_results[de_results["is_de_fdr_0.05"]]
|
||
upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False)
|
||
downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean")
|
||
|
||
print(f"Significant genes: {len(sig_genes)}")
|
||
print(f"Upregulated: {len(upreg)}")
|
||
print(f"Downregulated: {len(downreg)}")
|
||
|
||
# 4. Visualize top genes
|
||
top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist()
|
||
|
||
sc.pl.violin(
|
||
adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])],
|
||
keys=top_genes,
|
||
groupby="cell_type",
|
||
rotation=90
|
||
)
|
||
|
||
# 5. Functional enrichment (using external tools)
|
||
# E.g., g:Profiler, DAVID, or gprofiler-official Python package
|
||
upreg_genes = upreg.head(100).index.tolist()
|
||
# Perform pathway analysis...
|
||
|
||
# 6. Save results
|
||
de_results.to_csv("de_results_disease_vs_healthy.csv")
|
||
upreg.to_csv("upregulated_genes.csv")
|
||
downreg.to_csv("downregulated_genes.csv")
|
||
```
|