Initial commit
581
skills/scvi-tools/references/differential-expression.md
Normal file
# Differential Expression Analysis in scvi-tools

This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework.

## Overview

scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods:

- **Batch correction**: DE testing on batch-corrected representations
- **Uncertainty quantification**: Probabilistic estimates of effect sizes
- **Zero-inflation handling**: Proper modeling of dropout and zeros
- **Flexible comparisons**: Between any groups or cell types
- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI)

## Core Statistical Framework

### Problem Definition

The goal is to estimate the log fold-change in expression between two conditions, written β in the hypotheses below:

```
β = log fold-change = log(μ_B) - log(μ_A)
```

Where μ_A and μ_B are the mean expression levels in conditions A and B.

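As a small worked example of the definition above, the sketch below computes the log fold-change for made-up mean expression levels. It assumes base-2 logarithms, which matches the log2 values this document quotes for `delta`; the function name `log2_fold_change` is hypothetical and not part of scvi-tools.

```python
import math


def log2_fold_change(mu_a: float, mu_b: float) -> float:
    """Return log2(mu_b) - log2(mu_a) for two positive mean expression levels."""
    return math.log2(mu_b) - math.log2(mu_a)


# Hypothetical mean expression levels for one gene
lfc_doubling = log2_fold_change(mu_a=0.5, mu_b=1.0)  # expression doubles in B
lfc_halving = log2_fold_change(mu_a=1.0, mu_b=0.5)   # expression halves in B
print(lfc_doubling, lfc_halving)
```

A positive value means higher expression in condition B, a negative value means higher expression in condition A.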
### Three-Stage Process

**Stage 1: Estimating Expression Levels**
- Sample from posterior distribution of cellular states
- Generate expression values from the learned generative model
- Aggregate across cells to get population-level estimates

**Stage 2: Detecting Relevant Features (Hypothesis Testing)**
- Test for differential expression using Bayesian framework
- Two testing modes available:
  - **"vanilla" mode**: Point null hypothesis (β = 0)
  - **"change" mode**: Composite hypothesis (|β| ≤ δ)

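The "change" mode test in Stage 2 can be sketched with plain NumPy. This is an illustration under stated assumptions, not the scvi-tools implementation: the log fold-change samples are simulated rather than drawn from a trained model, and the names `lfc_samples`, `proba_de`, and `log_bayes_factor` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated posterior samples of the log fold-change for three genes
# (rows = posterior samples, columns = genes); not real model output.
lfc_samples = rng.normal(loc=[0.0, 0.3, 1.5], scale=0.2, size=(5000, 3))

delta = 0.25  # composite null hypothesis: |beta| <= delta

# Posterior probability that the effect exceeds delta in absolute value
proba_de = (np.abs(lfc_samples) > delta).mean(axis=0)

# Log Bayes factor comparing "DE" against "not DE" (eps avoids log(0))
eps = 1e-8
log_bayes_factor = np.log(proba_de + eps) - np.log(1.0 - proba_de + eps)

print(proba_de.round(3))
print(log_bayes_factor.round(2))
```

The gene centered at zero receives a low probability of differential expression, while the gene centered well beyond `delta` receives a probability close to one.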
**Stage 3: Controlling False Discovery**
- Posterior expected False Discovery Proportion (FDP) control
- Selects maximum number of discoveries ensuring E[FDP] ≤ α

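Stage 3 can be illustrated with a short NumPy sketch: rank genes by their posterior probability of being DE, then keep the largest top-ranked set whose expected FDP stays at or below the target. The function `select_discoveries` is a hypothetical helper written for this example; it mirrors the idea described above and is not the library's own code.

```python
import numpy as np


def select_discoveries(proba_de: np.ndarray, fdr_target: float = 0.05) -> np.ndarray:
    """Flag the largest top-ranked gene set whose expected FDP is <= fdr_target."""
    order = np.argsort(-proba_de)           # most probable DE genes first
    expected_false = 1.0 - proba_de[order]  # posterior probability each call is wrong
    # Expected FDP when calling the top-k genes, for k = 1..n
    cumulative_fdp = np.cumsum(expected_false) / np.arange(1, len(proba_de) + 1)
    passing = np.nonzero(cumulative_fdp <= fdr_target)[0]
    n_selected = passing[-1] + 1 if passing.size else 0
    is_de = np.zeros(len(proba_de), dtype=bool)
    is_de[order[:n_selected]] = True
    return is_de


# Hypothetical posterior DE probabilities for five genes
proba_de = np.array([0.99, 0.2, 0.97, 0.6, 0.95])
is_de = select_discoveries(proba_de, fdr_target=0.05)
print(is_de)
```

Because genes are sorted by decreasing probability, the running expected FDP never decreases, so the last index that passes the threshold gives the maximum number of discoveries.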
## Basic Usage

### Simple Two-Group Comparison

```python
import scvi

# Register the data, then train a model
# (assumes raw counts are stored in adata.X)
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata)
model.train()

# Compare two cell types
de_results = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
)

# View top DE genes
top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20)
print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]])
```

### One vs. Rest Comparison

```python
# Compare one group against all others
de_results = model.differential_expression(
    groupby="cell_type",
    group1="T cells",  # No group2 = compare to rest
)
```

### All Pairwise Comparisons

```python
# Compare all cell types pairwise (both directions of each pair)
all_comparisons = {}

cell_types = adata.obs["cell_type"].unique()

for ct1 in cell_types:
    for ct2 in cell_types:
        if ct1 != ct2:
            key = f"{ct1}_vs_{ct2}"
            all_comparisons[key] = model.differential_expression(
                groupby="cell_type",
                group1=ct1,
                group2=ct2,
            )
```

## Key Parameters

### `groupby` (required)
Column in `adata.obs` defining the groups to compare. It can be omitted only when cells are selected directly with `idx1` and `idx2` (see Advanced Usage).

```python
# The column should be categorical.
# With only groupby set, each group is compared against the rest.
de_results = model.differential_expression(groupby="cell_type")
```

### `group1` and `group2`
Groups to compare. If `group2` is None, compares `group1` to all others.

```python
# Specific comparison
de = model.differential_expression(groupby="condition", group1="treated", group2="control")

# One vs rest
de = model.differential_expression(groupby="cell_type", group1="T cells")
```

### `mode` (Hypothesis Testing Mode)

**"vanilla" mode**: Point null hypothesis
- Tests if β = 0 exactly
- More sensitive, but may find trivially small effects

**"change" mode** (the default in recent scvi-tools releases): Composite null hypothesis
- Tests if |β| ≤ δ
- Requires biologically meaningful change
- Reduces false discoveries of tiny effects

```python
# Change mode with minimum effect size
de = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
    mode="change",
    delta=0.25,  # Minimum absolute log fold-change
)
```

### `delta`
Minimum effect size threshold for "change" mode, expressed as an absolute log2 fold-change.
- Typical values: 0.25, 0.5, 0.7 (log2 scale)
- log2(1.5) ≈ 0.58 (1.5-fold change)
- log2(2) = 1.0 (2-fold change)

```python
# Require at least 1.5-fold change
de = model.differential_expression(
    groupby="condition",
    group1="disease",
    group2="healthy",
    mode="change",
    delta=0.58,  # log2(1.5)
)
```

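To pick `delta` from a fold change that is meaningful for the biology at hand, the conversion listed above is a single logarithm. The helper below is a hypothetical convenience function, not part of scvi-tools, and assumes `delta` is on the log2 scale as in the values quoted above.

```python
import math


def delta_for_fold_change(fold_change: float) -> float:
    """Convert a minimum fold change (e.g. 1.5x) into a log2-scale delta."""
    if fold_change <= 1.0:
        raise ValueError("fold_change must be greater than 1")
    return math.log2(fold_change)


for fc in (1.5, 2.0):
    print(f"{fc}x -> delta = {delta_for_fold_change(fc):.2f}")
```

The returned value can be passed directly as `delta` together with `mode="change"`.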
### `fdr_target`
False discovery rate threshold (default: 0.05). The name of the boolean flag column in the results follows this value.

```python
# More stringent FDR control
# (the flag column is then expected to be named "is_de_fdr_0.01")
de = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    fdr_target=0.01,
)
```

### `batch_correction`
Whether to correct for batch effects during DE testing. Recent scvi-tools releases default to `False`, so enable it explicitly when needed and check the signature for your installed version.

```python
# Compare groups using each cell's observed batch (no batch correction)
de = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
    batch_correction=False,
)
```

### `n_samples`
Number of posterior samples for estimation (default: 5000)
- More samples = more accurate but slower
- Reduce for speed, increase for precision

```python
# High precision analysis
de = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    n_samples=10000,
)
```

## Interpreting Results

### Output Columns

The results DataFrame contains several important columns. Exact column names depend on the scvi-tools version and on `mode` (the effect-size and `is_de_fdr_*` columns are produced in "change" mode), so inspect `de_results.columns` before relying on a name.

**Effect Size Estimates**:
- `lfc_mean`: Mean log fold-change
- `lfc_median`: Median log fold-change
- `lfc_std`: Standard deviation of log fold-change
- `lfc_min`: Smallest log fold-change among the posterior samples
- `lfc_max`: Largest log fold-change among the posterior samples

**Statistical Significance**:
- `bayes_factor`: Bayes factor for differential expression, reported on a log scale
  - Higher values = stronger evidence
  - Values above 3 are often considered meaningful
- `is_de_fdr_<fdr_target>`: Boolean indicating if the gene is DE at the chosen FDR, for example `is_de_fdr_0.05` with the default `fdr_target=0.05` or `is_de_fdr_0.1` with `fdr_target=0.1`

**Expression Levels**:
- `raw_mean1`: Mean observed count in group 1
- `raw_mean2`: Mean observed count in group 2
- `non_zeros_proportion1`: Proportion of non-zero cells in group 1
- `non_zeros_proportion2`: Proportion of non-zero cells in group 2

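A log-scale Bayes factor is easier to read once converted back to a probability. The sketch below assumes that `bayes_factor` is the natural-log odds of differential expression, log(p / (1 - p)); that relationship is an assumption of this example and should be checked against the documentation of the installed scvi-tools version. The function name is hypothetical.

```python
import numpy as np


def bayes_factor_to_proba(log_bf):
    """Invert log_bf = log(p / (1 - p)) to recover the posterior probability p."""
    return 1.0 / (1.0 + np.exp(-np.asarray(log_bf, dtype=float)))


log_bfs = np.array([0.0, 3.0, -3.0])
probas = bayes_factor_to_proba(log_bfs)
print(probas.round(3))
```

Under this assumption, the commonly quoted threshold of 3 corresponds to a posterior probability of roughly 0.95 that the gene is differentially expressed.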
### Example Interpretation

```python
de_results = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
)

# Find significantly upregulated genes in T cells
upreg_tcells = de_results[
    (de_results["is_de_fdr_0.05"]) &
    (de_results["lfc_mean"] > 0)
].sort_values("lfc_mean", ascending=False)

print(f"Upregulated genes in T cells: {len(upreg_tcells)}")
print(upreg_tcells.head(10))

# Find genes with large effect sizes
large_effect = de_results[
    (de_results["is_de_fdr_0.05"]) &
    (de_results["lfc_mean"].abs() > 1)  # 2-fold change
]
```

## Advanced Usage

### DE Within Specific Cells

```python
# Test DE only within a subset of cells
subset_indices = adata.obs["tissue"] == "lung"

# Each comparison needs its own parentheses because `&` binds
# more tightly than `==`
de = model.differential_expression(
    idx1=(adata.obs["cell_type"] == "T cells") & subset_indices,
    idx2=(adata.obs["cell_type"] == "B cells") & subset_indices,
)
```

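Boolean masks like the ones passed to `idx1` and `idx2` can be checked on a small table before running the model. The sketch below uses a toy DataFrame as a stand-in for `adata.obs`; the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs (hypothetical values)
obs = pd.DataFrame({
    "cell_type": ["T cells", "B cells", "T cells", "B cells"],
    "tissue": ["lung", "lung", "blood", "lung"],
})

in_lung = obs["tissue"] == "lung"

# Parenthesize each comparison before combining it with `&`
idx1 = (obs["cell_type"] == "T cells") & in_lung
idx2 = (obs["cell_type"] == "B cells") & in_lung

print(idx1.tolist(), idx2.tolist())
print(f"group sizes: {idx1.sum()} vs {idx2.sum()}")
```

Printing the group sizes is a cheap sanity check that neither mask is empty and that the two groups do not overlap.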
### Batch-Specific DE

```python
# Test DE within each batch separately
batches = adata.obs["batch"].unique()

batch_de_results = {}
for batch in batches:
    batch_idx = adata.obs["batch"] == batch
    batch_de_results[batch] = model.differential_expression(
        idx1=(adata.obs["condition"] == "treated") & batch_idx,
        idx2=(adata.obs["condition"] == "control") & batch_idx,
    )
```

### Pseudo-bulk DE

```python
# Note: this call does not aggregate counts into pseudo-bulk profiles.
# The test pools posterior samples across the cells of each group,
# which is useful for low cell counts per group.

de = model.differential_expression(
    groupby="cell_type",
    group1="rare_cell_type",
    group2="common_cell_type",
    n_samples=10000,  # More samples for stability
    batch_correction=True,
)
```

## Visualization

### Volcano Plot

```python
import matplotlib.pyplot as plt

de = model.differential_expression(
    groupby="condition",
    group1="treated",
    group2="control",
)

# Volcano-style plot: effect size against strength of evidence.
# The Bayes factor is plotted directly because it can be negative,
# which makes a -log10 transform undefined.
plt.figure(figsize=(10, 6))
plt.scatter(
    de["lfc_mean"],
    de["bayes_factor"],
    c=de["is_de_fdr_0.05"].astype(int),
    cmap="coolwarm",
    alpha=0.5,
)
plt.xlabel("Log Fold Change")
plt.ylabel("Bayes Factor (log scale)")
plt.title("Volcano Plot: Treated vs Control")
plt.axvline(x=0, color="k", linestyle="--", linewidth=0.5)
plt.show()
```

### Heatmap of Top DE Genes

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Get top DE genes
top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index.tolist()

# Get normalized expression for treated and control cells
# (cells are passed as integer positions)
cell_idx = np.where(adata.obs["condition"].isin(["treated", "control"]))[0]
norm_expr = model.get_normalized_expression(
    adata,
    indices=cell_idx,
    gene_list=top_genes,
)

# Reorder columns so rows of the heatmap follow the DE ranking
norm_expr = norm_expr[top_genes]

# Plot heatmap (genes as rows, cells as columns)
plt.figure(figsize=(12, 10))
sns.heatmap(
    norm_expr.T,
    cmap="viridis",
    xticklabels=False,
    yticklabels=True,  # Use gene names from the DataFrame
)
plt.title("Top 50 DE Genes")
plt.show()
```

### Ranked Gene Plot

```python
# Plot genes ranked by effect size
de_sorted = de.sort_values("lfc_mean", ascending=False)

plt.figure(figsize=(12, 6))
plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values)
plt.axhline(y=0, color="r", linestyle="--")
plt.xlabel("Gene Rank")
plt.ylabel("Log Fold Change")
plt.title("Genes Ranked by Effect Size")
plt.show()
```

## Comparison with Traditional Methods

### scvi-tools vs. Wilcoxon Test

```python
import scanpy as sc

# Traditional Wilcoxon test
sc.tl.rank_genes_groups(
    adata,
    groupby="cell_type",
    method="wilcoxon",
    key_added="wilcoxon",
)

# scvi-tools DE
de_scvi = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
)

# Compare results
wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon")
```

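One simple way to compare the two result tables is to measure how much their significant gene sets overlap. The sketch below uses small invented tables in place of `de_scvi` and `wilcox_results`; the gene names and values are made up, and the 0.05 cutoff on `pvals_adj` is an illustrative choice.

```python
import pandas as pd

# Hypothetical result tables; real ones would come from the two methods above
de_scvi = pd.DataFrame(
    {"is_de_fdr_0.05": [True, True, False, True]},
    index=["CD3E", "CD2", "MS4A1", "IL7R"],
)
wilcox_results = pd.DataFrame({
    "names": ["CD3E", "IL7R", "MS4A1", "CD2"],
    "pvals_adj": [1e-10, 1e-4, 0.8, 0.2],
})

# Significant genes according to each method
scvi_hits = set(de_scvi.index[de_scvi["is_de_fdr_0.05"]])
wilcox_hits = set(wilcox_results.loc[wilcox_results["pvals_adj"] < 0.05, "names"])

# Overlap and Jaccard index of the two gene sets
shared = scvi_hits & wilcox_hits
jaccard = len(shared) / len(scvi_hits | wilcox_hits)
print(sorted(shared), round(jaccard, 2))
```

Genes called by only one method are often the most informative to inspect, since they show where the model-based and rank-based tests disagree.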
**Advantages of scvi-tools**:
- Accounts for batch effects automatically
- Handles zero-inflation properly
- Provides uncertainty quantification
- No arbitrary pseudocount needed
- Better statistical properties

**When to use Wilcoxon**:
- Very quick exploratory analysis
- Comparison with published results using Wilcoxon

## Multi-Modal DE

### Protein DE (totalVI)

```python
# Train totalVI on CITE-seq data
# (assumes protein counts are stored in adata.obsm["protein_expression"])
scvi.model.TOTALVI.setup_anndata(
    adata,
    protein_expression_obsm_key="protein_expression",
)
totalvi_model = scvi.model.TOTALVI(adata)
totalvi_model.train()

# A single call returns rows for both genes and proteins
de_all = totalvi_model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
)

# Split by feature type using the gene names in adata.var_names
# (assumes protein names differ from gene names)
is_gene = de_all.index.isin(adata.var_names)
rna_de = de_all[is_gene]
protein_de = de_all[~is_gene]

print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}")
print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}")
```

### Differential Accessibility (PeakVI)

```python
# Train PeakVI on ATAC-seq data
scvi.model.PEAKVI.setup_anndata(atac_adata)
peakvi_model = scvi.model.PEAKVI(atac_adata)
peakvi_model.train()

# Differential accessibility
da = peakvi_model.differential_accessibility(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
)

# Interpreted along the same lines as DE, but the column names differ;
# inspect da.columns before filtering
```

## Handling Special Cases

### Low Cell Count Groups

```python
# Increase posterior samples for stability
de = model.differential_expression(
    groupby="cell_type",
    group1="rare_type",    # e.g., 50 cells
    group2="common_type",  # e.g., 5000 cells
    n_samples=10000,
)
```

### Imbalanced Comparisons

```python
# When groups have very different sizes
# Use change mode to avoid tiny effects

de = model.differential_expression(
    groupby="condition",
    group1="rare_condition",
    group2="common_condition",
    mode="change",
    delta=0.5,
)
```

### Multiple Testing Correction

```python
# Multiplicity is already handled through posterior expected FDP control.
# Bayes factors are not p-values, so p-value corrections such as
# Bonferroni do not apply to them.

# For more conservative calls, lower the FDR target instead
de_strict = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
    mode="change",
    fdr_target=0.01,
)
strict_hits = de_strict[de_strict["is_de_fdr_0.01"]]
```

## Performance Considerations

### Speed Optimization

```python
# Faster DE testing for large datasets
de = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    n_samples=1000,  # Reduce samples
    batch_size=512,  # Increase batch size
)
```

### Memory Management

```python
# For very large datasets
# Test one comparison at a time rather than all pairwise

cell_types = adata.obs["cell_type"].unique()
for ct in cell_types:
    de = model.differential_expression(
        groupby="cell_type",
        group1=ct,
    )
    # Save results to disk so only one table is held in memory
    de.to_csv(f"de_results_{ct}.csv")
```

## Best Practices

1. **Use "change" mode**: For biologically interpretable results
2. **Set appropriate delta**: Based on biological significance
3. **Check expression levels**: Filter lowly expressed genes
4. **Validate findings**: Check marker genes for sanity
5. **Visualize results**: Always plot top DE genes
6. **Report parameters**: Document mode, delta, FDR used
7. **Consider batch effects**: Use batch_correction=True
8. **Multiple comparisons**: Be aware of testing many groups
9. **Sample size**: Ensure sufficient cells per group (>50 recommended)
10. **Biological validation**: Follow up with functional experiments

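The advice to filter lowly expressed genes can be sketched on a results table with pandas. The table below is invented for illustration and uses the column names described in this document; the 10% detection cutoff is an arbitrary example, not a library default.

```python
import pandas as pd

# Hypothetical DE table using column names described in this document
de_results = pd.DataFrame(
    {
        "lfc_mean": [2.1, 1.4, -1.8, 0.9],
        "is_de_fdr_0.05": [True, True, True, False],
        "non_zeros_proportion1": [0.60, 0.02, 0.45, 0.70],
        "non_zeros_proportion2": [0.10, 0.01, 0.80, 0.65],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

min_fraction = 0.10  # illustrative cutoff, not a library default

# Keep genes detected in at least `min_fraction` of cells in either group
expressed = (
    de_results[["non_zeros_proportion1", "non_zeros_proportion2"]].max(axis=1)
    >= min_fraction
)
filtered = de_results[de_results["is_de_fdr_0.05"] & expressed]
print(filtered.index.tolist())
```

Taking the maximum over the two groups keeps genes that are strongly expressed in only one group, which are often the markers of interest.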
## Example: Complete DE Analysis Workflow

```python
import scvi
import scanpy as sc
import matplotlib.pyplot as plt

# 1. Train model
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()

# 2. Perform DE analysis
de_results = model.differential_expression(
    groupby="cell_type",
    group1="Disease_T_cells",
    group2="Healthy_T_cells",
    mode="change",
    delta=0.5,
    fdr_target=0.05,
)

# 3. Filter and analyze
sig_genes = de_results[de_results["is_de_fdr_0.05"]]
upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False)
downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean")

print(f"Significant genes: {len(sig_genes)}")
print(f"Upregulated: {len(upreg)}")
print(f"Downregulated: {len(downreg)}")

# 4. Visualize top genes
top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist()

sc.pl.violin(
    adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])],
    keys=top_genes,
    groupby="cell_type",
    rotation=90,
)

# 5. Functional enrichment (using external tools)
# E.g., g:Profiler, DAVID, or gprofiler-official Python package
upreg_genes = upreg.head(100).index.tolist()
# Perform pathway analysis...

# 6. Save results
de_results.to_csv("de_results_disease_vs_healthy.csv")
upreg.to_csv("upregulated_genes.csv")
downreg.to_csv("downregulated_genes.csv")
```