Files
gh-k-dense-ai-claude-scient…/skills/scvi-tools/references/differential-expression.md
2025-11-30 08:30:10 +08:00

582 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Differential Expression Analysis in scvi-tools
This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework.
## Overview
scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods:
- **Batch correction**: DE testing on batch-corrected representations
- **Uncertainty quantification**: Probabilistic estimates of effect sizes
- **Zero-inflation handling**: Proper modeling of dropout and zeros
- **Flexible comparisons**: Between any groups or cell types
- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI)
## Core Statistical Framework
### Problem Definition
The goal is to estimate the log fold-change in expression between two conditions:
```
log fold-change = log(μ_B) - log(μ_A)
```
Where μ_A and μ_B are the mean expression levels in conditions A and B.
### Three-Stage Process
**Stage 1: Estimating Expression Levels**
- Sample from posterior distribution of cellular states
- Generate expression values from the learned generative model
- Aggregate across cells to get population-level estimates
**Stage 2: Detecting Relevant Features (Hypothesis Testing)**
- Test for differential expression using Bayesian framework
- Two testing modes available:
- **"vanilla" mode**: Point null hypothesis (β = 0)
- **"change" mode**: Composite hypothesis (|β| ≤ δ)
**Stage 3: Controlling False Discovery**
- Posterior expected False Discovery Proportion (FDP) control
- Selects maximum number of discoveries ensuring E[FDP] ≤ α
## Basic Usage
### Simple Two-Group Comparison
```python
import scvi
# After training a model
model = scvi.model.SCVI(adata)
model.train()
# Compare two cell types
de_results = model.differential_expression(
groupby="cell_type",
group1="T cells",
group2="B cells"
)
# View top DE genes
top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20)
print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]])
```
### One vs. Rest Comparison
```python
# Compare one group against all others
de_results = model.differential_expression(
groupby="cell_type",
group1="T cells" # No group2 = compare to rest
)
```
### All Pairwise Comparisons
```python
# Compare all cell types pairwise
all_comparisons = {}
cell_types = adata.obs["cell_type"].unique()
for ct1 in cell_types:
for ct2 in cell_types:
if ct1 != ct2:
key = f"{ct1}_vs_{ct2}"
all_comparisons[key] = model.differential_expression(
groupby="cell_type",
group1=ct1,
group2=ct2
)
```
## Key Parameters
### `groupby` (required)
Column in `adata.obs` defining groups to compare.
```python
# Must be a categorical variable
de_results = model.differential_expression(groupby="cell_type")
```
### `group1` and `group2`
Groups to compare. If `group2` is None, compares `group1` to all others.
```python
# Specific comparison
de = model.differential_expression(groupby="condition", group1="treated", group2="control")
# One vs rest
de = model.differential_expression(groupby="cell_type", group1="T cells")
```
### `mode` (Hypothesis Testing Mode)
**"vanilla" mode** (default): Point null hypothesis
- Tests if β = 0 exactly
- More sensitive, but may find trivially small effects
**"change" mode**: Composite null hypothesis
- Tests if |β| ≤ δ
- Requires biologically meaningful change
- Reduces false discoveries of tiny effects
```python
# Change mode with minimum effect size
de = model.differential_expression(
groupby="cell_type",
group1="T cells",
group2="B cells",
mode="change",
delta=0.25 # Minimum log fold-change
)
```
### `delta`
Minimum effect size threshold for "change" mode.
- Typical values: 0.25, 0.5, 0.7 (log scale)
- log2(1.5) ≈ 0.58 (1.5-fold change)
- log2(2) = 1.0 (2-fold change)
```python
# Require at least 1.5-fold change
de = model.differential_expression(
groupby="condition",
group1="disease",
group2="healthy",
mode="change",
delta=0.58 # log2(1.5)
)
```
### `fdr_target`
False discovery rate threshold (default: 0.05)
```python
# More stringent FDR control
de = model.differential_expression(
groupby="cell_type",
group1="T cells",
fdr_target=0.01
)
```
### `batch_correction`
Whether to perform batch correction during DE testing (default: True)
```python
# Test within a specific batch
de = model.differential_expression(
groupby="cell_type",
group1="T cells",
group2="B cells",
batch_correction=False
)
```
### `n_samples`
Number of posterior samples for estimation (default: 5000)
- More samples = more accurate but slower
- Reduce for speed, increase for precision
```python
# High precision analysis
de = model.differential_expression(
groupby="cell_type",
group1="T cells",
n_samples=10000
)
```
## Interpreting Results
### Output Columns
The results DataFrame contains several important columns:
**Effect Size Estimates**:
- `lfc_mean`: Mean log fold-change
- `lfc_median`: Median log fold-change
- `lfc_std`: Standard deviation of log fold-change
- `lfc_min`: Lower bound of effect size
- `lfc_max`: Upper bound of effect size
**Statistical Significance**:
- `bayes_factor`: Bayes factor for differential expression
- Higher values = stronger evidence
- >3 often considered meaningful
- `is_de_fdr_0.05`: Boolean indicating if gene is DE at FDR 0.05
- `is_de_fdr_0.1`: Boolean indicating if gene is DE at FDR 0.1
**Expression Levels**:
- `mean1`: Mean expression in group 1
- `mean2`: Mean expression in group 2
- `non_zeros_proportion1`: Proportion of non-zero cells in group 1
- `non_zeros_proportion2`: Proportion of non-zero cells in group 2
### Example Interpretation
```python
de_results = model.differential_expression(
groupby="cell_type",
group1="T cells",
group2="B cells"
)
# Find significantly upregulated genes in T cells
upreg_tcells = de_results[
(de_results["is_de_fdr_0.05"]) &
(de_results["lfc_mean"] > 0)
].sort_values("lfc_mean", ascending=False)
print(f"Upregulated genes in T cells: {len(upreg_tcells)}")
print(upreg_tcells.head(10))
# Find genes with large effect sizes
large_effect = de_results[
(de_results["is_de_fdr_0.05"]) &
(abs(de_results["lfc_mean"]) > 1) # 2-fold change
]
```
## Advanced Usage
### DE Within Specific Cells
```python
# Test DE only within a subset of cells
subset_indices = adata.obs["tissue"] == "lung"
de = model.differential_expression(
idx1=adata.obs["cell_type"] == "T cells" & subset_indices,
idx2=adata.obs["cell_type"] == "B cells" & subset_indices
)
```
### Batch-Specific DE
```python
# Test DE within each batch separately
batches = adata.obs["batch"].unique()
batch_de_results = {}
for batch in batches:
batch_idx = adata.obs["batch"] == batch
batch_de_results[batch] = model.differential_expression(
idx1=(adata.obs["condition"] == "treated") & batch_idx,
idx2=(adata.obs["condition"] == "control") & batch_idx
)
```
### Pseudo-bulk DE
```python
# Aggregate cells before DE testing
# Useful for low cell counts per group
de = model.differential_expression(
groupby="cell_type",
group1="rare_cell_type",
group2="common_cell_type",
n_samples=10000, # More samples for stability
batch_correction=True
)
```
## Visualization
### Volcano Plot
```python
import matplotlib.pyplot as plt
import numpy as np
de = model.differential_expression(
groupby="condition",
group1="treated",
group2="control"
)
# Volcano plot
plt.figure(figsize=(10, 6))
plt.scatter(
de["lfc_mean"],
-np.log10(1 / (de["bayes_factor"] + 1)),
c=de["is_de_fdr_0.05"],
cmap="coolwarm",
alpha=0.5
)
plt.xlabel("Log Fold Change")
plt.ylabel("-log10(1/Bayes Factor)")
plt.title("Volcano Plot: Treated vs Control")
plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
plt.show()
```
### Heatmap of Top DE Genes
```python
import seaborn as sns
# Get top DE genes
top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index
# Get normalized expression
norm_expr = model.get_normalized_expression(
adata,
indices=adata.obs["condition"].isin(["treated", "control"]),
gene_list=top_genes
)
# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(
norm_expr.T,
cmap="viridis",
xticklabels=False,
yticklabels=top_genes
)
plt.title("Top 50 DE Genes")
plt.show()
```
### Ranked Gene Plot
```python
# Plot genes ranked by effect size
de_sorted = de.sort_values("lfc_mean", ascending=False)
plt.figure(figsize=(12, 6))
plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Gene Rank")
plt.ylabel("Log Fold Change")
plt.title("Genes Ranked by Effect Size")
plt.show()
```
## Comparison with Traditional Methods
### scvi-tools vs. Wilcoxon Test
```python
import scanpy as sc
# Traditional Wilcoxon test
sc.tl.rank_genes_groups(
adata,
groupby="cell_type",
method="wilcoxon",
key_added="wilcoxon"
)
# scvi-tools DE
de_scvi = model.differential_expression(
groupby="cell_type",
group1="T cells"
)
# Compare results
wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon")
```
**Advantages of scvi-tools**:
- Accounts for batch effects automatically
- Handles zero-inflation properly
- Provides uncertainty quantification
- No arbitrary pseudocount needed
- Better statistical properties
**When to use Wilcoxon**:
- Very quick exploratory analysis
- Comparison with published results using Wilcoxon
## Multi-Modal DE
### Protein DE (totalVI)
```python
# Train totalVI on CITE-seq data
totalvi_model = scvi.model.TOTALVI(adata)
totalvi_model.train()
# RNA differential expression
rna_de = totalvi_model.differential_expression(
groupby="cell_type",
group1="T cells",
group2="B cells",
protein_expression=False # Default
)
# Protein differential expression
protein_de = totalvi_model.differential_expression(
groupby="cell_type",
group1="T cells",
group2="B cells",
protein_expression=True
)
print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}")
print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}")
```
### Differential Accessibility (PeakVI)
```python
# Train PeakVI on ATAC-seq data
peakvi_model = scvi.model.PEAKVI(atac_adata)
peakvi_model.train()
# Differential accessibility
da = peakvi_model.differential_accessibility(
groupby="cell_type",
group1="T cells",
group2="B cells"
)
# Same interpretation as DE
```
## Handling Special Cases
### Low Cell Count Groups
```python
# Increase posterior samples for stability
de = model.differential_expression(
groupby="cell_type",
group1="rare_type", # e.g., 50 cells
group2="common_type", # e.g., 5000 cells
n_samples=10000
)
```
### Imbalanced Comparisons
```python
# When groups have very different sizes
# Use change mode to avoid tiny effects
de = model.differential_expression(
groupby="condition",
group1="rare_condition",
group2="common_condition",
mode="change",
delta=0.5
)
```
### Multiple Testing Correction
```python
# Already included via FDP control
# But can apply additional corrections
from statsmodels.stats.multitest import multipletests
# Bonferroni correction (very conservative)
_, pvals_corrected, _, _ = multipletests(
1 / (de["bayes_factor"] + 1),
method="bonferroni"
)
```
## Performance Considerations
### Speed Optimization
```python
# Faster DE testing for large datasets
de = model.differential_expression(
groupby="cell_type",
group1="T cells",
n_samples=1000, # Reduce samples
batch_size=512 # Increase batch size
)
```
### Memory Management
```python
# For very large datasets
# Test one comparison at a time rather than all pairwise
cell_types = adata.obs["cell_type"].unique()
for ct in cell_types:
de = model.differential_expression(
groupby="cell_type",
group1=ct
)
# Save results
de.to_csv(f"de_results_{ct}.csv")
```
## Best Practices
1. **Use "change" mode**: For biologically interpretable results
2. **Set appropriate delta**: Based on biological significance
3. **Check expression levels**: Filter lowly expressed genes
4. **Validate findings**: Check marker genes for sanity
5. **Visualize results**: Always plot top DE genes
6. **Report parameters**: Document mode, delta, FDR used
7. **Consider batch effects**: Use batch_correction=True
8. **Multiple comparisons**: Be aware of testing many groups
9. **Sample size**: Ensure sufficient cells per group (>50 recommended)
10. **Biological validation**: Follow up with functional experiments
## Example: Complete DE Analysis Workflow
```python
import scvi
import scanpy as sc
import matplotlib.pyplot as plt
# 1. Train model
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()
# 2. Perform DE analysis
de_results = model.differential_expression(
groupby="cell_type",
group1="Disease_T_cells",
group2="Healthy_T_cells",
mode="change",
delta=0.5,
fdr_target=0.05
)
# 3. Filter and analyze
sig_genes = de_results[de_results["is_de_fdr_0.05"]]
upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False)
downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean")
print(f"Significant genes: {len(sig_genes)}")
print(f"Upregulated: {len(upreg)}")
print(f"Downregulated: {len(downreg)}")
# 4. Visualize top genes
top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist()
sc.pl.violin(
adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])],
keys=top_genes,
groupby="cell_type",
rotation=90
)
# 5. Functional enrichment (using external tools)
# E.g., g:Profiler, DAVID, or gprofiler-official Python package
upreg_genes = upreg.head(100).index.tolist()
# Perform pathway analysis...
# 6. Save results
de_results.to_csv("de_results_disease_vs_healthy.csv")
upreg.to_csv("upregulated_genes.csv")
downreg.to_csv("downregulated_genes.csv")
```