Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/scvi-tools/references/differential-expression.md
+++ b/skills/scvi-tools/references/differential-expression.md
@@ -0,0 +1,581 @@
+# Differential Expression Analysis in scvi-tools
+
+This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework.
+
+## Overview
+
+scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods:
+
+- **Batch correction**: DE testing on batch-corrected representations
+- **Uncertainty quantification**: Probabilistic estimates of effect sizes
+- **Zero-inflation handling**: Proper modeling of dropout and zeros
+- **Flexible comparisons**: Between any groups or cell types
+- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI)
+
+## Core Statistical Framework
+
+### Problem Definition
+
+The goal is to estimate the log fold-change in expression between two conditions:
+
+```
+log fold-change = log(μ_B) - log(μ_A)
+```
+
+Here μ_A and μ_B are the mean expression levels in conditions A and B, so a positive value means higher expression in B.
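+
+As a small worked example of the formula above, the sketch below computes the log fold-change for made-up mean expression values. The base-2 logarithm is an assumption made for illustration (the formula itself does not fix a base), and `log2_fold_change` is a hypothetical helper, not part of scvi-tools.
+
+```python
+import math
+
+
+def log2_fold_change(mean_a: float, mean_b: float) -> float:
+    """Return log2(mean_b) - log2(mean_a); positive means higher in B."""
+    return math.log2(mean_b) - math.log2(mean_a)
+
+
+# Illustrative values only: a 4-fold increase and a 4-fold decrease in B
+lfc_up = log2_fold_change(mean_a=2.0, mean_b=8.0)
+lfc_down = log2_fold_change(mean_a=8.0, mean_b=2.0)
+```
+
+Swapping the two conditions only flips the sign, which is why the direction of a comparison should always be reported alongside the values.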
+
+### Three-Stage Process
+
+**Stage 1: Estimating Expression Levels**
+- Sample from posterior distribution of cellular states
+- Generate expression values from the learned generative model
+- Aggregate across cells to get population-level estimates
+
+**Stage 2: Detecting Relevant Features (Hypothesis Testing)**
+- Test for differential expression using Bayesian framework
+- Two testing modes available:
+  - **"vanilla" mode**: Point null hypothesis (β = 0)
+  - **"change" mode**: Composite hypothesis (|β| ≤ δ)
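+
+To make the composite hypothesis concrete, here is a minimal sketch of how a "change"-style test could be summarized from posterior samples of the log fold-change. It assumes the evidence is expressed as the log-odds of |β| > δ; the function name, the `eps` constant, and the simulated samples are all hypothetical and are not taken from the scvi-tools implementation.
+
+```python
+import math
+import random
+
+
+def change_mode_summary(lfc_samples, delta=0.25, eps=1e-8):
+    """Summarize posterior LFC samples against the composite null |lfc| <= delta."""
+    n = len(lfc_samples)
+    # Fraction of posterior samples whose effect exceeds the threshold
+    proba_de = sum(abs(x) > delta for x in lfc_samples) / n
+    # Log-odds of "DE" versus "not DE"; eps avoids taking log(0)
+    log_bf = math.log(proba_de + eps) - math.log(1.0 - proba_de + eps)
+    return proba_de, log_bf
+
+
+random.seed(0)
+# Simulated posterior samples (illustrative only)
+strong = [random.gauss(1.0, 0.2) for _ in range(5000)]   # effect well above delta
+null = [random.gauss(0.0, 0.05) for _ in range(5000)]    # effect well inside delta
+
+p_strong, bf_strong = change_mode_summary(strong)
+p_null, bf_null = change_mode_summary(null)
+```
+
+A gene whose posterior mass sits outside the [-δ, δ] band gets a probability near 1 and a positive log-odds, while a gene concentrated inside the band gets a probability near 0.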
+
+**Stage 3: Controlling False Discovery**
+- Posterior expected False Discovery Proportion (FDP) control
+- Selects maximum number of discoveries ensuring E[FDP] ≤ α
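+
+The selection rule in Stage 3 can be sketched as follows, assuming each gene comes with a posterior probability of being differentially expressed. This is an illustrative implementation of the idea, not the library's code; `select_discoveries` and the example probabilities are hypothetical.
+
+```python
+def select_discoveries(proba_de, fdr_target=0.05):
+    """Return indices of the largest gene set whose expected FDP is <= fdr_target."""
+    # Rank genes from most to least likely to be differentially expressed
+    order = sorted(range(len(proba_de)), key=lambda i: proba_de[i], reverse=True)
+    cumulative_error = 0.0
+    best_k = 0
+    for k, i in enumerate(order, start=1):
+        # 1 - proba_de is the posterior probability that this call is false
+        cumulative_error += 1.0 - proba_de[i]
+        # Expected FDP of the top-k set is the mean error over those k genes
+        if cumulative_error / k <= fdr_target:
+            best_k = k
+    return order[:best_k]
+
+
+proba_de = [0.99, 0.2, 0.97, 0.6, 0.995]
+hits = select_discoveries(proba_de, fdr_target=0.05)
+```
+
+With these values the three most confident genes are kept: adding the fourth (probability 0.6) would push the expected FDP above 0.05.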
+
+## Basic Usage
+
+### Simple Two-Group Comparison
+
+```python
+import scvi
+
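+# Register the data, then train a model
+# (assumes raw counts in adata.X; pass layer= or batch_key= as needed)
+scvi.model.SCVI.setup_anndata(adata)
+model = scvi.model.SCVI(adata)
+model.train()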
+
+# Compare two cell types
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# View top DE genes
+top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20)
+print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]])
+```
+
+### One vs. Rest Comparison
+
+```python
+# Compare one group against all others
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells"  # No group2 = compare to rest
+)
+```
+
+### All Pairwise Comparisons
+
+```python
+# Compare all cell types pairwise
+from itertools import combinations
+
+all_comparisons = {}
+
+cell_types = adata.obs["cell_type"].unique()
+
+# Test each unordered pair once: B vs A mirrors A vs B
+# (opposite LFC sign, up to sampling noise)
+for ct1, ct2 in combinations(cell_types, 2):
+    key = f"{ct1}_vs_{ct2}"
+    all_comparisons[key] = model.differential_expression(
+        groupby="cell_type",
+        group1=ct1,
+        group2=ct2
+    )
+```
+
+## Key Parameters
+
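+### `groupby`
+Column in `adata.obs` defining the groups to compare. It can be omitted when cells are selected directly with `idx1` and `idx2`.
+
+```python
+# With only groupby set, each group is compared against all other cells
+de_results = model.differential_expression(groupby="cell_type")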
+```
+
+### `group1` and `group2`
+Groups to compare. If `group2` is None, compares `group1` to all others.
+
+```python
+# Specific comparison
+de = model.differential_expression(groupby="condition", group1="treated", group2="control")
+
+# One vs rest
+de = model.differential_expression(groupby="cell_type", group1="T cells")
+```
+
+### `mode` (Hypothesis Testing Mode)
+
+**"vanilla" mode**: Point null hypothesis
+- Tests if β = 0 exactly
+- More sensitive, but may find trivially small effects
+
+**"change" mode** (the default in recent scvi-tools releases): Composite null hypothesis
+- Tests if |β| ≤ δ
+- Requires a biologically meaningful change
+- Reduces false discoveries of tiny effects
+
+```python
+# Change mode with minimum effect size
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells",
+    mode="change",
+    delta=0.25  # Minimum log fold-change
+)
+```
+
+### `delta`
+Minimum effect size threshold for "change" mode (default: 0.25).
+- Typical values: 0.25, 0.5, 0.7 (log2 fold-change scale)
+- log2(1.5) ≈ 0.58 (1.5-fold change)
+- log2(2) = 1.0 (2-fold change)
+
+```python
+# Require at least 1.5-fold change
+de = model.differential_expression(
+    groupby="condition",
+    group1="disease",
+    group2="healthy",
+    mode="change",
+    delta=0.58  # log2(1.5)
+)
+```
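+
+The conversions listed above can be reproduced with a small helper. This is a sketch that assumes `delta` is expressed on the log2 scale; `delta_for_fold_change` is a hypothetical name, not an scvi-tools function.
+
+```python
+import math
+
+
+def delta_for_fold_change(fold_change: float) -> float:
+    """Convert a minimum fold change (e.g. 1.5) into a log2-scale delta."""
+    if fold_change <= 0:
+        raise ValueError("fold_change must be positive")
+    # abs() lets 0.5 (a halving) and 2.0 (a doubling) map to the same threshold
+    return abs(math.log2(fold_change))
+
+
+deltas = {fc: round(delta_for_fold_change(fc), 2) for fc in (1.25, 1.5, 2.0)}
+```
+
+Choosing the fold change first and deriving `delta` from it keeps the threshold tied to a statement that is easy to report, such as "at least a 1.5-fold change".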
+
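+### `fdr_target`
+False discovery rate threshold (default: 0.05). The boolean flag column in the output is named after this value, for example `is_de_fdr_0.01` when `fdr_target=0.01`.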
+
+```python
+# More stringent FDR control
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    fdr_target=0.01
+)
+```
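+
+Because the name of the flag column depends on `fdr_target`, downstream code that hard-codes `is_de_fdr_0.05` can break when the threshold changes. One way to avoid that, sketched here with a hypothetical helper and a made-up list of column names, is to look the column up by its prefix.
+
+```python
+def find_fdr_flag_column(columns):
+    """Return the first column whose name starts with 'is_de_fdr_', or None."""
+    return next((c for c in columns if c.startswith("is_de_fdr_")), None)
+
+
+# Illustrative column names, as they might appear after fdr_target=0.01
+cols = ["lfc_mean", "bayes_factor", "is_de_fdr_0.01"]
+flag = find_fdr_flag_column(cols)
+```
+
+With a results DataFrame `de`, the same helper would be called as `find_fdr_flag_column(de.columns)` before filtering on the returned column.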
+
+### `batch_correction`
+Whether to correct for batch effects during DE testing (default: False in recent scvi-tools releases; set it to True to marginalize over batches)
+
+```python
+# Without batch correction, each cell is decoded using its own observed batch
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells",
+    batch_correction=False
+)
+```
+
+### `n_samples`
+Number of posterior samples for estimation (default: 5000)
+- More samples = more accurate but slower
+- Reduce for speed, increase for precision
+
+```python
+# High precision analysis
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    n_samples=10000
+)
+```
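+
+The trade-off between `n_samples` and precision can be illustrated with the usual Monte Carlo standard error of an estimated probability. This sketch assumes independent draws, which is a simplification of how posterior samples are actually generated; `mc_standard_error` is a hypothetical helper.
+
+```python
+import math
+
+
+def mc_standard_error(p: float, n_samples: int) -> float:
+    """Standard error of a probability estimated from n independent draws."""
+    return math.sqrt(p * (1.0 - p) / n_samples)
+
+
+# Error of an estimated probability of 0.9 at different sample counts
+errors = {n: mc_standard_error(0.9, n) for n in (1000, 5000, 10000)}
+```
+
+The error shrinks with the square root of the sample count, so doubling `n_samples` from 5000 to 10000 reduces it by only about 30% while roughly doubling the sampling cost.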
+
+## Interpreting Results
+
+### Output Columns
+
+The results DataFrame contains several important columns:
+
+**Effect Size Estimates**:
+- `lfc_mean`: Mean log fold-change
+- `lfc_median`: Median log fold-change
+- `lfc_std`: Standard deviation of log fold-change
+- `lfc_min`: Minimum log fold-change across posterior samples
+- `lfc_max`: Maximum log fold-change across posterior samples
+
+**Statistical Significance**:
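+- `bayes_factor`: Bayes factor for differential expression, reported on the natural-log scale
+  - Higher values = stronger evidence
+  - >3 often considered meaningful
+- `proba_de`: Posterior probability of differential expression ("change" mode)
+- `is_de_fdr_0.05`: Boolean indicating if gene is DE at FDR 0.05; the suffix follows `fdr_target`, so `fdr_target=0.1` yields `is_de_fdr_0.1` instead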
+
+**Expression Levels**:
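+- `raw_mean1`: Mean observed count in group 1
+- `raw_mean2`: Mean observed count in group 2
+- `scale1`: Mean model-normalized expression in group 1
+- `scale2`: Mean model-normalized expression in group 2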
+- `non_zeros_proportion1`: Proportion of non-zero cells in group 1
+- `non_zeros_proportion2`: Proportion of non-zero cells in group 2
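+
+A log Bayes factor is easier to read once converted back to a probability. The sketch below assumes the reported value is the natural-log odds of differential expression, log(p / (1 - p)); if your version reports it differently, the conversion does not apply. `proba_from_log_bayes_factor` is a hypothetical helper.
+
+```python
+import math
+
+
+def proba_from_log_bayes_factor(log_bf: float) -> float:
+    """Invert log_bf = log(p / (1 - p)) to recover p."""
+    return 1.0 / (1.0 + math.exp(-log_bf))
+
+
+p_at_zero = proba_from_log_bayes_factor(0.0)   # no evidence either way
+p_at_three = proba_from_log_bayes_factor(3.0)  # the commonly used cutoff
+```
+
+Under this assumption, the rule-of-thumb cutoff of 3 corresponds to a posterior probability of roughly 0.95.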
+
+### Example Interpretation
+
+```python
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# Find significantly upregulated genes in T cells
+upreg_tcells = de_results[
+    (de_results["is_de_fdr_0.05"]) &
+    (de_results["lfc_mean"] > 0)
+].sort_values("lfc_mean", ascending=False)
+
+print(f"Upregulated genes in T cells: {len(upreg_tcells)}")
+print(upreg_tcells.head(10))
+
+# Find genes with large effect sizes
+large_effect = de_results[
+    (de_results["is_de_fdr_0.05"]) &
+    (abs(de_results["lfc_mean"]) > 1)  # 2-fold change
+]
+```
+
+## Advanced Usage
+
+### DE Within Specific Cells
+
+```python
+# Test DE only within a subset of cells
+subset_indices = adata.obs["tissue"] == "lung"
+
+de = model.differential_expression(
+    # Parentheses are required: & binds more tightly than ==
+    idx1=(adata.obs["cell_type"] == "T cells") & subset_indices,
+    idx2=(adata.obs["cell_type"] == "B cells") & subset_indices
+)
+```
+
+### Batch-Specific DE
+
+```python
+# Test DE within each batch separately
+batches = adata.obs["batch"].unique()
+
+batch_de_results = {}
+for batch in batches:
+    batch_idx = adata.obs["batch"] == batch
+    batch_de_results[batch] = model.differential_expression(
+        idx1=(adata.obs["condition"] == "treated") & batch_idx,
+        idx2=(adata.obs["condition"] == "control") & batch_idx
+    )
+```
+
+### Pseudo-bulk DE
+
+```python
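+# Note: this call does not aggregate cells into pseudo-bulk profiles.
+# It still tests at the single-cell level and draws more posterior samples,
+# which helps stabilize estimates when a group has few cells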
+
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="rare_cell_type",
+    group2="common_cell_type",
+    n_samples=10000,  # More samples for stability
+    batch_correction=True
+)
+```
+
+## Visualization
+
+### Volcano Plot
+
+```python
+import matplotlib.pyplot as plt
+import numpy as np
+
+de = model.differential_expression(
+    groupby="condition",
+    group1="treated",
+    group2="control"
+)
+
+# Volcano-style plot: effect size vs. strength of evidence
+plt.figure(figsize=(10, 6))
+plt.scatter(
+    de["lfc_mean"],
+    de["bayes_factor"],  # already on a log scale, so no further transform
+    c=de["is_de_fdr_0.05"].astype(int),
+    cmap="coolwarm",
+    alpha=0.5
+)
+plt.xlabel("Log Fold Change")
+plt.ylabel("Log Bayes Factor")
+plt.title("Volcano Plot: Treated vs Control")
+plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
+plt.show()
+```
+
+### Heatmap of Top DE Genes
+
+```python
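+import numpy as np
+import seaborn as sns
+
+# Get top DE genes
+top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index
+
+# Get normalized expression for the compared cells
+# (cells are passed as integer positions rather than a boolean Series)
+cell_idx = np.where(adata.obs["condition"].isin(["treated", "control"]))[0]
+norm_expr = model.get_normalized_expression(
+    adata,
+    indices=cell_idx,
+    gene_list=top_genes
+)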
+
+# Plot heatmap
+plt.figure(figsize=(12, 10))
+sns.heatmap(
+    norm_expr.T,
+    cmap="viridis",
+    xticklabels=False,
+    yticklabels=top_genes
+)
+plt.title("Top 50 DE Genes")
+plt.show()
+```
+
+### Ranked Gene Plot
+
+```python
+# Plot genes ranked by effect size
+de_sorted = de.sort_values("lfc_mean", ascending=False)
+
+plt.figure(figsize=(12, 6))
+plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values)
+plt.axhline(y=0, color='r', linestyle='--')
+plt.xlabel("Gene Rank")
+plt.ylabel("Log Fold Change")
+plt.title("Genes Ranked by Effect Size")
+plt.show()
+```
+
+## Comparison with Traditional Methods
+
+### scvi-tools vs. Wilcoxon Test
+
+```python
+import scanpy as sc
+
+# Traditional Wilcoxon test
+sc.tl.rank_genes_groups(
+    adata,
+    groupby="cell_type",
+    method="wilcoxon",
+    key_added="wilcoxon"
+)
+
+# scvi-tools DE
+de_scvi = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells"
+)
+
+# Compare results
+wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon")
+```
+
+**Advantages of scvi-tools**:
+- Can account for batch effects (`batch_correction=True`)
+- Handles zero-inflation properly
+- Provides uncertainty quantification
+- No arbitrary pseudocount needed
+- Effect-size thresholding built into the test via `delta` in "change" mode
+
+**When to use Wilcoxon**:
+- Very quick exploratory analysis
+- Comparison with published results using Wilcoxon
+
+## Multi-Modal DE
+
+### Protein DE (totalVI)
+
+```python
+# Train totalVI on CITE-seq data
+# (assumes setup_anndata was already run with the protein counts registered)
+totalvi_model = scvi.model.TOTALVI(adata)
+totalvi_model.train()
+
+# A single call tests genes and proteins together;
+# both feature types come back as rows of one DataFrame
+de = totalvi_model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
+# Split the rows by feature type
+# (assumes protein counts are a DataFrame in adata.obsm["protein_expression"])
+protein_names = adata.obsm["protein_expression"].columns
+is_protein = de.index.isin(protein_names)
+rna_de = de[~is_protein]
+protein_de = de[is_protein]
+
+print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}")
+print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}")
+```
+
+### Differential Accessibility (PeakVI)
+
+```python
+# Train PeakVI on ATAC-seq data
+peakvi_model = scvi.model.PEAKVI(atac_adata)
+peakvi_model.train()
+
+# Differential accessibility
+da = peakvi_model.differential_accessibility(
+    groupby="cell_type",
+    group1="T cells",
+    group2="B cells"
+)
+
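+# Interpretation is analogous to DE, but the column names differ
+# (e.g. `prob_da` and `effect_size` rather than `lfc_*`); check da.columns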
+```
+
+## Handling Special Cases
+
+### Low Cell Count Groups
+
+```python
+# Increase posterior samples for stability
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="rare_type",  # e.g., 50 cells
+    group2="common_type",  # e.g., 5000 cells
+    n_samples=10000
+)
+```
+
+### Imbalanced Comparisons
+
+```python
+# When groups have very different sizes
+# Use change mode to avoid tiny effects
+
+de = model.differential_expression(
+    groupby="condition",
+    group1="rare_condition",
+    group2="common_condition",
+    mode="change",
+    delta=0.5
+)
+```
+
+### Multiple Testing Correction
+
+```python
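+# Multiplicity is already handled by the posterior expected FDP control.
+# Bayes factors and posterior probabilities are not p-values, so p-value
+# corrections such as Bonferroni should not be applied to them.
+
+# For a more conservative gene list, lower fdr_target when calling
+# differential_expression, or require a high posterior probability of DE
+# (assumes the "change"-mode `proba_de` column is present)
+conservative = de[de["proba_de"] > 0.99]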
+
+## Performance Considerations
+
+### Speed Optimization
+
+```python
+# Faster DE testing for large datasets
+de = model.differential_expression(
+    groupby="cell_type",
+    group1="T cells",
+    n_samples=1000,  # Reduce samples
+    batch_size=512    # Increase batch size
+)
+```
+
+### Memory Management
+
+```python
+# For very large datasets
+# Test one comparison at a time rather than all pairwise
+
+cell_types = adata.obs["cell_type"].unique()
+for ct in cell_types:
+    de = model.differential_expression(
+        groupby="cell_type",
+        group1=ct
+    )
+    # Save results
+    de.to_csv(f"de_results_{ct}.csv")
+```
+
+## Best Practices
+
+1. **Use "change" mode**: For biologically interpretable results
+2. **Set appropriate delta**: Based on biological significance
+3. **Check expression levels**: Filter lowly expressed genes
+4. **Validate findings**: Check marker genes for sanity
+5. **Visualize results**: Always plot top DE genes
+6. **Report parameters**: Document mode, delta, FDR used
+7. **Consider batch effects**: Use batch_correction=True
+8. **Multiple comparisons**: Be aware of testing many groups
+9. **Sample size**: Ensure sufficient cells per group (>50 recommended)
+10. **Biological validation**: Follow up with functional experiments
+
+## Example: Complete DE Analysis Workflow
+
+```python
+import scvi
+import scanpy as sc
+import matplotlib.pyplot as plt
+
+# 1. Train model
+scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
+model = scvi.model.SCVI(adata)
+model.train()
+
+# 2. Perform DE analysis
+de_results = model.differential_expression(
+    groupby="cell_type",
+    group1="Disease_T_cells",
+    group2="Healthy_T_cells",
+    mode="change",
+    delta=0.5,
+    fdr_target=0.05
+)
+
+# 3. Filter and analyze
+sig_genes = de_results[de_results["is_de_fdr_0.05"]]
+upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False)
+downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean")
+
+print(f"Significant genes: {len(sig_genes)}")
+print(f"Upregulated: {len(upreg)}")
+print(f"Downregulated: {len(downreg)}")
+
+# 4. Visualize top genes
+top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist()
+
+sc.pl.violin(
+    adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])],
+    keys=top_genes,
+    groupby="cell_type",
+    rotation=90
+)
+
+# 5. Functional enrichment (using external tools)
+# E.g., g:Profiler, DAVID, or gprofiler-official Python package
+upreg_genes = upreg.head(100).index.tolist()
+# Perform pathway analysis...
+
+# 6. Save results
+de_results.to_csv("de_results_disease_vs_healthy.csv")
+upreg.to_csv("upregulated_genes.csv")
+downreg.to_csv("downregulated_genes.csv")
+```