554 lines
16 KiB
Markdown
554 lines
16 KiB
Markdown
---
|
||
name: pydeseq2
|
||
description: "Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis."
|
||
---
|
||
|
||
# PyDESeq2
|
||
|
||
## Overview
|
||
|
||
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
|
||
|
||
## When to Use This Skill
|
||
|
||
This skill should be used when:
|
||
- Analyzing bulk RNA-seq count data for differential expression
|
||
- Comparing gene expression between experimental conditions (e.g., treated vs control)
|
||
- Performing multi-factor designs accounting for batch effects or covariates
|
||
- Converting R-based DESeq2 workflows to Python
|
||
- Integrating differential expression analysis into Python-based pipelines
|
||
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"
|
||
|
||
## Quick Start Workflow
|
||
|
||
For users who want to perform a standard differential expression analysis:
|
||
|
||
```python
|
||
import pandas as pd
|
||
from pydeseq2.dds import DeseqDataSet
|
||
from pydeseq2.ds import DeseqStats
|
||
|
||
# 1. Load data
|
||
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
|
||
metadata = pd.read_csv("metadata.csv", index_col=0)
|
||
|
||
# 2. Filter low-count genes
|
||
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
|
||
counts_df = counts_df[genes_to_keep]
|
||
|
||
# 3. Initialize and fit DESeq2
|
||
dds = DeseqDataSet(
|
||
counts=counts_df,
|
||
metadata=metadata,
|
||
design="~condition",
|
||
refit_cooks=True
|
||
)
|
||
dds.deseq2()
|
||
|
||
# 4. Perform statistical testing
|
||
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
|
||
ds.summary()
|
||
|
||
# 5. Access results
|
||
results = ds.results_df
|
||
significant = results[results.padj < 0.05]
|
||
print(f"Found {len(significant)} significant genes")
|
||
```
|
||
|
||
## Core Workflow Steps
|
||
|
||
### Step 1: Data Preparation
|
||
|
||
**Input requirements:**
|
||
- **Count matrix:** Samples × genes DataFrame with non-negative integer read counts
|
||
- **Metadata:** Samples × variables DataFrame with experimental factors
|
||
|
||
**Common data loading patterns:**
|
||
|
||
```python
|
||
# From CSV (typical format: genes × samples, needs transpose)
|
||
counts_df = pd.read_csv("counts.csv", index_col=0).T
|
||
metadata = pd.read_csv("metadata.csv", index_col=0)
|
||
|
||
# From TSV
|
||
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
|
||
|
||
# From AnnData
|
||
import anndata as ad
|
||
adata = ad.read_h5ad("data.h5ad")
|
||
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
|
||
metadata = adata.obs
|
||
```
|
||
|
||
**Data filtering:**
|
||
|
||
```python
|
||
# Remove low-count genes
|
||
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
|
||
counts_df = counts_df[genes_to_keep]
|
||
|
||
# Remove samples with missing metadata
|
||
samples_to_keep = ~metadata.condition.isna()
|
||
counts_df = counts_df.loc[samples_to_keep]
|
||
metadata = metadata.loc[samples_to_keep]
|
||
```
|
||
|
||
### Step 2: Design Specification
|
||
|
||
The design formula specifies how gene expression is modeled.
|
||
|
||
**Single-factor designs:**
|
||
```python
|
||
design = "~condition" # Simple two-group comparison
|
||
```
|
||
|
||
**Multi-factor designs:**
|
||
```python
|
||
design = "~batch + condition" # Control for batch effects
|
||
design = "~age + condition" # Include continuous covariate
|
||
design = "~group + condition + group:condition" # Interaction effects
|
||
```
|
||
|
||
**Design formula guidelines:**
|
||
- Use Wilkinson formula notation (R-style)
|
||
- Put adjustment variables (e.g., batch) before the main variable of interest
|
||
- Ensure variables exist as columns in the metadata DataFrame
|
||
- Use appropriate data types (categorical for discrete variables)
|
||
|
||
### Step 3: DESeq2 Fitting
|
||
|
||
Initialize the DeseqDataSet and run the complete pipeline:
|
||
|
||
```python
|
||
from pydeseq2.dds import DeseqDataSet
|
||
|
||
dds = DeseqDataSet(
|
||
counts=counts_df,
|
||
metadata=metadata,
|
||
design="~condition",
|
||
refit_cooks=True, # Refit after removing outliers
|
||
n_cpus=1 # Parallel processing (adjust as needed)
|
||
)
|
||
|
||
# Run the complete DESeq2 pipeline
|
||
dds.deseq2()
|
||
```
|
||
|
||
**What `deseq2()` does:**
|
||
1. Computes size factors (normalization)
|
||
2. Fits genewise dispersions
|
||
3. Fits dispersion trend curve
|
||
4. Computes dispersion priors
|
||
5. Fits MAP dispersions (shrinkage)
|
||
6. Fits log fold changes
|
||
7. Calculates Cook's distances (outlier detection)
|
||
8. Refits if outliers detected (optional)
|
||
|
||
### Step 4: Statistical Testing
|
||
|
||
Perform Wald tests to identify differentially expressed genes:
|
||
|
||
```python
|
||
from pydeseq2.ds import DeseqStats
|
||
|
||
ds = DeseqStats(
|
||
dds,
|
||
contrast=["condition", "treated", "control"], # Test treated vs control
|
||
alpha=0.05, # Significance threshold
|
||
cooks_filter=True, # Filter outliers
|
||
independent_filter=True # Filter low-power tests
|
||
)
|
||
|
||
ds.summary()
|
||
```
|
||
|
||
**Contrast specification:**
|
||
- Format: `[variable, test_level, reference_level]`
|
||
- Example: `["condition", "treated", "control"]` tests treated vs control
|
||
- If `None`, uses the last coefficient in the design
|
||
|
||
**Result DataFrame columns:**
|
||
- `baseMean`: Mean normalized count across samples
|
||
- `log2FoldChange`: Log2 fold change between conditions
|
||
- `lfcSE`: Standard error of LFC
|
||
- `stat`: Wald test statistic
|
||
- `pvalue`: Raw p-value
|
||
- `padj`: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)
|
||
|
||
### Step 5: Optional LFC Shrinkage
|
||
|
||
Apply shrinkage to reduce noise in fold change estimates:
|
||
|
||
```python
|
||
ds.lfc_shrink() # Applies apeGLM shrinkage
|
||
```
|
||
|
||
**When to use LFC shrinkage:**
|
||
- For visualization (volcano plots, heatmaps)
|
||
- For ranking genes by effect size
|
||
- When prioritizing genes for follow-up experiments
|
||
|
||
**Important:** Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
|
||
|
||
### Step 6: Result Export
|
||
|
||
Save results and intermediate objects:
|
||
|
||
```python
|
||
import pickle
|
||
|
||
# Export results as CSV
|
||
ds.results_df.to_csv("deseq2_results.csv")
|
||
|
||
# Save significant genes only
|
||
significant = ds.results_df[ds.results_df.padj < 0.05]
|
||
significant.to_csv("significant_genes.csv")
|
||
|
||
# Save DeseqDataSet for later use
|
||
with open("dds_result.pkl", "wb") as f:
|
||
pickle.dump(dds.to_picklable_anndata(), f)
|
||
```
|
||
|
||
## Common Analysis Patterns
|
||
|
||
### Two-Group Comparison
|
||
|
||
Standard case-control comparison:
|
||
|
||
```python
|
||
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
|
||
dds.deseq2()
|
||
|
||
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
|
||
ds.summary()
|
||
|
||
results = ds.results_df
|
||
significant = results[results.padj < 0.05]
|
||
```
|
||
|
||
### Multiple Comparisons
|
||
|
||
Testing multiple treatment groups against control:
|
||
|
||
```python
|
||
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
|
||
dds.deseq2()
|
||
|
||
treatments = ["treatment_A", "treatment_B", "treatment_C"]
|
||
all_results = {}
|
||
|
||
for treatment in treatments:
|
||
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
|
||
ds.summary()
|
||
all_results[treatment] = ds.results_df
|
||
|
||
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
|
||
print(f"{treatment}: {sig_count} significant genes")
|
||
```
|
||
|
||
### Accounting for Batch Effects
|
||
|
||
Control for technical variation:
|
||
|
||
```python
|
||
# Include batch in design
|
||
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
|
||
dds.deseq2()
|
||
|
||
# Test condition while controlling for batch
|
||
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
|
||
ds.summary()
|
||
```
|
||
|
||
### Continuous Covariates
|
||
|
||
Include continuous variables like age or dosage:
|
||
|
||
```python
|
||
# Ensure continuous variable is numeric
|
||
metadata["age"] = pd.to_numeric(metadata["age"])
|
||
|
||
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
|
||
dds.deseq2()
|
||
|
||
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
|
||
ds.summary()
|
||
```
|
||
|
||
## Using the Analysis Script
|
||
|
||
This skill includes a complete command-line script for standard analyses:
|
||
|
||
```bash
|
||
# Basic usage
|
||
python scripts/run_deseq2_analysis.py \
|
||
--counts counts.csv \
|
||
--metadata metadata.csv \
|
||
--design "~condition" \
|
||
--contrast condition treated control \
|
||
--output results/
|
||
|
||
# With additional options
|
||
python scripts/run_deseq2_analysis.py \
|
||
--counts counts.csv \
|
||
--metadata metadata.csv \
|
||
--design "~batch + condition" \
|
||
--contrast condition treated control \
|
||
--output results/ \
|
||
--min-counts 10 \
|
||
--alpha 0.05 \
|
||
--n-cpus 4 \
|
||
--plots
|
||
```
|
||
|
||
**Script features:**
|
||
- Automatic data loading and validation
|
||
- Gene and sample filtering
|
||
- Complete DESeq2 pipeline execution
|
||
- Statistical testing with customizable parameters
|
||
- Result export (CSV, pickle)
|
||
- Optional visualization (volcano and MA plots)
|
||
|
||
Refer users to `scripts/run_deseq2_analysis.py` when they need a standalone analysis tool or want to batch process multiple datasets.
|
||
|
||
## Result Interpretation
|
||
|
||
### Identifying Significant Genes
|
||
|
||
```python
|
||
# Filter by adjusted p-value
|
||
significant = ds.results_df[ds.results_df.padj < 0.05]
|
||
|
||
# Filter by both significance and effect size
|
||
sig_and_large = ds.results_df[
|
||
(ds.results_df.padj < 0.05) &
|
||
(abs(ds.results_df.log2FoldChange) > 1)
|
||
]
|
||
|
||
# Separate up- and down-regulated
|
||
upregulated = significant[significant.log2FoldChange > 0]
|
||
downregulated = significant[significant.log2FoldChange < 0]
|
||
|
||
print(f"Upregulated: {len(upregulated)}")
|
||
print(f"Downregulated: {len(downregulated)}")
|
||
```
|
||
|
||
### Ranking and Sorting
|
||
|
||
```python
|
||
# Sort by adjusted p-value
|
||
top_by_padj = ds.results_df.sort_values("padj").head(20)
|
||
|
||
# Sort by absolute fold change (use shrunk values)
|
||
ds.lfc_shrink()
|
||
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
|
||
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
|
||
|
||
# Sort by a combined metric
|
||
ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
|
||
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
|
||
```
|
||
|
||
### Quality Metrics
|
||
|
||
```python
|
||
# Check normalization (size factors should be close to 1)
|
||
print("Size factors:", dds.obsm["size_factors"])
|
||
|
||
# Examine dispersion estimates
|
||
import matplotlib.pyplot as plt
|
||
plt.hist(dds.varm["dispersions"], bins=50)
|
||
plt.xlabel("Dispersion")
|
||
plt.ylabel("Frequency")
|
||
plt.title("Dispersion Distribution")
|
||
plt.show()
|
||
|
||
# Check p-value distribution (should be mostly flat with peak near 0)
|
||
plt.hist(ds.results_df.pvalue.dropna(), bins=50)
|
||
plt.xlabel("P-value")
|
||
plt.ylabel("Frequency")
|
||
plt.title("P-value Distribution")
|
||
plt.show()
|
||
```
|
||
|
||
## Visualization Guidelines
|
||
|
||
### Volcano Plot
|
||
|
||
Visualize significance vs effect size:
|
||
|
||
```python
|
||
import matplotlib.pyplot as plt
|
||
import numpy as np
|
||
|
||
results = ds.results_df.copy()
|
||
results["-log10(padj)"] = -np.log10(results.padj)
|
||
|
||
plt.figure(figsize=(10, 6))
|
||
significant = results.padj < 0.05
|
||
|
||
plt.scatter(
|
||
results.loc[~significant, "log2FoldChange"],
|
||
results.loc[~significant, "-log10(padj)"],
|
||
alpha=0.3, s=10, c='gray', label='Not significant'
|
||
)
|
||
plt.scatter(
|
||
results.loc[significant, "log2FoldChange"],
|
||
results.loc[significant, "-log10(padj)"],
|
||
alpha=0.6, s=10, c='red', label='padj < 0.05'
|
||
)
|
||
|
||
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
|
||
plt.xlabel("Log2 Fold Change")
|
||
plt.ylabel("-Log10(Adjusted P-value)")
|
||
plt.title("Volcano Plot")
|
||
plt.legend()
|
||
plt.savefig("volcano_plot.png", dpi=300)
|
||
```
|
||
|
||
### MA Plot
|
||
|
||
Show fold change vs mean expression:
|
||
|
||
```python
|
||
plt.figure(figsize=(10, 6))
|
||
|
||
plt.scatter(
|
||
np.log10(results.loc[~significant, "baseMean"] + 1),
|
||
results.loc[~significant, "log2FoldChange"],
|
||
alpha=0.3, s=10, c='gray'
|
||
)
|
||
plt.scatter(
|
||
np.log10(results.loc[significant, "baseMean"] + 1),
|
||
results.loc[significant, "log2FoldChange"],
|
||
alpha=0.6, s=10, c='red'
|
||
)
|
||
|
||
plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
|
||
plt.xlabel("Log10(Base Mean + 1)")
|
||
plt.ylabel("Log2 Fold Change")
|
||
plt.title("MA Plot")
|
||
plt.savefig("ma_plot.png", dpi=300)
|
||
```
|
||
|
||
## Troubleshooting Common Issues
|
||
|
||
### Data Format Problems
|
||
|
||
**Issue:** "Index mismatch between counts and metadata"
|
||
|
||
**Solution:** Ensure sample names match exactly
|
||
```python
|
||
print("Counts samples:", counts_df.index.tolist())
|
||
print("Metadata samples:", metadata.index.tolist())
|
||
|
||
# Take intersection if needed
|
||
common = counts_df.index.intersection(metadata.index)
|
||
counts_df = counts_df.loc[common]
|
||
metadata = metadata.loc[common]
|
||
```
|
||
|
||
**Issue:** "All genes have zero counts"
|
||
|
||
**Solution:** Check if data needs transposition
|
||
```python
|
||
print(f"Counts shape: {counts_df.shape}")
|
||
# If genes > samples, transpose is needed
|
||
if counts_df.shape[1] < counts_df.shape[0]:
|
||
counts_df = counts_df.T
|
||
```
|
||
|
||
### Design Matrix Issues
|
||
|
||
**Issue:** "Design matrix is not full rank"
|
||
|
||
**Cause:** Confounded variables (e.g., all treated samples in one batch)
|
||
|
||
**Solution:** Remove confounded variable or add interaction term
|
||
```python
|
||
# Check confounding
|
||
print(pd.crosstab(metadata.condition, metadata.batch))
|
||
|
||
# Either simplify design or add interaction
|
||
design = "~condition" # Remove batch
|
||
# OR
|
||
design = "~condition + batch + condition:batch" # Model interaction
|
||
```
|
||
|
||
### No Significant Genes
|
||
|
||
**Diagnostics:**
|
||
```python
|
||
# Check dispersion distribution
|
||
plt.hist(dds.varm["dispersions"], bins=50)
|
||
plt.show()
|
||
|
||
# Check size factors
|
||
print(dds.obsm["size_factors"])
|
||
|
||
# Look at top genes by raw p-value
|
||
print(ds.results_df.nsmallest(20, "pvalue"))
|
||
```
|
||
|
||
**Possible causes:**
|
||
- Small effect sizes
|
||
- High biological variability
|
||
- Insufficient sample size
|
||
- Technical issues (batch effects, outliers)
|
||
|
||
## Reference Documentation
|
||
|
||
For comprehensive details beyond this workflow-oriented guide:
|
||
|
||
- **API Reference** (`references/api_reference.md`): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.
|
||
|
||
- **Workflow Guide** (`references/workflow_guide.md`): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.
|
||
|
||
Load these references into context when users need:
|
||
- Detailed API documentation: `Read references/api_reference.md`
|
||
- Comprehensive workflow examples: `Read references/workflow_guide.md`
|
||
- Troubleshooting guidance: `Read references/workflow_guide.md` (see Troubleshooting section)
|
||
|
||
## Key Reminders
|
||
|
||
1. **Data orientation matters:** Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with `.T` if needed.
|
||
|
||
2. **Sample filtering:** Remove samples with missing metadata before analysis to avoid errors.
|
||
|
||
3. **Gene filtering:** Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.
|
||
|
||
4. **Design formula order:** Put adjustment variables before the variable of interest (e.g., `"~batch + condition"` not `"~condition + batch"`).
|
||
|
||
5. **LFC shrinkage timing:** Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
|
||
|
||
6. **Result interpretation:** Use `padj < 0.05` for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.
|
||
|
||
7. **Contrast specification:** The format is `[variable, test_level, reference_level]` where test_level is compared against reference_level.
|
||
|
||
8. **Save intermediate objects:** Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.
|
||
|
||
## Installation and Requirements
|
||
|
||
```bash
|
||
uv pip install pydeseq2
|
||
```
|
||
|
||
**System requirements:**
|
||
- Python 3.10-3.11
|
||
- pandas 1.4.3+
|
||
- numpy 1.23.0+
|
||
- scipy 1.11.0+
|
||
- scikit-learn 1.1.1+
|
||
- anndata 0.8.0+
|
||
|
||
**Optional for visualization:**
|
||
- matplotlib
|
||
- seaborn
|
||
|
||
## Additional Resources
|
||
|
||
- **Official Documentation:** https://pydeseq2.readthedocs.io
|
||
- **GitHub Repository:** https://github.com/owkin/PyDESeq2
|
||
- **Publication:** Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547
|
||
- **Original DESeq2 (R):** Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8
|