# PyDESeq2 Workflow Guide
This document provides detailed step-by-step workflows for common PyDESeq2 analysis patterns.
## Table of Contents
1. [Complete Differential Expression Analysis](#complete-differential-expression-analysis)
2. [Data Loading and Preparation](#data-loading-and-preparation)
3. [Single-Factor Analysis](#single-factor-analysis)
4. [Multi-Factor Analysis](#multi-factor-analysis)
5. [Result Export and Visualization](#result-export-and-visualization)
6. [Common Patterns and Best Practices](#common-patterns-and-best-practices)
7. [Troubleshooting](#troubleshooting)
---
## Complete Differential Expression Analysis
### Overview
A standard PyDESeq2 analysis consists of 12 main steps across two phases:
**Phase 1: Read Counts Modeling (Steps 1-7)**
- Normalization and dispersion estimation
- Log fold-change fitting
- Outlier detection
**Phase 2: Statistical Analysis (Steps 8-12)**
- Wald testing
- Multiple testing correction
- Optional LFC shrinkage
### Full Workflow Code
```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose if needed
metadata = pd.read_csv("metadata.csv", index_col=0)
# Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
# Initialize DeseqDataSet
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True
)
# Run normalization and fitting
dds.deseq2()
# Perform statistical testing
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True
)
ds.summary()
# Optional: Apply LFC shrinkage for visualization
ds.lfc_shrink()
# Access results
results = ds.results_df
print(results.head())
```
---
## Data Loading and Preparation
### Loading CSV Files
Count data typically comes in genes × samples format but needs to be transposed:
```python
import pandas as pd
# Load count matrix (genes × samples)
counts_df = pd.read_csv("counts.csv", index_col=0)
# Transpose to samples × genes
counts_df = counts_df.T
# Load metadata (already in samples × variables format)
metadata = pd.read_csv("metadata.csv", index_col=0)
```
### Loading from Other Formats
**From TSV:**
```python
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
metadata = pd.read_csv("metadata.tsv", sep="\t", index_col=0)
```
**From saved pickle:**
```python
import pickle
with open("counts.pkl", "rb") as f:
    counts_df = pickle.load(f)
with open("metadata.pkl", "rb") as f:
    metadata = pickle.load(f)
```
**From AnnData:**
```python
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(
    adata.X,
    index=adata.obs_names,
    columns=adata.var_names
)
metadata = adata.obs
```
### Data Filtering
**Filter genes with low counts:**
```python
# Remove genes with fewer than 10 total reads
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
```
**Filter samples with missing metadata:**
```python
# Remove samples where 'condition' column is NA
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
```
**Filter by multiple criteria:**
```python
# Keep only samples that meet all criteria
mask = (
    ~metadata.condition.isna() &
    metadata.batch.isin(["batch1", "batch2"]) &
    (metadata.age >= 18)
)
counts_df = counts_df.loc[mask]
metadata = metadata.loc[mask]
```
### Data Validation
**Check data structure:**
```python
print(f"Counts shape: {counts_df.shape}") # Should be (samples, genes)
print(f"Metadata shape: {metadata.shape}") # Should be (samples, variables)
print(f"Indices match: {counts_df.index.equals(metadata.index)}")
# Check for negative values
assert (counts_df >= 0).all().all(), "Counts must be non-negative"
# Check for non-integer values (avoids the deprecated DataFrame.applymap)
assert (counts_df % 1 == 0).all().all(), "Counts must be integers"
```
---
## Single-Factor Analysis
### Simple Two-Group Comparison
Compare treated vs control samples:
```python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# Design: model expression as a function of condition
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition"
)
dds.deseq2()
# Test treated vs control
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"]
)
ds.summary()
# Results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
```
### Multiple Pairwise Comparisons
When comparing multiple groups:
```python
# Test each treatment vs control
treatments = ["treated_A", "treated_B", "treated_C"]
all_results = {}
for treatment in treatments:
    ds = DeseqStats(
        dds,
        contrast=["condition", treatment, "control"]
    )
    ds.summary()
    all_results[treatment] = ds.results_df
# Compare results across treatments
for name, results in all_results.items():
    sig = results[results.padj < 0.05]
    print(f"{name}: {len(sig)} significant genes")
```
---
## Multi-Factor Analysis
### Two-Factor Design
Account for batch effects while testing condition:
```python
# Design includes both batch and condition
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~batch + condition"
)
dds.deseq2()
# Test condition effect while controlling for batch
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"]
)
ds.summary()
```
### Interaction Effects
Test whether treatment effect differs between groups:
```python
# Design includes interaction term
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~group + condition + group:condition"
)
dds.deseq2()
# Test the interaction term. The contrast must name the fitted interaction
# coefficient, whose exact levels depend on your factor encoding; the
# ellipsis below is a placeholder to fill in for your data.
ds = DeseqStats(dds, contrast=["group:condition", ...])
ds.summary()
```
### Continuous Covariates
Include continuous variables like age:
```python
# Ensure age is numeric in metadata
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~age + condition"
)
dds.deseq2()
```
---
## Result Export and Visualization
### Saving Results
**Export as CSV:**
```python
# Save statistical results
ds.results_df.to_csv("deseq2_results.csv")
# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
# Save with sorted results
sorted_results = ds.results_df.sort_values("padj")
sorted_results.to_csv("sorted_results.csv")
```
**Save DeseqDataSet:**
```python
import pickle
# Save as AnnData for later use
with open("dds_result.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
```
**Load saved results:**
```python
# Load results
results = pd.read_csv("deseq2_results.csv", index_col=0)
# Load AnnData
with open("dds_result.pkl", "rb") as f:
    adata = pickle.load(f)
```
### Basic Visualization
**Volcano plot:**
```python
import matplotlib.pyplot as plt
import numpy as np
results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)
# Plot
plt.figure(figsize=(10, 6))
plt.scatter(
    results.log2FoldChange,
    results["-log10(padj)"],
    alpha=0.5,
    s=10
)
plt.axhline(-np.log10(0.05), color='red', linestyle='--', label='padj=0.05')
plt.axvline(1, color='gray', linestyle='--')
plt.axvline(-1, color='gray', linestyle='--')
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
```
**MA plot:**
```python
plt.figure(figsize=(10, 6))
plt.scatter(
    np.log10(results.baseMean + 1),
    results.log2FoldChange,
    alpha=0.5,
    s=10,
    c=(results.padj < 0.05),
    cmap='bwr'
)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)
```
---
## Common Patterns and Best Practices
### 1. Data Preprocessing Checklist
Before running PyDESeq2:
- ✓ Ensure counts are non-negative integers
- ✓ Verify samples × genes orientation
- ✓ Check that sample names match between counts and metadata
- ✓ Remove or handle missing metadata values
- ✓ Filter low-count genes (typically < 10 total reads)
- ✓ Verify experimental factors are properly encoded
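The checklist above can be sketched as a single validation helper. This is plain pandas, not part of PyDESeq2; the function name, the `factor` parameter, and the default threshold are illustrative:

```python
import pandas as pd

def validate_inputs(counts_df: pd.DataFrame, metadata: pd.DataFrame,
                    factor: str = "condition", min_total: int = 10) -> pd.Index:
    """Run the preprocessing checklist; return the genes passing the count filter."""
    # Counts must be non-negative integers
    assert (counts_df >= 0).all().all(), "Counts must be non-negative"
    assert (counts_df % 1 == 0).all().all(), "Counts must be integers"
    # Samples x genes orientation: sample names must match metadata rows
    assert counts_df.index.equals(metadata.index), "Sample indices must match metadata"
    # No missing values in the factor of interest
    assert not metadata[factor].isna().any(), f"Missing values in '{factor}'"
    # Genes passing the low-count filter
    return counts_df.columns[counts_df.sum(axis=0) >= min_total]
```

Running this before constructing the `DeseqDataSet` turns the checklist into hard failures rather than silent downstream errors.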
### 2. Design Formula Best Practices
**Order matters:** Put adjustment variables before the variable of interest
```python
# Correct: control for batch, test condition
design = "~batch + condition"
# Less ideal: condition listed first
design = "~condition + batch"
```
**Use categorical for discrete variables:**
```python
# Ensure proper data types
metadata["condition"] = metadata["condition"].astype("category")
metadata["batch"] = metadata["batch"].astype("category")
```
### 3. Statistical Testing Guidelines
**Set appropriate alpha:**
```python
# Standard significance threshold
ds = DeseqStats(dds, alpha=0.05)
# More stringent for exploratory analysis
ds = DeseqStats(dds, alpha=0.01)
```
**Use independent filtering:**
```python
# Recommended: filter low-power tests
ds = DeseqStats(dds, independent_filter=True)
# Only disable if you have specific reasons
ds = DeseqStats(dds, independent_filter=False)
```
### 4. LFC Shrinkage
**When to use:**
- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up
**When NOT to use:**
- For reporting statistical significance (use unshrunken p-values)
- For gene set enrichment analysis (typically uses unshrunken values)
```python
# Save both versions
ds.results_df.to_csv("results_unshrunken.csv")
ds.lfc_shrink()
ds.results_df.to_csv("results_shrunken.csv")
```
### 5. Memory Management
For large datasets:
```python
# Use parallel processing
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    n_cpus=4  # Adjust based on available cores
)
# Process in batches if needed
# (split genes into chunks, analyze separately, combine results)
```
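The batching idea in the comment above can be sketched as a chunking helper. The helper itself is plain pandas; `analyze` stands in for a per-chunk PyDESeq2 pipeline (`DeseqDataSet(...).deseq2()` followed by `DeseqStats(...).summary()`, returning `results_df`). Note one caveat: size factors and the dispersion trend are then estimated per chunk, so results will differ slightly from a single full run.

```python
import pandas as pd

def analyze_in_chunks(counts_df: pd.DataFrame, analyze,
                      chunk_size: int = 5000) -> pd.DataFrame:
    """Split the genes (columns) into chunks, run `analyze` on each
    samples x genes sub-matrix, and concatenate the per-gene results."""
    results = []
    for start in range(0, counts_df.shape[1], chunk_size):
        chunk = counts_df.iloc[:, start:start + chunk_size]
        # `analyze` would wrap the DeseqDataSet/DeseqStats calls shown above
        results.append(analyze(chunk))
    return pd.concat(results)
```

One mitigation for the per-chunk estimation caveat is to compute size factors on the full matrix first and reuse them across chunks.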
---
## Troubleshooting
### Error: Index mismatch between counts and metadata
**Problem:** Sample names don't match
```
KeyError: Sample names in counts and metadata don't match
```
**Solution:**
```python
# Check indices
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())
# Align if needed
common_samples = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common_samples]
metadata = metadata.loc[common_samples]
```
### Error: All genes have zero counts
**Problem:** Data might need transposition
```
ValueError: All genes have zero total counts
```
**Solution:**
```python
# Check data orientation
print(f"Counts shape: {counts_df.shape}")
# If genes > samples, likely needs transpose
if counts_df.shape[1] < counts_df.shape[0]:
    counts_df = counts_df.T
```
### Warning: Many genes filtered out
**Problem:** Too many low-count genes removed
**Check:**
```python
# See distribution of gene counts
print(counts_df.sum(axis=0).describe())
# Visualize
import matplotlib.pyplot as plt
plt.hist(counts_df.sum(axis=0), bins=50, log=True)
plt.xlabel("Total counts per gene")
plt.ylabel("Frequency")
plt.show()
```
**Adjust filtering if needed:**
```python
# Try lower threshold
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 5]
```
### Error: Design matrix is not full rank
**Problem:** Confounded design (e.g., all treated samples in one batch)
**Solution:**
```python
# Check design confounding
print(pd.crosstab(metadata.condition, metadata.batch))
# If the two factors are fully confounded, their effects cannot be
# separated and one variable must be dropped from the design:
design = "~condition"  # Drop batch
# Partial confounding can sometimes be resolved by removing the
# offending samples or collapsing sparse factor levels.
```
### Issue: No significant genes found
**Possible causes:**
1. Small effect sizes
2. High biological variability
3. Insufficient sample size
4. Technical issues (batch effects, outliers)
**Diagnostics:**
```python
# Check dispersion estimates
import matplotlib.pyplot as plt
dispersions = dds.varm["dispersions"]
plt.hist(dispersions, bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.show()
# Check size factors (should be close to 1)
print("Size factors:", dds.obsm["size_factors"])
# Look at top genes even if not significant
top_genes = ds.results_df.nsmallest(20, "pvalue")
print(top_genes)
```
### Memory errors on large datasets
**Solutions:**
```python
# 1. Use fewer CPUs (paradoxically can help)
dds = DeseqDataSet(..., n_cpus=1)
# 2. Filter more aggressively
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 20]
# 3. Process in batches
# Split analysis by gene subsets and combine results
```