PyDESeq2 Workflow Guide

This document provides detailed step-by-step workflows for common PyDESeq2 analysis patterns.

Table of Contents

  1. Complete Differential Expression Analysis
  2. Data Loading and Preparation
  3. Single-Factor Analysis
  4. Multi-Factor Analysis
  5. Result Export and Visualization
  6. Common Patterns and Best Practices
  7. Troubleshooting

Complete Differential Expression Analysis

Overview

A standard PyDESeq2 analysis consists of 12 main steps across two phases:

Phase 1: Read Counts Modeling (Steps 1-7)

  • Normalization and dispersion estimation
  • Log fold-change fitting
  • Outlier detection

Phase 2: Statistical Analysis (Steps 8-12)

  • Wald testing
  • Multiple testing correction
  • Optional LFC shrinkage

Full Workflow Code

import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T  # Transpose if needed
metadata = pd.read_csv("metadata.csv", index_col=0)

# Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]

# Initialize DeseqDataSet
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True
)

# Run normalization and fitting
dds.deseq2()

# Perform statistical testing
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True
)
ds.summary()

# Optional: Apply LFC shrinkage for visualization
ds.lfc_shrink()

# Access results
results = ds.results_df
print(results.head())

Data Loading and Preparation

Loading CSV Files

Count data typically comes in genes × samples format but needs to be transposed:

import pandas as pd

# Load count matrix (genes × samples)
counts_df = pd.read_csv("counts.csv", index_col=0)

# Transpose to samples × genes
counts_df = counts_df.T

# Load metadata (already in samples × variables format)
metadata = pd.read_csv("metadata.csv", index_col=0)

Loading from Other Formats

From TSV:

counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
metadata = pd.read_csv("metadata.tsv", sep="\t", index_col=0)

From saved pickle:

import pickle

with open("counts.pkl", "rb") as f:
    counts_df = pickle.load(f)

with open("metadata.pkl", "rb") as f:
    metadata = pickle.load(f)

From AnnData:

import anndata as ad
import pandas as pd

adata = ad.read_h5ad("data.h5ad")

# adata.X may be stored sparse; densify before building a DataFrame
X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
counts_df = pd.DataFrame(
    X,
    index=adata.obs_names,
    columns=adata.var_names
)
metadata = adata.obs

Data Filtering

Filter genes with low counts:

# Remove genes with fewer than 10 total reads
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

Filter samples with missing metadata:

# Remove samples where 'condition' column is NA
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]

Filter by multiple criteria:

# Keep only samples that meet all criteria
mask = (
    ~metadata.condition.isna() &
    (metadata.batch.isin(["batch1", "batch2"])) &
    (metadata.age >= 18)
)
counts_df = counts_df.loc[mask]
metadata = metadata.loc[mask]

Data Validation

Check data structure:

print(f"Counts shape: {counts_df.shape}")  # Should be (samples, genes)
print(f"Metadata shape: {metadata.shape}")  # Should be (samples, variables)
print(f"Indices match: {counts_df.index.equals(metadata.index)}")

# Check for negative values
assert (counts_df >= 0).all().all(), "Counts must be non-negative"

# Check for non-integer values (DESeq2 expects raw, unnormalized counts)
assert (counts_df % 1 == 0).all().all(), "Counts must be integers"

Single-Factor Analysis

Simple Two-Group Comparison

Compare treated vs control samples:

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Design: model expression as a function of condition
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition"
)

dds.deseq2()

# Test treated vs control
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"]
)
ds.summary()

# Results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")

Multiple Pairwise Comparisons

When comparing multiple groups:

# Test each treatment vs control
treatments = ["treated_A", "treated_B", "treated_C"]
all_results = {}

for treatment in treatments:
    ds = DeseqStats(
        dds,
        contrast=["condition", treatment, "control"]
    )
    ds.summary()
    all_results[treatment] = ds.results_df

# Compare results across treatments
for name, results in all_results.items():
    sig = results[results.padj < 0.05]
    print(f"{name}: {len(sig)} significant genes")
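To compare contrasts side by side, the per-contrast frames can be stacked into one long table. A sketch using hypothetical stand-in results; with real data, `all_results` would hold each `ds.results_df`:

```python
import pandas as pd

# Hypothetical stand-ins for each contrast's ds.results_df
all_results = {
    "treated_A": pd.DataFrame({"padj": [0.01, 0.20]}, index=["g1", "g2"]),
    "treated_B": pd.DataFrame({"padj": [0.50, 0.03]}, index=["g1", "g2"]),
}

# Stack into one long table keyed by contrast name
combined = pd.concat(all_results, names=["contrast", "gene"])
print(combined.loc[("treated_A", "g1"), "padj"])  # → 0.01
```

The resulting MultiIndex (contrast, gene) makes it easy to filter or pivot, e.g. `combined.reset_index().pivot(index="gene", columns="contrast", values="padj")`.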

Multi-Factor Analysis

Two-Factor Design

Account for batch effects while testing condition:

# Design includes both batch and condition
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~batch + condition"
)

dds.deseq2()

# Test condition effect while controlling for batch
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"]
)
ds.summary()

Interaction Effects

Test whether treatment effect differs between groups:

# Design includes interaction term
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~group + condition + group:condition"
)

dds.deseq2()

# Test the interaction term
ds = DeseqStats(dds, contrast=["group:condition", ...])
ds.summary()
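If passing an interaction term as a contrast proves awkward in your PyDESeq2 version, a common workaround (sketched here with hypothetical sample and column names) is to collapse the two factors into one combined column and test between its levels with an ordinary contrast:

```python
import pandas as pd

# Hypothetical metadata with 'group' and 'condition' columns
metadata = pd.DataFrame(
    {
        "group": ["A", "A", "B", "B"],
        "condition": ["control", "treated", "control", "treated"],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Collapse the two factors into one combined level, e.g. "A_treated"
metadata["group_condition"] = (
    metadata["group"].astype(str) + "_" + metadata["condition"].astype(str)
)
print(metadata["group_condition"].tolist())
# → ['A_control', 'A_treated', 'B_control', 'B_treated']
```

With `design="~group_condition"`, the treatment effect within group A can then be tested via `contrast=["group_condition", "A_treated", "A_control"]`, and the two within-group results compared.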

Continuous Covariates

Include continuous variables like age:

# Ensure age is numeric in metadata
metadata["age"] = pd.to_numeric(metadata["age"])

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~age + condition"
)

dds.deseq2()
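Centering a continuous covariate can make coefficients easier to interpret: the intercept then refers to the mean age rather than age 0. A minimal sketch with made-up ages (the `age_centered` column name is illustrative):

```python
import pandas as pd

# Made-up ages for illustration
metadata = pd.DataFrame({"age": [25, 40, 33, 58]}, index=["s1", "s2", "s3", "s4"])

# Center so the model intercept corresponds to the mean age, not age 0
metadata["age_centered"] = metadata["age"] - metadata["age"].mean()
print(metadata["age_centered"].tolist())
# → [-14.0, 1.0, -6.0, 19.0]
```

The centered column would then appear in the design, e.g. `design="~age_centered + condition"`.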

Result Export and Visualization

Saving Results

Export as CSV:

# Save statistical results
ds.results_df.to_csv("deseq2_results.csv")

# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")

# Save results sorted by adjusted p-value
sorted_results = ds.results_df.sort_values("padj")
sorted_results.to_csv("sorted_results.csv")

Save DeseqDataSet:

import pickle

# Save as AnnData for later use
with open("dds_result.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)

Load saved results:

# Load results
results = pd.read_csv("deseq2_results.csv", index_col=0)

# Load AnnData
with open("dds_result.pkl", "rb") as f:
    adata = pickle.load(f)

Basic Visualization

Volcano plot:

import matplotlib.pyplot as plt
import numpy as np

results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(
    results.log2FoldChange,
    results["-log10(padj)"],
    alpha=0.5,
    s=10
)
plt.axhline(-np.log10(0.05), color='red', linestyle='--', label='padj=0.05')
plt.axvline(1, color='gray', linestyle='--')
plt.axvline(-1, color='gray', linestyle='--')
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)

MA plot:

plt.figure(figsize=(10, 6))
plt.scatter(
    np.log10(results.baseMean + 1),
    results.log2FoldChange,
    alpha=0.5,
    s=10,
    c=(results.padj < 0.05),
    cmap='bwr'
)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)

Common Patterns and Best Practices

1. Data Preprocessing Checklist

Before running PyDESeq2:

  • ✓ Ensure counts are non-negative integers
  • ✓ Verify samples × genes orientation
  • ✓ Check that sample names match between counts and metadata
  • ✓ Remove or handle missing metadata values
  • ✓ Filter low-count genes (typically < 10 total reads)
  • ✓ Verify experimental factors are properly encoded
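The checklist above can be encoded as quick assertions run before constructing the `DeseqDataSet`, sketched here on a toy samples × genes matrix with hypothetical names:

```python
import pandas as pd

# Toy samples × genes count matrix and matching metadata (hypothetical names)
counts_df = pd.DataFrame(
    {"gene1": [10, 0, 5], "gene2": [3, 7, 2]},
    index=["s1", "s2", "s3"],
)
metadata = pd.DataFrame(
    {"condition": ["control", "treated", "treated"]},
    index=["s1", "s2", "s3"],
)

# Checklist as assertions
assert (counts_df >= 0).all().all()              # non-negative counts
assert (counts_df % 1 == 0).all().all()          # integer counts
assert counts_df.index.equals(metadata.index)    # sample names match
assert not metadata["condition"].isna().any()    # no missing factor values
print("checks passed")
```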

2. Design Formula Best Practices

Order matters: Put adjustment variables before the variable of interest

# Correct: control for batch, test condition
design = "~batch + condition"

# Less ideal: condition listed first
design = "~condition + batch"

Use categorical for discrete variables:

# Ensure proper data types
metadata["condition"] = metadata["condition"].astype("category")
metadata["batch"] = metadata["batch"].astype("category")

3. Statistical Testing Guidelines

Set appropriate alpha:

# Standard significance threshold
ds = DeseqStats(dds, alpha=0.05)

# More stringent threshold when stronger evidence is required
ds = DeseqStats(dds, alpha=0.01)

Use independent filtering:

# Recommended: filter low-power tests
ds = DeseqStats(dds, independent_filter=True)

# Only disable if you have specific reasons
ds = DeseqStats(dds, independent_filter=False)

4. LFC Shrinkage

When to use:

  • For visualization (volcano plots, heatmaps)
  • For ranking genes by effect size
  • When prioritizing genes for follow-up

When NOT to use:

  • For reporting statistical significance (use unshrunken p-values)
  • For gene set enrichment analysis (typically uses unshrunken values)

Save both versions:
# Save both versions
ds.results_df.to_csv("results_unshrunken.csv")
ds.lfc_shrink()
ds.results_df.to_csv("results_shrunken.csv")

5. Memory Management

For large datasets:

# Use parallel processing
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    n_cpus=4  # Adjust based on available cores
)

# Process in batches if needed
# (split genes into chunks, analyze separately, combine results)
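The batching idea can be sketched as a simple generator over gene columns (a hypothetical helper, not part of PyDESeq2). Note that DESeq2's dispersion trend is fit across all genes, so chunked runs can give slightly different results than a single full run:

```python
import numpy as np
import pandas as pd

def gene_chunks(counts_df: pd.DataFrame, chunk_size: int):
    """Yield column-wise slices of a samples × genes matrix."""
    for start in range(0, counts_df.shape[1], chunk_size):
        yield counts_df.iloc[:, start:start + chunk_size]

# Toy 3-sample × 4-gene matrix
counts_df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["s1", "s2", "s3"],
    columns=[f"g{i}" for i in range(4)],
)
sizes = [chunk.shape[1] for chunk in gene_chunks(counts_df, 3)]
print(sizes)  # → [3, 1]
```

Each chunk would be wrapped in its own `DeseqDataSet`, analyzed, and the per-chunk `results_df` frames concatenated at the end.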

Troubleshooting

Error: Index mismatch between counts and metadata

Problem: Sample names don't match

KeyError: Sample names in counts and metadata don't match

Solution:

# Check indices
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())

# Align if needed
common_samples = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common_samples]
metadata = metadata.loc[common_samples]

Error: All genes have zero counts

Problem: Data might need transposition

ValueError: All genes have zero total counts

Solution:

# Check data orientation
print(f"Counts shape: {counts_df.shape}")

# More rows than columns suggests genes are on the rows (genes vastly outnumber samples)
if counts_df.shape[0] > counts_df.shape[1]:
    counts_df = counts_df.T

Warning: Many genes filtered out

Problem: Too many low-count genes removed

Check:

# See distribution of gene counts
print(counts_df.sum(axis=0).describe())

# Visualize
import matplotlib.pyplot as plt
plt.hist(counts_df.sum(axis=0), bins=50, log=True)
plt.xlabel("Total counts per gene")
plt.ylabel("Frequency")
plt.show()

Adjust filtering if needed:

# Try lower threshold
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 5]

Error: Design matrix is not full rank

Problem: Confounded design (e.g., all treated samples in one batch)

Solution:

# Check design confounding
print(pd.crosstab(metadata.condition, metadata.batch))

# Either remove confounded variable or add interaction term
design = "~condition"  # Drop batch
# OR
design = "~condition + batch + condition:batch"  # Add interaction

Issue: No significant genes found

Possible causes:

  1. Small effect sizes
  2. High biological variability
  3. Insufficient sample size
  4. Technical issues (batch effects, outliers)

Diagnostics:

# Check dispersion estimates
import matplotlib.pyplot as plt
dispersions = dds.varm["dispersions"]
plt.hist(dispersions, bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.show()

# Check size factors (should be close to 1)
print("Size factors:", dds.obsm["size_factors"])

# Look at top genes even if not significant
top_genes = ds.results_df.nsmallest(20, "pvalue")
print(top_genes)

Memory errors on large datasets

Solutions:

# 1. Use fewer CPUs (lower parallelism reduces peak memory)
dds = DeseqDataSet(..., n_cpus=1)

# 2. Filter more aggressively
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 20]

# 3. Process in batches
# Split analysis by gene subsets and combine results