
Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00


PyDESeq2 API Reference

This document is an API reference for the main PyDESeq2 classes, methods, and utilities.

Core Classes

DeseqDataSet

The main class for differential expression analysis that handles data processing from normalization through log-fold change fitting.

Purpose: Implements dispersion and log fold-change (LFC) estimation for RNA-seq count data.

Initialization Parameters:

counts: pandas DataFrame of shape (samples × genes) containing non-negative integer read counts
metadata: pandas DataFrame of shape (samples × variables) with sample annotations
design: str, Wilkinson formula specifying the statistical model (e.g., "~condition", "~group + condition")
refit_cooks: bool, whether to refit parameters after removing Cook's distance outliers (default: True)
n_cpus: int, number of CPUs to use for parallel processing (optional)
quiet: bool, suppress progress messages (default: False)

Key Methods:

`deseq2()`

Run the complete DESeq2 pipeline for normalization and dispersion/LFC fitting.

Steps performed:

Compute normalization factors (size factors)
Fit genewise dispersions
Fit dispersion trend curve
Calculate dispersion priors
Fit MAP (maximum a posteriori) dispersions
Fit log fold changes
Calculate Cook's distances for outlier detection
Optionally refit if refit_cooks=True

Returns: None (modifies object in-place)
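The first step, size factor estimation, can be illustrated with a small NumPy sketch of the median-of-ratios idea. This is an illustration only, not PyDESeq2's internal code: the function name `median_of_ratios_size_factors` and the toy counts are made up for the example, and it simply skips genes that have a zero count in any sample.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Illustrative median-of-ratios size factors for a (samples x genes) array."""
    counts = np.asarray(counts, dtype=float)
    # Only genes with a nonzero count in every sample have a finite log geometric mean.
    usable = (counts > 0).all(axis=0)
    if not usable.any():
        raise ValueError("every gene has at least one zero count")
    log_counts = np.log(counts[:, usable])
    # The per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_means = log_counts.mean(axis=0)
    # Size factor = median over genes of each sample's ratio to the reference.
    return np.exp(np.median(log_counts - log_geo_means, axis=1))


# Toy data (invented): the second sample is the first one at twice the depth.
toy_counts = np.array([[10, 20, 30, 40], [20, 40, 60, 80]])
size_factors = median_of_ratios_size_factors(toy_counts)
normalized = toy_counts / size_factors[:, None]
```

In this toy case the second size factor is twice the first, so both rows of `normalized` coincide after dividing by the size factors.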

`to_picklable_anndata()`

Convert the DeseqDataSet to an AnnData object that can be saved with pickle.

Returns: AnnData object with:

X: count data matrix
obs: sample-level metadata (1D)
var: gene-level metadata (1D)
varm: gene-level multi-dimensional data (e.g., LFC estimates)

Usage:

```python
import pickle

with open("result_adata.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
```

Attributes (after running deseq2()):

layers: dict containing various matrices (normalized counts, etc.)
varm: dict containing gene-level results (log fold changes, dispersions, etc.)
obsm: dict containing sample-level information
uns: dict containing global parameters

DeseqStats

Class for performing statistical tests and computing p-values for differential expression.

Purpose: Runs Wald tests on a fitted DeseqDataSet, with optional LFC shrinkage.

Initialization Parameters:

dds: DeseqDataSet object that has been processed with deseq2()
contrast: list or None, specifies the contrast for testing
- Format: [variable, test_level, reference_level]
- Example: ["condition", "treated", "control"] tests treated vs control
- If None, uses the last coefficient in the design formula
alpha: float, significance threshold for independent filtering (default: 0.05)
cooks_filter: bool, whether to filter outliers based on Cook's distance (default: True)
independent_filter: bool, whether to perform independent filtering (default: True)
n_cpus: int, number of CPUs for parallel processing (optional)
quiet: bool, suppress progress messages (default: False)
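The direction of a contrast is easy to get backwards, so here is a minimal sketch of the sign convention: the test level is the numerator and the reference level is the denominator. It uses a raw ratio of group means on invented numbers, not the GLM estimate that PyDESeq2 fits, and the helper `raw_log2_fold_change` is hypothetical.

```python
import numpy as np

# Hypothetical normalized counts for one gene (values invented for illustration).
treated = np.array([80.0, 120.0, 100.0])
control = np.array([20.0, 30.0, 25.0])


def raw_log2_fold_change(test_values, reference_values):
    """Return log2(mean of test level / mean of reference level)."""
    return float(np.log2(np.mean(test_values) / np.mean(reference_values)))


# Analogous to contrast=["condition", "treated", "control"]: positive when treated is higher.
lfc_treated_vs_control = raw_log2_fold_change(treated, control)
# Swapping the last two entries of the contrast flips the sign.
lfc_control_vs_treated = raw_log2_fold_change(control, treated)
```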

Key Methods:

`summary()`

Run Wald tests and compute p-values and adjusted p-values.

Steps performed:

Run Wald statistical tests for specified contrast
Optional Cook's distance filtering
Optional independent filtering to remove low-power tests
Multiple testing correction (Benjamini-Hochberg procedure)

Returns: None (results stored in results_df attribute)
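To make the multiple testing step concrete, here is a minimal NumPy sketch of the Benjamini-Hochberg adjustment. It is an illustration of the procedure, not PyDESeq2's own code: the function name is made up, and it assumes every p-value is finite, whereas genes removed by filtering would need separate handling.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```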

Result DataFrame columns:

baseMean: mean normalized count across all samples
log2FoldChange: log2 fold change between conditions
lfcSE: standard error of the log2 fold change
stat: Wald test statistic
pvalue: raw p-value
padj: adjusted p-value (FDR-corrected)
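The columns relate to each other in a simple way: `stat` is the log fold change divided by its standard error, and a two-sided p-value follows from the standard normal distribution. The sketch below shows that relationship and a common filtering pattern on a hypothetical results table. All values and thresholds are invented for illustration, and the table stands in for `results_df`.

```python
import math

import pandas as pd

# Hypothetical results table; every value is invented for illustration.
results = pd.DataFrame(
    {
        "log2FoldChange": [2.0, -0.2, -3.0],
        "lfcSE": [0.5, 0.4, 1.0],
        "padj": [0.001, 0.80, 0.02],
    },
    index=["geneA", "geneB", "geneC"],
)


def wald_two_sided_pvalue(lfc, se):
    """Two-sided p-value of z = lfc / se under a standard normal."""
    z = lfc / se
    return math.erfc(abs(z) / math.sqrt(2.0))


results["stat"] = results["log2FoldChange"] / results["lfcSE"]
results["pvalue"] = [
    wald_two_sided_pvalue(lfc, se)
    for lfc, se in zip(results["log2FoldChange"], results["lfcSE"])
]

# Keep genes passing an adjusted p-value cutoff and an effect-size cutoff.
significant = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 1.0)]
```

The effect-size cutoff of 1.0 (a two-fold change) is an arbitrary choice for the example, not a PyDESeq2 default.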

`lfc_shrink(coeff=None)`

Apply shrinkage to log fold changes using the apeGLM method.

Purpose: Reduces noise in LFC estimates for better visualization and ranking, especially for genes with low counts or high variability.

Parameters:

coeff: str or None, coefficient name to shrink (if None, uses the coefficient from the contrast)

Important: Shrinkage is applied only for visualization/ranking purposes. The statistical test results (p-values, adjusted p-values) remain unchanged.

Returns: None (updates results_df with shrunk LFCs)

Attributes:

results_df: pandas DataFrame containing test results (available after summary())

Utility Functions

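`pydeseq2.utils.load_example_data(modality="raw_counts", dataset="synthetic", debug=False, debug_seed=42)`

Load synthetic example data for testing and tutorials.

Parameters:

modality: str, which table to load, either "raw_counts" or "metadata" (default: "raw_counts")
dataset: str, name of the example dataset; only "synthetic" is available (default: "synthetic")
debug: bool, if True, load a random subsample of the data for quick test runs (default: False)
debug_seed: int, random seed used for the debug subsample (default: 42)

Returns: pandas DataFrame

Each call returns a single table, so loading counts and metadata takes two calls:

modality="raw_counts": pandas DataFrame with synthetic count data, shape (samples × genes)
modality="metadata": pandas DataFrame with sample annotations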

Preprocessing Module

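The pydeseq2.preprocessing module provides normalization helpers such as deseq2_norm(), which applies median-of-ratios normalization to a count matrix.

Common preparation steps before building a DeseqDataSet (filtering is usually done directly with pandas):

Gene filtering based on minimum read counts
Sample filtering based on metadata criteria
Data transformation and normalization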
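Gene and sample filtering can be sketched with plain pandas. The tables, the `condition` column, and the threshold of 10 total reads are all assumptions chosen for the example, not values prescribed by PyDESeq2.

```python
import pandas as pd

# Toy inputs (invented): counts are samples x genes, metadata has one row per sample.
counts_df = pd.DataFrame(
    {"gene1": [0, 2, 1], "gene2": [50, 40, 60], "gene3": [5, 10, 7]},
    index=["s1", "s2", "s3"],
)
metadata = pd.DataFrame(
    {"condition": ["control", None, "treated"]},
    index=["s1", "s2", "s3"],
)

# Sample filtering: drop samples whose design variable is missing.
samples_to_keep = metadata["condition"].notna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]

# Gene filtering: keep genes with enough reads summed over the remaining samples.
min_total_count = 10  # hypothetical threshold
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= min_total_count]
counts_df = counts_df[genes_to_keep]
```

Filtering samples first matters here, because the per-gene totals are computed only over the samples that will enter the analysis.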

Inference Classes

Inference

Abstract base class defining the interface for DESeq2-related inference methods.

DefaultInference

Default implementation of inference methods using scipy, sklearn, and numpy.

Purpose: Provides the mathematical implementations for:

GLM (Generalized Linear Model) fitting
Dispersion estimation
Trend curve fitting
Statistical testing

Data Structure Requirements

Count Matrix

Shape: (samples × genes)
Type: pandas DataFrame
Values: Non-negative integers (raw read counts)
Index: Sample identifiers (must match metadata index)
Columns: Gene identifiers

Metadata

Shape: (samples × variables)
Type: pandas DataFrame
Index: Sample identifiers (must match count matrix index)
Columns: Experimental factors (e.g., "condition", "batch", "group")
Values: Categorical or continuous variables used in the design formula

Important Notes

Sample order must match between counts and metadata
Missing values in metadata should be handled before analysis
Gene names should be unique
Count files often store genes as rows and samples as columns, so they need transposing to (samples × genes): counts_df = counts_df.T
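These requirements can be checked up front with a small pandas helper. The sketch below is one possible approach, not part of PyDESeq2: `prepare_inputs` is a hypothetical function, and it assumes sample identifiers never coincide with gene identifiers.

```python
import pandas as pd


def prepare_inputs(counts_df, metadata):
    """Return (counts, metadata) with samples as rows, in the metadata's order."""
    samples = metadata.index
    # Transpose when the sample identifiers sit in the columns instead of the index.
    if not counts_df.index.isin(samples).all() and counts_df.columns.isin(samples).all():
        counts_df = counts_df.T
    if set(counts_df.index) != set(samples):
        raise ValueError("counts and metadata must describe the same samples")
    if not counts_df.columns.is_unique:
        raise ValueError("gene identifiers must be unique")
    if (counts_df < 0).any().any():
        raise ValueError("counts must be non-negative")
    if metadata.isna().any().any():
        raise ValueError("handle missing metadata values before analysis")
    # Reorder the count rows so they follow the metadata order.
    return counts_df.loc[samples], metadata


# Toy input (invented): genes as rows, samples as columns, in a different order.
genes_by_samples = pd.DataFrame(
    {"s2": [3, 4], "s1": [1, 2]},
    index=["geneA", "geneB"],
)
sample_info = pd.DataFrame({"condition": ["control", "treated"]}, index=["s1", "s2"])
aligned_counts, aligned_metadata = prepare_inputs(genes_by_samples, sample_info)
```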

Common Workflow Pattern

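```python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# counts_df (samples × genes) and metadata (samples × variables) are assumed
# to be loaded already, with matching sample indices.

# 1. Initialize dataset
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True
)

# 2. Fit dispersions and LFCs
dds.deseq2()

# 3. Perform statistical testing
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05
)
ds.summary()

# 4. Optional: Shrink LFCs for visualization
# Some releases may require naming the coefficient explicitly,
# e.g. coeff="condition[T.treated]"; check the installed version.
ds.lfc_shrink()

# 5. Access results
results = ds.results_df
```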

Version Compatibility

PyDESeq2 aims to match the default settings of DESeq2 v1.34.0. Some differences may exist as it is a from-scratch reimplementation in Python.

Tested with:

Python 3.10-3.11
anndata 0.8.0+
numpy 1.23.0+
pandas 1.4.3+
scikit-learn 1.1.1+
scipy 1.11.0+
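A quick way to compare an environment against this list is to query the installed package versions with the standard library. This is a minimal sketch: the distribution names are assumed to match the usual PyPI names, and it only reports versions rather than enforcing the minimums above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they are usually published on PyPI (assumption).
PACKAGES = ["pydeseq2", "anndata", "numpy", "pandas", "scikit-learn", "scipy"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, installed in versions.items():
    print(f"{name}: {installed or 'not installed'}")
```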
