Initial commit
skills/pydeseq2/references/api_reference.md
Normal file
@@ -0,0 +1,228 @@
# PyDESeq2 API Reference

This document provides a comprehensive API reference for PyDESeq2 classes, methods, and utilities.

## Core Classes

### DeseqDataSet

The main class for differential expression analysis that handles data processing from normalization through log fold change fitting.

**Purpose:** Implements dispersion and log fold change (LFC) estimation for RNA-seq count data.
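
For orientation, the two estimated quantities can be placed in the negative binomial model described in the DESeq2 literature. This is a summary of that published formulation, not a statement about PyDESeq2's internal parameterization; the symbols are introduced here for illustration only.

```latex
\begin{aligned}
K_{ij} &\sim \mathrm{NB}(\mu_{ij}, \alpha_i), \\
\mu_{ij} &= s_j \, q_{ij}, \\
\log_2 q_{ij} &= \sum_r x_{jr} \, \beta_{ir}, \\
\operatorname{Var}(K_{ij}) &= \mu_{ij} + \alpha_i \, \mu_{ij}^2 .
\end{aligned}
```

Here $K_{ij}$ is the count of gene $i$ in sample $j$, $s_j$ is the sample's size factor, $\alpha_i$ is the gene's dispersion, $x_{jr}$ is the design matrix built from the `design` formula, and $\beta_{ir}$ are the log fold change coefficients.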

**Initialization Parameters:**
- `counts`: pandas DataFrame of shape (samples × genes) containing non-negative integer read counts
- `metadata`: pandas DataFrame of shape (samples × variables) with sample annotations
- `design`: str, Wilkinson formula specifying the statistical model (e.g., "~condition", "~group + condition")
- `refit_cooks`: bool, whether to refit parameters after removing Cook's distance outliers (default: True)
- `n_cpus`: int, number of CPUs to use for parallel processing (optional)
- `quiet`: bool, suppress progress messages (default: False)

**Key Methods:**

#### `deseq2()`
Run the complete DESeq2 pipeline for normalization and dispersion/LFC fitting.

**Steps performed:**
1. Compute normalization factors (size factors)
2. Fit genewise dispersions
3. Fit dispersion trend curve
4. Calculate dispersion priors
5. Fit MAP (maximum a posteriori) dispersions
6. Fit log fold changes
7. Calculate Cook's distances for outlier detection
8. Optionally refit if `refit_cooks=True`
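
Step 1 can be made concrete with a small sketch. The code below is an illustrative, standard-library-only version of the median-of-ratios normalization described in the DESeq2 literature. It is not PyDESeq2's own implementation, and the function name `size_factors` and the toy data are assumptions made for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a samples x genes count table.

    `counts` is a list of rows, one row of raw counts per sample.
    (Hypothetical helper for illustration, not a PyDESeq2 function.)
    """
    n_samples = len(counts)
    n_genes = len(counts[0])

    # Per-gene mean of log counts (the log of the geometric mean), using
    # only genes with a nonzero count in every sample.
    log_geo_means = {}
    for g in range(n_genes):
        column = [counts[s][g] for s in range(n_samples)]
        if all(c > 0 for c in column):
            log_geo_means[g] = sum(math.log(c) for c in column) / n_samples

    if not log_geo_means:
        raise ValueError("every gene has a zero count in at least one sample")

    # A sample's size factor is the median, over the retained genes, of the
    # ratio between its count and the gene's geometric mean.
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(counts[s][g]) - m for g, m in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: the second sample was sequenced about twice as deeply as the first.
toy_counts = [
    [10, 20, 30, 0],
    [20, 40, 60, 5],
]
factors = size_factors(toy_counts)
normalized = [[c / f for c in row] for row, f in zip(toy_counts, factors)]
```

Dividing each sample's counts by its factor puts the samples on a common scale. In the toy data, both samples end up with the same normalized counts for the first three genes; the fourth gene is ignored when computing the factors because it has a zero count.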

**Returns:** None (modifies object in-place)

#### `to_picklable_anndata()`
Convert the DeseqDataSet to an AnnData object that can be saved with pickle.

**Returns:** AnnData object with:
- `X`: count data matrix
- `obs`: sample-level metadata (1D)
- `var`: gene-level metadata (1D)
- `varm`: gene-level multi-dimensional data (e.g., LFC estimates)

**Usage:**
```python
import pickle
with open("result_adata.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
```

**Attributes (after running deseq2()):**
- `layers`: dict containing various matrices (normalized counts, etc.)
- `varm`: dict containing gene-level results (log fold changes, dispersions, etc.)
- `obsm`: dict containing sample-level information
- `uns`: dict containing global parameters

---

### DeseqStats

Class for performing statistical tests and computing p-values for differential expression.

**Purpose:** Facilitates PyDESeq2 statistical tests using Wald tests and optional LFC shrinkage.

**Initialization Parameters:**
- `dds`: DeseqDataSet object that has been processed with `deseq2()`
- `contrast`: list or None, specifies the contrast for testing
  - Format: `[variable, test_level, reference_level]`
  - Example: `["condition", "treated", "control"]` tests treated vs control
  - If None, uses the last coefficient in the design formula
- `alpha`: float, significance threshold for independent filtering (default: 0.05)
- `cooks_filter`: bool, whether to filter outliers based on Cook's distance (default: True)
- `independent_filter`: bool, whether to perform independent filtering (default: True)
- `n_cpus`: int, number of CPUs for parallel processing (optional)
- `quiet`: bool, suppress progress messages (default: False)

**Key Methods:**

#### `summary()`
Run Wald tests and compute p-values and adjusted p-values.

**Steps performed:**
1. Run Wald statistical tests for specified contrast
2. Optional Cook's distance filtering
3. Optional independent filtering to remove low-power tests
4. Multiple testing correction (Benjamini-Hochberg procedure)
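
The Benjamini-Hochberg adjustment in step 4 can be sketched in a few lines. This is a generic textbook implementation in plain Python, shown only to illustrate what the correction does; it is not taken from PyDESeq2, and the function name `benjamini_hochberg` and the example p-values are assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    (Hypothetical helper for illustration, not a PyDESeq2 function.)
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values never
    # increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
padj = benjamini_hochberg(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone. Note that the second and third genes receive the same adjusted value even though their raw p-values differ.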

**Returns:** None (results stored in `results_df` attribute)

**Result DataFrame columns:**
- `baseMean`: mean normalized count across all samples
- `log2FoldChange`: log2 fold change between conditions
- `lfcSE`: standard error of the log2 fold change
- `stat`: Wald test statistic
- `pvalue`: raw p-value
- `padj`: adjusted p-value (FDR-corrected)
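
The columns above are related to one another in the usual Wald-test way: the statistic is the estimate divided by its standard error, and the p-value is the two-sided tail of a standard normal. The sketch below illustrates that relationship with the standard library only. It assumes the textbook definition rather than quoting PyDESeq2's code, and `wald_test` and its inputs are hypothetical.

```python
import math


def wald_test(log2_fold_change, lfc_se):
    """Return the Wald statistic and its two-sided normal p-value.

    (Hypothetical helper for illustration, not a PyDESeq2 function.)
    """
    stat = log2_fold_change / lfc_se
    # Two-sided tail probability of a standard normal, 2 * (1 - Phi(|stat|)),
    # written with erfc to stay accurate for large statistics.
    pvalue = math.erfc(abs(stat) / math.sqrt(2))
    return stat, pvalue


stat, pvalue = wald_test(log2_fold_change=1.5, lfc_se=0.5)
```

With an estimate three standard errors away from zero, the p-value is roughly 0.0027, which matches the familiar three-sigma rule.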

#### `lfc_shrink(coeff=None)`
Apply shrinkage to log fold changes using the apeGLM method.

**Purpose:** Reduces noise in LFC estimates for better visualization and ranking, especially for genes with low counts or high variability.

**Parameters:**
- `coeff`: str or None, coefficient name to shrink (if None, uses the coefficient from the contrast)

**Important:** Shrinkage is applied only for visualization/ranking purposes. The statistical test results (p-values, adjusted p-values) remain unchanged.

**Returns:** None (updates `results_df` with shrunk LFCs)

**Attributes:**
- `results_df`: pandas DataFrame containing test results (available after `summary()`)

---

## Utility Functions

### `pydeseq2.utils.load_example_data(modality="raw_counts", dataset="synthetic")`

Load synthetic example datasets for testing and tutorials.

**Parameters:**
- `modality`: str, either "raw_counts" or "metadata", selecting which table to load
- `dataset`: str, name of the example dataset (default: "synthetic")

**Returns:** a single pandas DataFrame; call the function once per modality
- `modality="raw_counts"`: pandas DataFrame with synthetic count data
- `modality="metadata"`: pandas DataFrame with sample annotations

---

## Preprocessing Module

The `pydeseq2.preprocessing` module provides utilities for data preparation.

**Common operations:**
- Gene filtering based on minimum read counts
- Sample filtering based on metadata criteria
- Data transformation and normalization

---

## Inference Classes

### Inference
Abstract base class defining the interface for DESeq2-related inference methods.

### DefaultInference
Default implementation of inference methods using scipy, sklearn, and numpy.

**Purpose:** Provides the mathematical implementations for:
- GLM (Generalized Linear Model) fitting
- Dispersion estimation
- Trend curve fitting
- Statistical testing

---

## Data Structure Requirements

### Count Matrix
- **Shape:** (samples × genes)
- **Type:** pandas DataFrame
- **Values:** Non-negative integers (raw read counts)
- **Index:** Sample identifiers (must match metadata index)
- **Columns:** Gene identifiers

### Metadata
- **Shape:** (samples × variables)
- **Type:** pandas DataFrame
- **Index:** Sample identifiers (must match count matrix index)
- **Columns:** Experimental factors (e.g., "condition", "batch", "group")
- **Values:** Categorical or continuous variables used in the design formula

### Important Notes
- Sample order must match between counts and metadata
- Missing values in metadata should be handled before analysis
- Gene names should be unique
- Count files often need transposition: `counts_df = counts_df.T`

---

## Common Workflow Pattern

```python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# 1. Initialize dataset
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True
)

# 2. Fit dispersions and LFCs
dds.deseq2()

# 3. Perform statistical testing
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05
)
ds.summary()

# 4. Optional: Shrink LFCs for visualization
ds.lfc_shrink()

# 5. Access results
results = ds.results_df
```

---

## Version Compatibility

PyDESeq2 aims to match the default settings of DESeq2 v1.34.0. Some differences may exist as it is a from-scratch reimplementation in Python.

**Tested with:**
- Python 3.10-3.11
- anndata 0.8.0+
- numpy 1.23.0+
- pandas 1.4.3+
- scikit-learn 1.1.1+
- scipy 1.11.0+