Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/pydeseq2/references/api_reference.md
+++ b/skills/pydeseq2/references/api_reference.md
@@ -0,0 +1,228 @@
+# PyDESeq2 API Reference
+
+This document provides a comprehensive API reference for PyDESeq2 classes, methods, and utilities.
+
+## Core Classes
+
+### DeseqDataSet
+
+The main class for differential expression analysis that handles data processing from normalization through log-fold change fitting.
+
+**Purpose:** Implements dispersion and log fold-change (LFC) estimation for RNA-seq count data.
+
+**Initialization Parameters:**
+- `counts`: pandas DataFrame of shape (samples × genes) containing non-negative integer read counts
+- `metadata`: pandas DataFrame of shape (samples × variables) with sample annotations
+- `design`: str, Wilkinson formula specifying the statistical model (e.g., "~condition", "~group + condition")
+- `refit_cooks`: bool, whether to refit parameters after removing Cook's distance outliers (default: True)
+- `n_cpus`: int, number of CPUs to use for parallel processing (optional)
+- `quiet`: bool, suppress progress messages (default: False)
+
+**Key Methods:**
+
+#### `deseq2()`
+Run the complete DESeq2 pipeline for normalization and dispersion/LFC fitting.
+
+**Steps performed:**
+1. Compute normalization factors (size factors)
+2. Fit genewise dispersions
+3. Fit dispersion trend curve
+4. Calculate dispersion priors
+5. Fit MAP (maximum a posteriori) dispersions
+6. Fit log fold changes
+7. Calculate Cook's distances for outlier detection
+8. Optionally refit if `refit_cooks=True`
+
+**Returns:** None (modifies object in-place)
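+
+Step 1 can be illustrated outside the library. The following is a minimal NumPy sketch of median-of-ratios size factors, the normalization scheme associated with DESeq2. The function name `median_of_ratios_size_factors` and the toy counts are illustrative assumptions, not PyDESeq2's own implementation.
+
+```python
+import numpy as np
+
+
+def median_of_ratios_size_factors(counts):
+    """Return one size factor per sample for a (samples, genes) count array."""
+    counts = np.asarray(counts, dtype=float)
+    # Keep only genes with no zero counts, so every log is finite
+    expressed = (counts > 0).all(axis=0)
+    log_counts = np.log(counts[:, expressed])
+    # Per-gene log geometric mean across samples (a pseudo-reference sample)
+    log_geo_means = log_counts.mean(axis=0)
+    # Size factor = median over genes of (count / geometric mean)
+    return np.exp(np.median(log_counts - log_geo_means, axis=1))
+
+
+# Hypothetical toy data: 2 samples x 3 genes
+counts = np.array([[10, 20, 30], [20, 40, 60]])
+size_factors = median_of_ratios_size_factors(counts)
+print(size_factors)
+```
+
+In this toy matrix the second sample has exactly twice the depth of the first, so its size factor comes out twice as large.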
+
+#### `to_picklable_anndata()`
+Convert the DeseqDataSet to an AnnData object that can be saved with pickle.
+
+**Returns:** AnnData object with:
+- `X`: count data matrix
+- `obs`: sample-level metadata (1D)
+- `var`: gene-level metadata (1D)
+- `varm`: gene-level multi-dimensional data (e.g., LFC estimates)
+
+**Usage:**
+```python
+import pickle
+with open("result_adata.pkl", "wb") as f:
+    pickle.dump(dds.to_picklable_anndata(), f)
+```
+
+**Attributes (after running deseq2()):**
+- `layers`: dict containing various matrices (normalized counts, etc.)
+- `varm`: dict containing gene-level results (log fold changes, dispersions, etc.)
+- `obsm`: dict containing sample-level information
+- `uns`: dict containing global parameters
+
+---
+
+### DeseqStats
+
+Class for performing statistical tests and computing p-values for differential expression.
+
+**Purpose:** Runs Wald tests on a fitted `DeseqDataSet` and optionally applies LFC shrinkage.
+
+**Initialization Parameters:**
+- `dds`: DeseqDataSet object that has been processed with `deseq2()`
+- `contrast`: list or None, specifies the contrast for testing
+  - Format: `[variable, test_level, reference_level]`
+  - Example: `["condition", "treated", "control"]` tests treated vs control
+  - If None, uses the last coefficient in the design formula
+- `alpha`: float, significance threshold for independent filtering (default: 0.05)
+- `cooks_filter`: bool, whether to filter outliers based on Cook's distance (default: True)
+- `independent_filter`: bool, whether to perform independent filtering (default: True)
+- `n_cpus`: int, number of CPUs for parallel processing (optional)
+- `quiet`: bool, suppress progress messages (default: False)
+
+**Key Methods:**
+
+#### `summary()`
+Run Wald tests and compute p-values and adjusted p-values.
+
+**Steps performed:**
+1. Run Wald statistical tests for specified contrast
+2. Optional Cook's distance filtering
+3. Optional independent filtering to remove low-power tests
+4. Multiple testing correction (Benjamini-Hochberg procedure)
+
+**Returns:** None (results stored in `results_df` attribute)
+
+**Result DataFrame columns:**
+- `baseMean`: mean normalized count across all samples
+- `log2FoldChange`: log2 fold change between conditions
+- `lfcSE`: standard error of the log2 fold change
+- `stat`: Wald test statistic
+- `pvalue`: raw p-value
+- `padj`: adjusted p-value (FDR-corrected)
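+
+The Benjamini-Hochberg adjustment in step 4 can be sketched in a few lines of NumPy. This is an illustrative helper (the name `benjamini_hochberg` is an assumption), it assumes the input contains no NaN p-values, and it is not the code path PyDESeq2 itself uses.
+
+```python
+import numpy as np
+
+
+def benjamini_hochberg(pvalues):
+    """Return BH-adjusted p-values in the same order as the input."""
+    p = np.asarray(pvalues, dtype=float)
+    n = p.size
+    order = np.argsort(p)
+    # Scale the k-th smallest p-value by n / k
+    scaled = p[order] * n / np.arange(1, n + 1)
+    # Enforce monotonicity, working down from the largest p-value
+    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
+    adjusted = np.empty(n)
+    adjusted[order] = np.clip(scaled, 0.0, 1.0)
+    return adjusted
+
+
+# Hypothetical raw p-values for four genes
+padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
+print(padj)
+```
+
+Genes removed by Cook's filtering or independent filtering would need to be excluded before a helper like this is applied.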
+
+#### `lfc_shrink(coeff=None)`
+Apply shrinkage to log fold changes using the apeGLM method.
+
+**Purpose:** Reduces noise in LFC estimates for better visualization and ranking, especially for genes with low counts or high variability.
+
+**Parameters:**
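+- `coeff`: str or None, name of the coefficient to shrink. If None, the coefficient matching the contrast is used where the installed version supports it; some PyDESeq2 releases require `coeff` to be passed explicitly, so check the signature in your installed version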
+
+**Important:** Shrinkage is applied only for visualization/ranking purposes. The statistical test results (p-values, adjusted p-values) remain unchanged.
+
+**Returns:** None (updates `results_df` with shrunk LFCs)
+
+**Attributes:**
+- `results_df`: pandas DataFrame containing test results (available after `summary()`)
+
+---
+
+## Utility Functions
+
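+### `pydeseq2.utils.load_example_data(modality="raw_counts", dataset="synthetic")`
+
+Load synthetic example datasets for testing and tutorials.
+
+**Parameters:**
+- `modality`: str, either "raw_counts" (count matrix) or "metadata" (sample annotations)
+- `dataset`: str, name of the example dataset (default: "synthetic")
+
+**Returns:** a single pandas DataFrame for the requested modality; call the function once per modality
+- `modality="raw_counts"`: synthetic count data (samples × genes)
+- `modality="metadata"`: sample annotations (samples × variables)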
+
+---
+
+## Preprocessing Module
+
+The `pydeseq2.preprocessing` module provides normalization utilities.
+
+**Common operations:**
+- `deseq2_norm(counts)`: median-of-ratios normalization, returning the normalized counts together with the per-sample size factors
+- Gene filtering and sample filtering are not part of this module; do them with pandas before constructing the `DeseqDataSet`
+
+---
+
+## Inference Classes
+
+### Inference
+Abstract base class defining the interface for DESeq2-related inference methods.
+
+### DefaultInference
+Default implementation of inference methods using scipy, sklearn, and numpy.
+
+**Purpose:** Provides the mathematical implementations for:
+- GLM (Generalized Linear Model) fitting
+- Dispersion estimation
+- Trend curve fitting
+- Statistical testing
+
+---
+
+## Data Structure Requirements
+
+### Count Matrix
+- **Shape:** (samples × genes)
+- **Type:** pandas DataFrame
+- **Values:** Non-negative integers (raw read counts)
+- **Index:** Sample identifiers (must match metadata index)
+- **Columns:** Gene identifiers
+
+### Metadata
+- **Shape:** (samples × variables)
+- **Type:** pandas DataFrame
+- **Index:** Sample identifiers (must match count matrix index)
+- **Columns:** Experimental factors (e.g., "condition", "batch", "group")
+- **Values:** Categorical or continuous variables used in the design formula
+
+### Important Notes
+- Sample order must match between counts and metadata
+- Missing values in metadata should be handled before analysis
+- Gene names should be unique
+- Count files often need transposition: `counts_df = counts_df.T`
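+
+A minimal pandas sketch of these checks, using small made-up `counts_df` and `metadata` frames (all names and values here are hypothetical):
+
+```python
+import pandas as pd
+
+# Hypothetical toy data: counts arrive as genes x samples,
+# and metadata lists the samples in a different order
+counts_df = pd.DataFrame(
+    {"s1": [5, 0], "s2": [3, 7], "s3": [8, 1]},
+    index=["geneA", "geneB"],
+).T  # transpose to samples x genes
+metadata = pd.DataFrame(
+    {"condition": ["treated", "control", "control"]},
+    index=["s3", "s1", "s2"],
+)
+
+assert counts_df.columns.is_unique, "Gene names should be unique"
+assert not metadata["condition"].isna().any(), "Handle missing metadata first"
+
+# Reorder metadata rows so they follow the sample order of the counts
+metadata = metadata.loc[counts_df.index]
+print(counts_df.index.equals(metadata.index))
+```
+
+Reindexing with `.loc` raises a `KeyError` if a sample is missing from the metadata, which surfaces mismatches early.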
+
+---
+
+## Common Workflow Pattern
+
+```python
+from pydeseq2.dds import DeseqDataSet
+from pydeseq2.ds import DeseqStats
+
+# 1. Initialize dataset
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~condition",
+    refit_cooks=True
+)
+
+# 2. Fit dispersions and LFCs
+dds.deseq2()
+
+# 3. Perform statistical testing
+ds = DeseqStats(
+    dds,
+    contrast=["condition", "treated", "control"],
+    alpha=0.05
+)
+ds.summary()
+
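+# 4. Optional: Shrink LFCs for visualization
+# (some PyDESeq2 releases require an explicit `coeff` argument; valid names are
+# the column names of dds.obsm["design_matrix"])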
+ds.lfc_shrink()
+
+# 5. Access results
+results = ds.results_df
+```
+
+---
+
+## Version Compatibility
+
+PyDESeq2 aims to match the default settings of DESeq2 v1.34.0. Some differences may exist as it is a from-scratch reimplementation in Python.
+
+**Tested with:**
+- Python 3.10-3.11
+- anndata 0.8.0+
+- numpy 1.23.0+
+- pandas 1.4.3+
+- scikit-learn 1.1.1+
+- scipy 1.11.0+
--- a/skills/pydeseq2/references/workflow_guide.md
+++ b/skills/pydeseq2/references/workflow_guide.md
@@ -0,0 +1,582 @@
+# PyDESeq2 Workflow Guide
+
+This document provides detailed step-by-step workflows for common PyDESeq2 analysis patterns.
+
+## Table of Contents
+1. [Complete Differential Expression Analysis](#complete-differential-expression-analysis)
+2. [Data Loading and Preparation](#data-loading-and-preparation)
+3. [Single-Factor Analysis](#single-factor-analysis)
+4. [Multi-Factor Analysis](#multi-factor-analysis)
+5. [Result Export and Visualization](#result-export-and-visualization)
+6. [Common Patterns and Best Practices](#common-patterns-and-best-practices)
+7. [Troubleshooting](#troubleshooting)
+
+---
+
+## Complete Differential Expression Analysis
+
+### Overview
+A standard PyDESeq2 analysis consists of 12 main steps across two phases:
+
+**Phase 1: Read Counts Modeling (Steps 1-7)**
+- Normalization and dispersion estimation
+- Log fold-change fitting
+- Outlier detection
+
+**Phase 2: Statistical Analysis (Steps 8-12)**
+- Wald testing
+- Multiple testing correction
+- Optional LFC shrinkage
+
+### Full Workflow Code
+
+```python
+import pandas as pd
+from pydeseq2.dds import DeseqDataSet
+from pydeseq2.ds import DeseqStats
+
+# Load data
+counts_df = pd.read_csv("counts.csv", index_col=0).T  # Transpose if needed
+metadata = pd.read_csv("metadata.csv", index_col=0)
+
+# Filter low-count genes
+genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
+counts_df = counts_df[genes_to_keep]
+
+# Remove samples with missing metadata
+samples_to_keep = ~metadata.condition.isna()
+counts_df = counts_df.loc[samples_to_keep]
+metadata = metadata.loc[samples_to_keep]
+
+# Initialize DeseqDataSet
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~condition",
+    refit_cooks=True
+)
+
+# Run normalization and fitting
+dds.deseq2()
+
+# Perform statistical testing
+ds = DeseqStats(
+    dds,
+    contrast=["condition", "treated", "control"],
+    alpha=0.05,
+    cooks_filter=True,
+    independent_filter=True
+)
+ds.summary()
+
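+# Optional: Apply LFC shrinkage for visualization
+# (some PyDESeq2 releases require an explicit `coeff` argument; valid names are
+# the column names of dds.obsm["design_matrix"])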
+ds.lfc_shrink()
+
+# Access results
+results = ds.results_df
+print(results.head())
+```
+
+---
+
+## Data Loading and Preparation
+
+### Loading CSV Files
+
+Count data typically comes in genes × samples format and must be transposed to the samples × genes layout that PyDESeq2 expects:
+
+```python
+import pandas as pd
+
+# Load count matrix (genes × samples)
+counts_df = pd.read_csv("counts.csv", index_col=0)
+
+# Transpose to samples × genes
+counts_df = counts_df.T
+
+# Load metadata (already in samples × variables format)
+metadata = pd.read_csv("metadata.csv", index_col=0)
+```
+
+### Loading from Other Formats
+
+**From TSV:**
+```python
+counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
+metadata = pd.read_csv("metadata.tsv", sep="\t", index_col=0)
+```
+
+**From saved pickle:**
+```python
+import pickle
+
+with open("counts.pkl", "rb") as f:
+    counts_df = pickle.load(f)
+
+with open("metadata.pkl", "rb") as f:
+    metadata = pickle.load(f)
+```
+
+**From AnnData:**
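+```python
+import anndata as ad
+import pandas as pd
+
+adata = ad.read_h5ad("data.h5ad")
+
+# adata.X may be a scipy sparse matrix; convert it to a dense array first
+X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
+counts_df = pd.DataFrame(
+    X,
+    index=adata.obs_names,
+    columns=adata.var_names
+)
+metadata = adata.obs
+```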
+
+### Data Filtering
+
+**Filter genes with low counts:**
+```python
+# Remove genes with fewer than 10 total reads
+genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
+counts_df = counts_df[genes_to_keep]
+```
+
+**Filter samples with missing metadata:**
+```python
+# Remove samples where 'condition' column is NA
+samples_to_keep = ~metadata.condition.isna()
+counts_df = counts_df.loc[samples_to_keep]
+metadata = metadata.loc[samples_to_keep]
+```
+
+**Filter by multiple criteria:**
+```python
+# Keep only samples that meet all criteria
+mask = (
+    ~metadata.condition.isna() &
+    (metadata.batch.isin(["batch1", "batch2"])) &
+    (metadata.age >= 18)
+)
+counts_df = counts_df.loc[mask]
+metadata = metadata.loc[mask]
+```
+
+### Data Validation
+
+**Check data structure:**
+```python
+print(f"Counts shape: {counts_df.shape}")  # Should be (samples, genes)
+print(f"Metadata shape: {metadata.shape}")  # Should be (samples, variables)
+print(f"Indices match: {counts_df.index.equals(metadata.index)}")
+
+# Check for negative values
+assert (counts_df >= 0).all().all(), "Counts must be non-negative"
+
+# Check for non-integer values
+assert (counts_df % 1 == 0).all().all(), "Counts must be integers"
+```
+
+---
+
+## Single-Factor Analysis
+
+### Simple Two-Group Comparison
+
+Compare treated vs control samples:
+
+```python
+from pydeseq2.dds import DeseqDataSet
+from pydeseq2.ds import DeseqStats
+
+# Design: model expression as a function of condition
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~condition"
+)
+
+dds.deseq2()
+
+# Test treated vs control
+ds = DeseqStats(
+    dds,
+    contrast=["condition", "treated", "control"]
+)
+ds.summary()
+
+# Results
+results = ds.results_df
+significant = results[results.padj < 0.05]
+print(f"Found {len(significant)} significant genes")
+```
+
+### Multiple Pairwise Comparisons
+
+When comparing multiple groups:
+
+```python
+# Test each treatment vs control
+treatments = ["treated_A", "treated_B", "treated_C"]
+all_results = {}
+
+for treatment in treatments:
+    ds = DeseqStats(
+        dds,
+        contrast=["condition", treatment, "control"]
+    )
+    ds.summary()
+    all_results[treatment] = ds.results_df
+
+# Compare results across treatments
+for name, results in all_results.items():
+    sig = results[results.padj < 0.05]
+    print(f"{name}: {len(sig)} significant genes")
+```
+
+---
+
+## Multi-Factor Analysis
+
+### Two-Factor Design
+
+Account for batch effects while testing condition:
+
+```python
+# Design includes both batch and condition
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~batch + condition"
+)
+
+dds.deseq2()
+
+# Test condition effect while controlling for batch
+ds = DeseqStats(
+    dds,
+    contrast=["condition", "treated", "control"]
+)
+ds.summary()
+```
+
+### Interaction Effects
+
+Test whether treatment effect differs between groups:
+
+```python
+# Design includes interaction term
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~group + condition + group:condition"
+)
+
+dds.deseq2()
+
+# Test the interaction term
+ds = DeseqStats(dds, contrast=["group:condition", ...])
+ds.summary()
+```
+
+### Continuous Covariates
+
+Include continuous variables like age:
+
+```python
+# Ensure age is numeric in metadata
+metadata["age"] = pd.to_numeric(metadata["age"])
+
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~age + condition"
+)
+
+dds.deseq2()
+```
+
+---
+
+## Result Export and Visualization
+
+### Saving Results
+
+**Export as CSV:**
+```python
+# Save statistical results
+ds.results_df.to_csv("deseq2_results.csv")
+
+# Save significant genes only
+significant = ds.results_df[ds.results_df.padj < 0.05]
+significant.to_csv("significant_genes.csv")
+
+# Save with sorted results
+sorted_results = ds.results_df.sort_values("padj")
+sorted_results.to_csv("sorted_results.csv")
+```
+
+**Save DeseqDataSet:**
+```python
+import pickle
+
+# Save as AnnData for later use
+with open("dds_result.pkl", "wb") as f:
+    pickle.dump(dds.to_picklable_anndata(), f)
+```
+
+**Load saved results:**
+```python
+import pickle
+import pandas as pd
+
+# Load results
+results = pd.read_csv("deseq2_results.csv", index_col=0)
+
+# Load AnnData
+with open("dds_result.pkl", "rb") as f:
+    adata = pickle.load(f)
+```
+
+### Basic Visualization
+
+**Volcano plot:**
+```python
+import matplotlib.pyplot as plt
+import numpy as np
+
+results = ds.results_df.copy()
+results["-log10(padj)"] = -np.log10(results.padj)
+
+# Plot
+plt.figure(figsize=(10, 6))
+plt.scatter(
+    results.log2FoldChange,
+    results["-log10(padj)"],
+    alpha=0.5,
+    s=10
+)
+plt.axhline(-np.log10(0.05), color='red', linestyle='--', label='padj=0.05')
+plt.axvline(1, color='gray', linestyle='--')
+plt.axvline(-1, color='gray', linestyle='--')
+plt.xlabel("Log2 Fold Change")
+plt.ylabel("-Log10(Adjusted P-value)")
+plt.title("Volcano Plot")
+plt.legend()
+plt.savefig("volcano_plot.png", dpi=300)
+```
+
+**MA plot:**
+```python
+plt.figure(figsize=(10, 6))
+plt.scatter(
+    np.log10(results.baseMean + 1),
+    results.log2FoldChange,
+    alpha=0.5,
+    s=10,
+    c=(results.padj < 0.05),
+    cmap='bwr'
+)
+plt.xlabel("Log10(Base Mean + 1)")
+plt.ylabel("Log2 Fold Change")
+plt.title("MA Plot")
+plt.savefig("ma_plot.png", dpi=300)
+```
+
+---
+
+## Common Patterns and Best Practices
+
+### 1. Data Preprocessing Checklist
+
+Before running PyDESeq2:
+- ✓ Ensure counts are non-negative integers
+- ✓ Verify samples × genes orientation
+- ✓ Check that sample names match between counts and metadata
+- ✓ Remove or handle missing metadata values
+- ✓ Filter low-count genes (typically < 10 total reads)
+- ✓ Verify experimental factors are properly encoded
+
+### 2. Design Formula Best Practices
+
+**Order matters:** Put adjustment variables before the variable of interest
+```python
+# Correct: control for batch, test condition
+design = "~batch + condition"
+
+# Less ideal: condition listed first
+design = "~condition + batch"
+```
+
+**Use categorical for discrete variables:**
+```python
+# Ensure proper data types
+metadata["condition"] = metadata["condition"].astype("category")
+metadata["batch"] = metadata["batch"].astype("category")
+```
+
+### 3. Statistical Testing Guidelines
+
+**Set appropriate alpha:**
+```python
+# Standard significance threshold
+ds = DeseqStats(dds, alpha=0.05)
+
+# More stringent threshold when fewer false positives can be tolerated
+ds = DeseqStats(dds, alpha=0.01)
+```
+
+**Use independent filtering:**
+```python
+# Recommended: filter low-power tests
+ds = DeseqStats(dds, independent_filter=True)
+
+# Only disable if you have specific reasons
+ds = DeseqStats(dds, independent_filter=False)
+```
+
+### 4. LFC Shrinkage
+
+**When to use:**
+- For visualization (volcano plots, heatmaps)
+- For ranking genes by effect size
+- When prioritizing genes for follow-up
+
+**When NOT to use:**
+- For reporting statistical significance (p-values and adjusted p-values come from the Wald test and are not changed by shrinkage)
+- For gene set enrichment analysis (typically uses unshrunken values)
+
+```python
+# Save both versions
+ds.results_df.to_csv("results_unshrunken.csv")
+ds.lfc_shrink()
+ds.results_df.to_csv("results_shrunken.csv")
+```
+
+### 5. Memory Management
+
+For large datasets:
+```python
+# Use parallel processing
+dds = DeseqDataSet(
+    counts=counts_df,
+    metadata=metadata,
+    design="~condition",
+    n_cpus=4  # Adjust based on available cores
+)
+
+# Splitting genes into separately analyzed chunks is a last resort: size factors,
+# the dispersion trend, and multiple-testing correction are estimated across genes
+```
+
+---
+
+## Troubleshooting
+
+### Error: Index mismatch between counts and metadata
+
+**Problem:** Sample names don't match
+```
+KeyError: Sample names in counts and metadata don't match
+```
+
+**Solution:**
+```python
+# Check indices
+print("Counts samples:", counts_df.index.tolist())
+print("Metadata samples:", metadata.index.tolist())
+
+# Align if needed
+common_samples = counts_df.index.intersection(metadata.index)
+counts_df = counts_df.loc[common_samples]
+metadata = metadata.loc[common_samples]
+```
+
+### Error: All genes have zero counts
+
+**Problem:** Data might need transposition
+```
+ValueError: All genes have zero total counts
+```
+
+**Solution:**
+```python
+# Check data orientation
+print(f"Counts shape: {counts_df.shape}")
+
+# If there are more rows than columns, the rows are probably genes: transpose
+if counts_df.shape[1] < counts_df.shape[0]:
+    counts_df = counts_df.T
+```
+
+### Warning: Many genes filtered out
+
+**Problem:** Too many low-count genes removed
+
+**Check:**
+```python
+# See distribution of gene counts
+print(counts_df.sum(axis=0).describe())
+
+# Visualize
+import matplotlib.pyplot as plt
+plt.hist(counts_df.sum(axis=0), bins=50, log=True)
+plt.xlabel("Total counts per gene")
+plt.ylabel("Frequency")
+plt.show()
+```
+
+**Adjust filtering if needed:**
+```python
+# Try lower threshold
+genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 5]
+```
+
+### Error: Design matrix is not full rank
+
+**Problem:** Confounded design (e.g., all treated samples in one batch)
+
+**Solution:**
+```python
+# Check design confounding
+print(pd.crosstab(metadata.condition, metadata.batch))
+
+# A fully confounded variable cannot be separated from the effect of interest,
+# and adding an interaction term does not restore full rank.
+# Drop the confounded variable (or rebalance the experimental design):
+design = "~condition"  # Drop batch
+```
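+
+Rank deficiency can also be checked directly on a small hand-built design matrix. The sketch below uses NumPy only; the sample layout is hypothetical, and the indicator columns are built by hand rather than by PyDESeq2's formula handling.
+
+```python
+import numpy as np
+
+# Hypothetical, fully confounded layout: every treated sample is in batch 2
+condition = np.array([0, 0, 1, 1])  # 0 = control, 1 = treated
+batch = np.array([0, 0, 1, 1])      # 0 = batch1, 1 = batch2
+
+# Columns: intercept, batch indicator, condition indicator ("~batch + condition")
+design_matrix = np.column_stack([np.ones(4), batch, condition])
+rank = np.linalg.matrix_rank(design_matrix)
+is_full_rank = rank == design_matrix.shape[1]
+print(f"rank={rank}, columns={design_matrix.shape[1]}, full rank: {is_full_rank}")
+```
+
+The batch and condition columns are identical here, so the matrix has three columns but only rank two, and the two effects cannot be estimated separately.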
+
+### Issue: No significant genes found
+
+**Possible causes:**
+1. Small effect sizes
+2. High biological variability
+3. Insufficient sample size
+4. Technical issues (batch effects, outliers)
+
+**Diagnostics:**
+```python
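+# Check dispersion estimates
+# (depending on the PyDESeq2 version, 1-D results may be stored in
+# dds.var / dds.obs or in dds.varm / dds.obsm)
+import matplotlib.pyplot as plt
+dispersions = dds.var["dispersions"] if "dispersions" in dds.var else dds.varm["dispersions"]
+plt.hist(dispersions, bins=50)
+plt.xlabel("Dispersion")
+plt.ylabel("Frequency")
+plt.show()
+
+# Check size factors (should be close to 1)
+size_factors = dds.obs["size_factors"] if "size_factors" in dds.obs else dds.obsm["size_factors"]
+print("Size factors:", size_factors)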
+
+# Look at top genes even if not significant
+top_genes = ds.results_df.nsmallest(20, "pvalue")
+print(top_genes)
+```
+
+### Memory errors on large datasets
+
+**Solutions:**
+```python
+# 1. Use fewer CPUs (each parallel worker needs its own memory, so fewer workers can lower peak usage)
+dds = DeseqDataSet(..., n_cpus=1)
+
+# 2. Filter more aggressively
+genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 20]
+
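+# 3. Process gene subsets separately only as a last resort: size factors, the
+#    dispersion trend, and multiple-testing correction are estimated across genes,
+#    so combined per-chunk results will differ from a joint analysis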
+```