
GRN Inference Algorithms

Arboreto provides two algorithms for gene regulatory network (GRN) inference, both based on the multiple regression approach.

Algorithm Overview

Both algorithms follow the same inference strategy:

  1. For each target gene in the dataset, train a regression model
  2. Identify the most important features (potential regulators) from the model
  3. Emit these features as candidate regulators with importance scores
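The three steps above can be sketched with scikit-learn directly. This is a minimal illustration of the shared multiple-regression strategy, not arboreto's actual implementation (arboreto parallelizes this loop with Dask and restricts candidate regulators to the supplied TF list):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
genes = ["g1", "g2", "g3", "g4"]
# Toy expression matrix: observations (rows) x genes (columns)
expr = pd.DataFrame(rng.normal(size=(50, 4)), columns=genes)

links = []
for target in genes:
    # Step 1: regress each target gene on all other genes
    regulators = [g for g in genes if g != target]
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(expr[regulators], expr[target])
    # Steps 2-3: emit (regulator, target, importance) triplets
    for tf, importance in zip(regulators, model.feature_importances_):
        links.append((tf, target, importance))

network = pd.DataFrame(links, columns=["TF", "target", "importance"])
```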

The key difference lies in the underlying regression method and, consequently, in computational efficiency.

GRNBoost2

Purpose: Fast GRN inference for large-scale datasets using gradient boosting.

When to Use

  • Large datasets: Tens of thousands of observations (e.g., single-cell RNA-seq)
  • Time-constrained analysis: Need faster results than GENIE3
  • Default choice: GRNBoost2 is the flagship algorithm and recommended for most use cases

Technical Details

  • Method: Stochastic gradient boosting with early-stopping regularization
  • Performance: Significantly faster than GENIE3 on large datasets
  • Output: Same format as GENIE3 (TF-target-importance triplets)

Usage

from arboreto.algo import grnboost2

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42  # For reproducibility
)

Parameters

grnboost2(
    expression_data,           # Required: pandas DataFrame or numpy array
    gene_names=None,           # Required for numpy arrays
    tf_names='all',            # List of TF names or 'all'
    verbose=False,             # Print progress messages
    client_or_address='local', # Dask client or scheduler address
    seed=None                  # Random seed for reproducibility
)
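The gene_names parameter is only needed for numpy input, because a pandas DataFrame already carries column labels. A rough sketch of the assumed pairing (pandas only; an illustration of intent, not arboreto internals):

```python
import numpy as np
import pandas as pd

# A numpy expression matrix has no labels, so gene_names supplies them.
matrix = np.random.default_rng(0).normal(size=(10, 3))  # observations x genes
gene_names = ['TF1', 'TF2', 'geneA']

# Conceptually equivalent to passing a labeled DataFrame:
expression_df = pd.DataFrame(matrix, columns=gene_names)
```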

GENIE3

Purpose: Classic Random Forest-based GRN inference, serving as the conceptual blueprint for GRNBoost2.

When to Use

  • Smaller datasets: When dataset size allows for longer computation
  • Comparison studies: When comparing with published GENIE3 results
  • Validation: To validate GRNBoost2 results

Technical Details

  • Method: Random Forest or ExtraTrees regression
  • Foundation: Original multiple regression GRN inference strategy
  • Trade-off: More computationally expensive but well-established

Usage

from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42
)

Parameters

genie3(
    expression_data,           # Required: pandas DataFrame or numpy array
    gene_names=None,           # Required for numpy arrays
    tf_names='all',            # List of TF names or 'all'
    verbose=False,             # Print progress messages
    client_or_address='local', # Dask client or scheduler address
    seed=None                  # Random seed for reproducibility
)

Algorithm Comparison

Feature              GRNBoost2                             GENIE3
Speed                Fast (optimized for large data)       Slower
Method               Gradient boosting                     Random Forest
Best for             Large-scale data (10k+ observations)  Small-medium datasets
Output format        Same                                  Same
Inference strategy   Multiple regression                   Multiple regression
Recommended          Yes (default choice)                  For comparison/validation
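Because both algorithms share the output format, downstream handling is identical. The snippet below mocks a result DataFrame (the TF/target/importance columns match arboreto's documented output; the values are made up) to show typical post-processing:

```python
import pandas as pd

# Mocked result standing in for grnboost2()/genie3() output:
# a DataFrame of (TF, target, importance) triplets.
network = pd.DataFrame(
    [('TF1', 'geneA', 9.1), ('TF2', 'geneA', 1.2), ('TF1', 'geneB', 4.7)],
    columns=['TF', 'target', 'importance'],
)

# Rank candidate regulatory links by importance score.
top_links = network.sort_values('importance', ascending=False)
```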

Advanced: Custom Regressor Parameters

Advanced users can pass custom parameters to the underlying scikit-learn regressors via regressor_type and regressor_kwargs:

from arboreto.algo import grnboost2, genie3

# Custom GRNBoost2 parameters
custom_grnboost2 = grnboost2(
    expression_data=expression_matrix,
    regressor_type='GBM',
    regressor_kwargs={
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1
    }
)

# Custom GENIE3 parameters
custom_genie3 = genie3(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        'n_estimators': 1000,
        'max_features': 'sqrt'
    }
)
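The regressor_kwargs dictionaries are forwarded to the underlying scikit-learn estimators. To see their effect in isolation, here is one standalone per-target model configured with the same GBM settings on synthetic data (an illustration of the parameters, not arboreto code):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One per-target model, configured like the regressor_kwargs above.
model = GradientBoostingRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))                         # candidate regulators
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)   # target driven by regulator 0
model.fit(X, y)

# One importance score per candidate regulator; regulator 0 should dominate.
importances = model.feature_importances_
```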

Choosing the Right Algorithm

Decision guide:

  1. Start with GRNBoost2: it's faster and handles large datasets better
  2. Use GENIE3 if:
    • Comparing with existing GENIE3 publications
    • Dataset is small-medium sized
    • Validating GRNBoost2 results

Both algorithms produce comparable regulatory networks with the same output format, making them interchangeable for most analyses.