GRN Inference Algorithms
Arboreto provides two algorithms for gene regulatory network (GRN) inference, GRNBoost2 and GENIE3. Both use the same multiple regression approach: one regression model is trained per target gene.
Algorithm Overview
Both algorithms follow the same inference strategy:
- For each target gene in the dataset, train a regression model
- Identify the most important features (potential regulators) from the model
- Emit these features as candidate regulators with importance scores
The two algorithms differ in the underlying regression method and, as a consequence, in computational cost.
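The per-target strategy above can be sketched with scikit-learn. This is a toy illustration under stated assumptions, not Arboreto's implementation: the function name `infer_network_sketch`, the synthetic data, and the choice of a 50-tree `RandomForestRegressor` are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def infer_network_sketch(expression, tf_names, n_estimators=50, seed=0):
    """Toy per-target regression loop (hypothetical helper, not Arboreto code)."""
    rows = []
    for target in expression.columns:
        # Step 1: train one model per target; a gene never regulates itself here.
        regulators = [tf for tf in tf_names if tf != target]
        if not regulators:
            continue
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(expression[regulators].to_numpy(), expression[target].to_numpy())
        # Steps 2 and 3: read feature importances and emit them as scored links.
        for tf, importance in zip(regulators, model.feature_importances_):
            if importance > 0:
                rows.append((tf, target, float(importance)))
    network = pd.DataFrame(rows, columns=["TF", "target", "importance"])
    return network.sort_values("importance", ascending=False).reset_index(drop=True)


# Synthetic data: gene "g3" is driven by "g0"; everything else is noise.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 4)), columns=["g0", "g1", "g2", "g3"])
data["g3"] = 2.0 * data["g0"] + 0.1 * rng.normal(size=200)

toy_network = infer_network_sketch(data, tf_names=["g0", "g1", "g2"])
print(toy_network.head())
```

In this toy setup the link from g0 to g3 should rank first among the candidate regulators of g3, because g3 was constructed from g0.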
GRNBoost2 (Recommended)
Purpose: Fast GRN inference for large-scale datasets using gradient boosting.
When to Use
- Large datasets: Tens of thousands of observations (e.g., single-cell RNA-seq)
- Time-constrained analysis: Need faster results than GENIE3
- Default choice: GRNBoost2 is the flagship algorithm and recommended for most use cases
Technical Details
- Method: Stochastic gradient boosting with early-stopping regularization
- Performance: Significantly faster than GENIE3 on large datasets
- Output: Same format as GENIE3 (TF-target-importance triplets)
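One way to picture stochastic gradient boosting with early stopping is scikit-learn's `monitor` callback, which can halt training once out-of-bag improvement stalls. The sketch below is an assumption-laden illustration: the `EarlyStopMonitor` class, the window length of 25, and every hyperparameter value are chosen for the example and should not be read as Arboreto's actual settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


class EarlyStopMonitor:
    """Stop boosting once the mean out-of-bag improvement over a window turns negative."""

    def __init__(self, window_length=25):
        self.window_length = window_length

    def __call__(self, current_round, regressor, _local_vars):
        # Wait until a full window of boosting rounds is available.
        if current_round < self.window_length - 1:
            return False
        lower = current_round - self.window_length + 1
        window = regressor.oob_improvement_[lower:current_round + 1]
        return bool(window.mean() < 0.0)


# Synthetic regression problem: only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.5 * rng.normal(size=300)

model = GradientBoostingRegressor(
    n_estimators=500,    # upper bound on the number of boosting rounds
    subsample=0.9,       # subsample < 1.0 makes out-of-bag estimates available
    learning_rate=0.05,
    random_state=0,
)
model.fit(X, y, monitor=EarlyStopMonitor(window_length=25))

rounds_used = model.estimators_.shape[0]
print("boosting rounds used:", rounds_used)
```

Because training stops as soon as additional trees no longer help on held-out samples, the number of rounds adapts to each target gene instead of being fixed in advance.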
Usage
from arboreto.algo import grnboost2

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,  # For reproducibility
)
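Once a network has been inferred, the TF-target-importance triplets can be filtered with ordinary pandas operations. The sketch below uses a small hand-written table in place of a real result; the column names `TF`, `target`, and `importance`, as well as all values, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical result in TF-target-importance layout (values are made up).
network = pd.DataFrame(
    {
        "TF": ["TF1", "TF2", "TF1", "TF3", "TF2"],
        "target": ["G1", "G1", "G2", "G2", "G3"],
        "importance": [12.5, 0.8, 7.1, 3.3, 0.2],
    }
)

# Keep the three strongest links overall.
top_links = network.nlargest(3, "importance")

# Keep the single best-scoring regulator for each target gene.
best_per_target = (
    network.sort_values("importance", ascending=False)
    .groupby("target")
    .head(1)
    .sort_values("target")
    .reset_index(drop=True)
)

print(top_links)
print(best_per_target)
```

Thresholding by rank, as done here, avoids committing to an absolute importance cutoff.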
Parameters
grnboost2(
    expression_data,            # Required: pandas DataFrame or numpy array
    gene_names=None,            # Required for numpy arrays
    tf_names='all',             # List of TF names or 'all'
    verbose=False,              # Print progress messages
    client_or_address='local',  # Dask client or scheduler address
    seed=None                   # Random seed for reproducibility
)
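Preparing the inputs is mostly a matter of keeping the matrix and the gene names consistent. The sketch below assumes that rows are observations and columns are genes; `as_expression_frame` is a hypothetical helper written for this example, not part of Arboreto, and the data are random.

```python
import numpy as np
import pandas as pd


def as_expression_frame(matrix, gene_names):
    """Wrap an (observations x genes) array in a DataFrame with one column per gene."""
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2:
        raise ValueError("expected a 2-D matrix")
    if matrix.shape[1] != len(gene_names):
        raise ValueError(
            f"matrix has {matrix.shape[1]} columns but "
            f"{len(gene_names)} gene names were given"
        )
    if len(set(gene_names)) != len(gene_names):
        raise ValueError("gene names must be unique")
    return pd.DataFrame(matrix, columns=list(gene_names))


rng = np.random.default_rng(0)
raw = rng.poisson(2.0, size=(8, 5))  # 8 observations, 5 genes
gene_names = ["TF1", "TF2", "G1", "G2", "G3"]
expression_matrix = as_expression_frame(raw, gene_names)

# Keep only candidate TFs that are actually present in the matrix.
candidate_tfs = ["TF1", "TF2", "TF_missing"]
tf_names = [tf for tf in candidate_tfs if tf in expression_matrix.columns]

print(expression_matrix.shape)
print(tf_names)
```

Checking the column count up front catches a transposed matrix or a stale gene list before a long inference run starts.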
GENIE3
Purpose: Classic Random Forest-based GRN inference, serving as the conceptual blueprint for GRNBoost2.
When to Use
- Smaller datasets: When dataset size allows for longer computation
- Comparison studies: When comparing with published GENIE3 results
- Validation: To validate GRNBoost2 results
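For comparison or validation, one simple check is how much the top-ranked links of two networks overlap. The sketch below compares edge sets by rank rather than by raw score, on the assumption that importance values from different regressors need not share a scale. The helper names and both toy tables are hypothetical.

```python
import pandas as pd


def top_edges(network, k):
    """Return the k highest-scoring (TF, target) pairs as a set."""
    top = network.nlargest(k, "importance")
    return set(zip(top["TF"], top["target"]))


def edge_overlap(network_a, network_b, k):
    """Jaccard index between the top-k edge sets of two networks."""
    edges_a = top_edges(network_a, k)
    edges_b = top_edges(network_b, k)
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


# Made-up results standing in for a GRNBoost2 run and a GENIE3 run.
grnboost2_net = pd.DataFrame(
    {
        "TF": ["TF1", "TF1", "TF2", "TF3"],
        "target": ["G1", "G2", "G1", "G2"],
        "importance": [9.0, 6.0, 2.0, 1.0],
    }
)
genie3_net = pd.DataFrame(
    {
        "TF": ["TF1", "TF2", "TF1", "TF3"],
        "target": ["G1", "G1", "G2", "G3"],
        "importance": [0.5, 0.3, 0.1, 0.05],
    }
)

overlap = edge_overlap(grnboost2_net, genie3_net, k=2)
print(f"top-2 overlap: {overlap:.2f}")
```

A value near 1 means both methods agree on the strongest links; the choice of k is a judgment call that depends on how many links the downstream analysis will use.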
Technical Details
- Method: Random Forest or ExtraTrees regression
- Foundation: Original multiple regression GRN inference strategy
- Trade-off: More computationally expensive but well-established
Usage
from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,  # For reproducibility
)
Parameters
genie3(
    expression_data,            # Required: pandas DataFrame or numpy array
    gene_names=None,            # Required for numpy arrays
    tf_names='all',             # List of TF names or 'all'
    verbose=False,              # Print progress messages
    client_or_address='local',  # Dask client or scheduler address
    seed=None                   # Random seed for reproducibility
)
Algorithm Comparison
| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Speed | Fast (optimized for large data) | Slower |
| Method | Gradient boosting | Random Forest |
| Best for | Large-scale data (10k+ observations) | Small-medium datasets |
| Output format | Same | Same |
| Inference strategy | Multiple regression | Multiple regression |
| Recommended | Yes (default choice) | For comparison/validation |
Advanced: Custom Regressor Parameters
For advanced users, custom scikit-learn regressor parameters can be passed through the diy function in arboreto.algo. The grnboost2 and genie3 functions are thin wrappers that call it with preset values and do not themselves accept regressor_type or regressor_kwargs:
from arboreto.algo import diy

# Gradient boosting with custom parameters (GRNBoost2-style)
custom_grnboost2 = diy(
    expression_data=expression_matrix,
    regressor_type='GBM',
    regressor_kwargs={
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1,
    },
)

# Random Forest with custom parameters (GENIE3-style)
custom_genie3 = diy(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        'n_estimators': 1000,
        'max_features': 'sqrt',
    },
)
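Since regressor_kwargs is forwarded to a scikit-learn estimator, a misspelled parameter name may only surface after the distributed run has started. A cheap pre-flight check is to instantiate the estimator locally first. The sketch below assumes that 'GBM', 'RF', and 'ET' correspond to the scikit-learn classes shown; `check_regressor_kwargs` is a hypothetical helper, not an Arboreto function.

```python
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)

# Assumed mapping from regressor_type strings to scikit-learn classes.
REGRESSOR_CLASSES = {
    "GBM": GradientBoostingRegressor,
    "RF": RandomForestRegressor,
    "ET": ExtraTreesRegressor,
}


def check_regressor_kwargs(regressor_type, regressor_kwargs):
    """Instantiate the regressor once so unknown parameter names fail early."""
    try:
        regressor_class = REGRESSOR_CLASSES[regressor_type.upper()]
    except KeyError:
        raise ValueError(f"unknown regressor_type: {regressor_type!r}") from None
    # The constructor raises TypeError for parameter names it does not know.
    regressor_class(**regressor_kwargs)
    return regressor_class.__name__


gbm_name = check_regressor_kwargs(
    "GBM", {"n_estimators": 100, "max_depth": 5, "learning_rate": 0.1}
)
rf_name = check_regressor_kwargs(
    "RF", {"n_estimators": 1000, "max_features": "sqrt"}
)
print(gbm_name, rf_name)
```

This check only validates parameter names. Invalid values are reported by scikit-learn when the model is fitted.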
Choosing the Right Algorithm
Decision guide:
- Start with GRNBoost2: it is faster and handles large datasets better
- Use GENIE3 if:
  - Comparing with existing GENIE3 publications
  - Dataset is small-medium sized
  - Validating GRNBoost2 results
Both algorithms return networks in the same output format, so downstream analysis code works with either. The resulting networks are comparable but not necessarily identical, since the underlying regression methods differ.