# GRN Inference Algorithms

Arboreto provides two algorithms for gene regulatory network (GRN) inference, both based on the multiple regression approach.

## Algorithm Overview

Both algorithms follow the same inference strategy:

1. For each target gene in the dataset, train a regression model
2. Identify the most important features (potential regulators) from the model
3. Emit these features as candidate regulators with importance scores

The key differences are **computational efficiency** and the underlying regression method; the sketch below illustrates the strategy they share.
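The following is a minimal, self-contained sketch of that shared strategy written with plain scikit-learn. It is not Arboreto's actual implementation (which distributes this work with Dask and uses tuned regressor settings); the helper name `infer_links` and the choice of regressor are illustrative only.

```python
# Conceptual sketch only: one regression model per target gene, with the
# model's feature importances used as candidate-regulator scores.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def infer_links(expression_df, tf_names):
    """expression_df: observations x genes DataFrame; tf_names: candidate regulators."""
    links = []
    for target in expression_df.columns:
        regulators = [tf for tf in tf_names if tf != target]
        X = expression_df[regulators].values   # regulator expression (features)
        y = expression_df[target].values       # target gene expression (response)
        model = GradientBoostingRegressor().fit(X, y)
        for tf, importance in zip(regulators, model.feature_importances_):
            links.append((tf, target, importance))  # TF-target-importance triplet
    return pd.DataFrame(links, columns=['TF', 'target', 'importance'])
```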
## GRNBoost2 (Recommended)

**Purpose**: Fast GRN inference for large-scale datasets using gradient boosting.

### When to Use

- **Large datasets**: Tens of thousands of observations (e.g., single-cell RNA-seq)
- **Time-constrained analysis**: Need faster results than GENIE3
- **Default choice**: GRNBoost2 is the flagship algorithm and recommended for most use cases

### Technical Details

- **Method**: Stochastic gradient boosting with early-stopping regularization
- **Performance**: Significantly faster than GENIE3 on large datasets
- **Output**: Same format as GENIE3 (TF-target-importance triplets)

### Usage

```python
from arboreto.algo import grnboost2

# expression_matrix: pandas DataFrame of expression values
#   (observations as rows, genes as columns)
# tf_names: list of candidate regulator (transcription factor) gene names
network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42  # For reproducibility
)
```
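The call returns a pandas DataFrame of TF-target-importance triplets, so it can be inspected and saved with ordinary pandas calls. A short sketch follows; the exact column names (`TF`, `target`, `importance`), the output filename, and the cutoff of 1,000 links are assumptions for illustration.

```python
# Inspect the inferred regulatory links.
print(network.head())

# Keep the strongest links and write them to disk (cutoff and path are illustrative).
top_links = network.sort_values('importance', ascending=False).head(1000)
top_links.to_csv('network.tsv', sep='\t', index=False)
```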
### Parameters

```python
grnboost2(
    expression_data,             # Required: pandas DataFrame or numpy array
    gene_names=None,             # Required for numpy arrays
    tf_names='all',              # List of TF names or 'all'
    verbose=False,               # Print progress messages
    client_or_address='local',   # Dask client or scheduler address
    seed=None                    # Random seed for reproducibility
)
```
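Because `client_or_address` also accepts a Dask distributed client, the computation can run on an explicitly configured cluster instead of the default `'local'` scheduler. A minimal sketch, assuming the `dask.distributed` package is installed and reusing the placeholder inputs from the Usage example above; the worker and thread counts are illustrative.

```python
from dask.distributed import Client, LocalCluster
from arboreto.algo import grnboost2

# Start an explicit local Dask cluster (settings are illustrative).
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    client_or_address=client,  # reuse the custom client
    seed=42
)

client.close()
cluster.close()
```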
## GENIE3

**Purpose**: Classic Random Forest-based GRN inference, serving as the conceptual blueprint for GRNBoost2.

### When to Use

- **Smaller datasets**: When dataset size allows for longer computation
- **Comparison studies**: When comparing with published GENIE3 results
- **Validation**: To validate GRNBoost2 results

### Technical Details

- **Method**: Random Forest or ExtraTrees regression
- **Foundation**: Original multiple regression GRN inference strategy
- **Trade-off**: More computationally expensive but well-established

### Usage

```python
from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42
)
```

### Parameters

```python
genie3(
    expression_data,             # Required: pandas DataFrame or numpy array
    gene_names=None,             # Required for numpy arrays
    tf_names='all',              # List of TF names or 'all'
    verbose=False,               # Print progress messages
    client_or_address='local',   # Dask client or scheduler address
    seed=None                    # Random seed for reproducibility
)
```
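Since `gene_names` is required when the expression data is a numpy array rather than a DataFrame, a call with a raw matrix looks roughly like the sketch below. The random matrix, the generated gene names, and the rows-as-observations orientation are illustrative assumptions.

```python
import numpy as np
from arboreto.algo import genie3

# Illustrative data only: rows are observations, columns are genes.
expression_array = np.random.rand(200, 50)
gene_names = [f'gene_{i}' for i in range(50)]
tf_names = gene_names[:10]  # illustrative subset of candidate regulators

network = genie3(
    expression_data=expression_array,
    gene_names=gene_names,  # required when passing a numpy array
    tf_names=tf_names,
    seed=42
)
```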
## Algorithm Comparison

| Feature | GRNBoost2 | GENIE3 |
|---------|-----------|--------|
| **Speed** | Fast (optimized for large data) | Slower |
| **Method** | Gradient boosting | Random Forest |
| **Best for** | Large-scale data (10k+ observations) | Small-medium datasets |
| **Output format** | Same | Same |
| **Inference strategy** | Multiple regression | Multiple regression |
| **Recommended** | Yes (default choice) | For comparison/validation |
## Advanced: Custom Regressor Parameters

Advanced users can pass custom scikit-learn regressor parameters:

```python
from arboreto.algo import grnboost2, genie3

# Custom GRNBoost2 parameters ('GBM' selects scikit-learn gradient boosting)
custom_grnboost2 = grnboost2(
    expression_data=expression_matrix,
    regressor_type='GBM',
    regressor_kwargs={
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1
    }
)

# Custom GENIE3 parameters ('RF' selects scikit-learn Random Forest)
custom_genie3 = genie3(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        'n_estimators': 1000,
        'max_features': 'sqrt'
    }
)
```
## Choosing the Right Algorithm

**Decision guide**:

1. **Start with GRNBoost2** - It's faster and handles large datasets better
2. **Use GENIE3 if**:
   - You are comparing with existing GENIE3 publications
   - Your dataset is small to medium sized
   - You are validating GRNBoost2 results

Both algorithms produce comparable regulatory networks with the same output format, making them interchangeable for most analyses.
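One quick way to exploit that interchangeability is to run both algorithms and check how many top-ranked edges they share. In this hedged sketch, `grnboost2_network` and `genie3_network` stand for the two result DataFrames, and the cutoff of 1,000 edges is illustrative.

```python
# Compare the top-ranked edges from the two algorithms (cutoff is illustrative).
top_grnboost2 = grnboost2_network.nlargest(1000, 'importance')[['TF', 'target']]
top_genie3 = genie3_network.nlargest(1000, 'importance')[['TF', 'target']]

shared = top_grnboost2.merge(top_genie3, on=['TF', 'target'])
print(f'{len(shared)} of the top 1000 edges are shared between GRNBoost2 and GENIE3')
```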