Initial commit
This commit is contained in:
151
skills/arboreto/references/basic_inference.md
Normal file
151
skills/arboreto/references/basic_inference.md
Normal file
@@ -0,0 +1,151 @@
|
||||
# Basic GRN Inference with Arboreto
|
||||
|
||||
## Input Data Requirements
|
||||
|
||||
Arboreto requires gene expression data in one of two formats:
|
||||
|
||||
### Pandas DataFrame (Recommended)
|
||||
- **Rows**: Observations (cells, samples, conditions)
|
||||
- **Columns**: Genes (with gene names as column headers)
|
||||
- **Format**: Numeric expression values
|
||||
|
||||
Example:
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Load expression matrix with genes as columns
|
||||
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
|
||||
# Columns: ['gene1', 'gene2', 'gene3', ...]
|
||||
# Rows: observation data
|
||||
```
|
||||
|
||||
### NumPy Array
|
||||
- **Shape**: (observations, genes)
|
||||
- **Requirement**: Separately provide gene names list matching column order
|
||||
|
||||
Example:
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
expression_matrix = np.genfromtxt('expression_data.tsv', delimiter='\t', skip_header=1)
|
||||
with open('expression_data.tsv') as f:
|
||||
gene_names = [gene.strip() for gene in f.readline().split('\t')]
|
||||
|
||||
assert expression_matrix.shape[1] == len(gene_names)
|
||||
```
|
||||
|
||||
## Transcription Factors (TFs)
|
||||
|
||||
Optionally provide a list of transcription factor names to restrict regulatory inference:
|
||||
|
||||
```python
|
||||
from arboreto.utils import load_tf_names
|
||||
|
||||
# Load from file (one TF per line)
|
||||
tf_names = load_tf_names('transcription_factors.txt')
|
||||
|
||||
# Or define directly
|
||||
tf_names = ['TF1', 'TF2', 'TF3']
|
||||
```
|
||||
|
||||
If not provided, all genes are considered potential regulators.
|
||||
|
||||
## Basic Inference Workflow
|
||||
|
||||
### Using Pandas DataFrame
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from arboreto.utils import load_tf_names
|
||||
from arboreto.algo import grnboost2
|
||||
|
||||
if __name__ == '__main__':
|
||||
# Load expression data
|
||||
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
|
||||
|
||||
# Load transcription factors (optional)
|
||||
tf_names = load_tf_names('tf_list.txt')
|
||||
|
||||
# Run GRN inference
|
||||
network = grnboost2(
|
||||
expression_data=expression_matrix,
|
||||
tf_names=tf_names # Optional
|
||||
)
|
||||
|
||||
# Save results
|
||||
network.to_csv('network_output.tsv', sep='\t', index=False, header=False)
|
||||
```
|
||||
|
||||
**Critical**: The `if __name__ == '__main__':` guard is required because Dask spawns new processes internally.
|
||||
|
||||
### Using NumPy Array
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from arboreto.algo import grnboost2
|
||||
|
||||
if __name__ == '__main__':
|
||||
# Load expression matrix
|
||||
expression_matrix = np.genfromtxt('expression_data.tsv', delimiter='\t', skip_header=1)
|
||||
|
||||
# Extract gene names from header
|
||||
with open('expression_data.tsv') as f:
|
||||
gene_names = [gene.strip() for gene in f.readline().split('\t')]
|
||||
|
||||
# Verify dimensions match
|
||||
assert expression_matrix.shape[1] == len(gene_names)
|
||||
|
||||
# Run inference with explicit gene names
|
||||
network = grnboost2(
|
||||
expression_data=expression_matrix,
|
||||
gene_names=gene_names,
|
||||
tf_names=tf_names
|
||||
)
|
||||
|
||||
network.to_csv('network_output.tsv', sep='\t', index=False, header=False)
|
||||
```
|
||||
|
||||
## Output Format
|
||||
|
||||
Arboreto returns a Pandas DataFrame with three columns:
|
||||
|
||||
| Column | Description |
|
||||
|--------|-------------|
|
||||
| `TF` | Transcription factor (regulator) gene name |
|
||||
| `target` | Target gene name |
|
||||
| `importance` | Regulatory importance score (higher = stronger regulation) |
|
||||
|
||||
Example output:
|
||||
```
|
||||
TF1 gene5 0.856
|
||||
TF2 gene12 0.743
|
||||
TF1 gene8 0.621
|
||||
```
|
||||
|
||||
## Setting Random Seed
|
||||
|
||||
For reproducible results, provide a seed parameter:
|
||||
|
||||
```python
|
||||
network = grnboost2(
|
||||
expression_data=expression_matrix,
|
||||
tf_names=tf_names,
|
||||
seed=777
|
||||
)
|
||||
```
|
||||
|
||||
## Algorithm Selection
|
||||
|
||||
Use `grnboost2()` for most cases (faster, handles large datasets):
|
||||
```python
|
||||
from arboreto.algo import grnboost2
|
||||
network = grnboost2(expression_data=expression_matrix)
|
||||
```
|
||||
|
||||
Use `genie3()` for comparison or specific requirements:
|
||||
```python
|
||||
from arboreto.algo import genie3
|
||||
network = genie3(expression_data=expression_matrix)
|
||||
```
|
||||
|
||||
See `references/algorithms.md` for detailed algorithm comparison.
|
||||
Reference in New Issue
Block a user