Distributed Computing with Arboreto
Arboreto leverages Dask to parallelize computation, enabling efficient GRN inference on anything from a single multi-core machine to a multi-node cluster.
Computation Architecture
GRN inference is inherently parallelizable:
- Each target gene's regression model can be trained independently
- Arboreto represents computation as a Dask task graph
- Tasks are distributed across available computational resources
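To make the decomposition concrete, here is a minimal, illustrative sketch of the one-task-per-gene idea using `dask.delayed`. This is not arboreto's internal code: `fit_one_target` is a toy stand-in for the real per-gene regression.

```python
# Illustrative only: one independent Dask task per target gene.
# fit_one_target is a toy stand-in for arboreto's per-gene regression.
import numpy as np
from dask import delayed, compute

def fit_one_target(target_idx, expression):
    # Toy scoring: rank every other gene by absolute correlation with
    # the target gene (a real run fits a tree ensemble instead).
    scores = np.abs(np.corrcoef(expression.T)[target_idx])
    scores[target_idx] = 0.0
    return target_idx, scores

expression = np.random.rand(100, 20)   # 100 cells x 20 genes
tasks = [delayed(fit_one_target)(i, expression)
         for i in range(expression.shape[1])]
results = compute(*tasks)              # Dask schedules tasks across workers
```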
Local Multi-Core Processing (Default)
By default, arboreto uses all available CPU cores on the local machine:
```python
from arboreto.algo import grnboost2

# Automatically uses all local cores
network = grnboost2(expression_data=expression_matrix, tf_names=tf_names)
```
This is sufficient for most use cases and requires no additional configuration.
Custom Local Dask Client
For fine-grained control over local resources, create a custom Dask client:
```python
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Configure local cluster
    local_cluster = LocalCluster(
        n_workers=10,          # Number of worker processes
        threads_per_worker=1,  # Threads per worker
        memory_limit='8GB'     # Memory limit per worker
    )

    # Create client
    custom_client = Client(local_cluster)

    # Run inference with custom client
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=custom_client
    )

    # Clean up
    custom_client.close()
    local_cluster.close()
```
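Dask's `Client` and `LocalCluster` also work as context managers, which is a convenient way to guarantee cleanup even if inference raises an exception:

```python
# Same setup with context managers: resources are released automatically,
# even when grnboost2 fails partway through.
with LocalCluster(n_workers=10, threads_per_worker=1) as local_cluster, \
        Client(local_cluster) as custom_client:
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=custom_client
    )
```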
Benefits of Custom Client
- Resource control: Limit CPU and memory usage
- Multiple runs: Reuse same client for different parameter sets
- Monitoring: Access Dask dashboard for performance insights
Multiple Inference Runs with Same Client
Reuse a single Dask client for multiple inference runs with different parameters:
```python
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2, genie3

if __name__ == '__main__':
    # Initialize client once
    local_cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(local_cluster)

    # Run multiple inferences with different seeds
    network_seed1 = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client,
        seed=666
    )

    network_seed2 = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client,
        seed=777
    )

    # Different algorithm with the same client
    network_genie3 = genie3(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client
    )

    # Clean up once
    client.close()
    local_cluster.close()
```
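As a sketch of why client reuse pays off, the same client can drive a whole sweep over seeds, e.g. to gauge the stability of inferred edges across runs. The seed values and the added `seed` column are arbitrary choices here, not part of arboreto's API:

```python
# Sweep several seeds with one client and stack the resulting networks;
# the seed values and the 'seed' column name are arbitrary choices.
import pandas as pd

networks = []
for seed in (666, 777, 888):
    net = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client,
        seed=seed
    )
    networks.append(net.assign(seed=seed))

all_runs = pd.concat(networks, ignore_index=True)
```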
Distributed Cluster Computing
For very large datasets, connect to a remote Dask distributed scheduler running on a cluster:
Step 1: Set Up Dask Scheduler (on cluster head node)
```bash
dask-scheduler
# Output: Scheduler at tcp://10.118.224.134:8786
```
Step 2: Start Dask Workers (on cluster compute nodes)
```bash
dask-worker tcp://10.118.224.134:8786
```
Step 3: Connect from Client
```python
from distributed import Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Connect to remote scheduler
    scheduler_address = 'tcp://10.118.224.134:8786'
    cluster_client = Client(scheduler_address)

    # Run inference on cluster
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=cluster_client
    )

    cluster_client.close()
```
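Before launching a long run, it can help to confirm that workers have actually registered with the scheduler; `scheduler_info()` is part of the distributed client API:

```python
# Quick sanity check: which workers has the scheduler registered?
cluster_client = Client('tcp://10.118.224.134:8786')
print(cluster_client)  # summary of workers, cores, and memory
for addr in cluster_client.scheduler_info()['workers']:
    print('worker:', addr)
```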
Cluster Configuration Best Practices
Worker configuration:
```bash
# 4 worker processes per node, 1 thread each, 16GB memory per process
dask-worker tcp://scheduler:8786 \
    --nprocs 4 \
    --nthreads 1 \
    --memory-limit 16GB
```
For large-scale inference:
- Use more workers with moderate memory rather than fewer workers with large memory
- Set `threads_per_worker=1` to avoid GIL contention in scikit-learn
- Monitor memory usage to prevent workers from being killed
Monitoring and Debugging
Dask Dashboard
Access the Dask dashboard for real-time monitoring:
```python
from distributed import Client

client = Client()             # starts a local cluster by default
print(client.dashboard_link)  # e.g. http://localhost:8787/status
```
The dashboard shows:
- Task progress: Number of tasks completed/pending
- Resource usage: CPU, memory per worker
- Task stream: Real-time visualization of computation
- Performance: Bottleneck identification
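For a record that outlives the session, Dask can also capture the dashboard's diagnostics for an entire run as a standalone HTML file via `performance_report` (here `expression_matrix` and `tf_names` are assumed to be defined as in the earlier examples):

```python
# Write the dashboard's diagnostics for this run to a standalone HTML
# file that can be inspected after the computation finishes.
from distributed import Client, performance_report
from arboreto.algo import grnboost2

if __name__ == '__main__':
    client = Client()
    with performance_report(filename='grn_inference_report.html'):
        network = grnboost2(
            expression_data=expression_matrix,
            tf_names=tf_names,
            client_or_address=client
        )
    client.close()
```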
Verbose Output
Enable verbose logging to track inference progress:
```python
network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    verbose=True
)
```
Performance Optimization Tips
1. Data Format
- Use Pandas DataFrame when possible: More efficient than NumPy for Dask operations
- Reduce data size: filter low-variance genes before inference (see the sketch below)
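A minimal sketch of such a filter, assuming `expression_matrix` is a cells-by-genes DataFrame (the 0.1 cutoff is an arbitrary placeholder):

```python
# Keep only genes whose variance across cells exceeds a chosen cutoff;
# 0.1 is an arbitrary placeholder, tune it for your data.
variances = expression_matrix.var(axis=0)
expression_filtered = expression_matrix.loc[:, variances > 0.1]
```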
2. Worker Configuration
- CPU-bound tasks: set `threads_per_worker=1` and increase `n_workers`
- Memory-bound tasks: increase `memory_limit` per worker
3. Cluster Setup
- Network: Ensure high-bandwidth, low-latency network between nodes
- Storage: Use shared filesystem or object storage for large datasets
- Scheduling: Allocate dedicated nodes to avoid resource contention
4. Transcription Factor Filtering
- Limit TF list: Providing specific TF names reduces computation
```python
# Full search: every gene is considered a candidate regulator (slow)
network = grnboost2(expression_data=matrix)

# Filtered search: restrict candidate regulators to known TFs (faster)
network = grnboost2(expression_data=matrix, tf_names=known_tfs)
```
Example: Large-Scale Single-Cell Analysis
Complete workflow for processing single-cell RNA-seq data on a cluster:
```python
from distributed import Client
from arboreto.algo import grnboost2
import pandas as pd

if __name__ == '__main__':
    # Connect to cluster
    client = Client('tcp://cluster-scheduler:8786')

    # Load large single-cell dataset (50,000 cells x 20,000 genes),
    # cells as rows and genes as columns
    # (use index_col=0 if the first column holds cell identifiers)
    expression_data = pd.read_csv('scrnaseq_data.tsv', sep='\t')

    # Load cell-type-specific TFs
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    # Run distributed inference
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        client_or_address=client,
        verbose=True,
        seed=42
    )

    # Save results (TF, target, importance)
    network.to_csv('grn_results.tsv', sep='\t', index=False)

    client.close()
```
This approach enables analysis of datasets that would be impractical on a single machine.