Dask Schedulers

Overview

Dask provides multiple task schedulers, each suited to different workloads. The scheduler determines how tasks are executed: sequentially, in parallel threads, in parallel processes, or distributed across a cluster.

Scheduler Types

Single-Machine Schedulers

1. Local Threads (Default)

Description: The threaded scheduler executes computations with a local concurrent.futures.ThreadPoolExecutor.

When to Use:

  • Numeric computations in NumPy, Pandas, scikit-learn
  • Libraries that release the GIL (Global Interpreter Lock)
  • Operations that benefit from shared memory access
  • Default for Dask Arrays and DataFrames

Characteristics:

  • Low overhead
  • Shared memory between threads
  • Best for GIL-releasing operations
  • Poor for pure Python code (GIL contention)

Example:

import dask.array as da

# Uses threads by default
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = x.mean().compute()  # Computed with threads

Explicit Configuration:

import dask

# Set globally
dask.config.set(scheduler='threads')

# Or per-compute
result = x.mean().compute(scheduler='threads')

2. Local Processes

Description: Multiprocessing scheduler that uses concurrent.futures.ProcessPoolExecutor.

When to Use:

  • Pure Python code with GIL contention
  • Text processing and Python collections
  • Operations that benefit from process isolation
  • CPU-bound Python code

Characteristics:

  • Bypasses GIL limitations
  • Incurs data transfer costs between processes
  • Higher overhead than threads
  • Ideal for linear workflows with small inputs/outputs

Example:

import dask.bag as db

# Good for Python object processing
bag = db.read_text('data/*.txt')
result = bag.map(complex_python_function).compute(scheduler='processes')

Explicit Configuration:

import dask

# Set globally
dask.config.set(scheduler='processes')

# Or per-compute
result = computation.compute(scheduler='processes')

Limitations:

  • Data must be serializable (pickle)
  • Overhead from process creation
  • Memory overhead from data copying
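
The pickle limitation is easy to demonstrate in isolation; a minimal sketch (plain Python, no Dask required) showing the kind of object that cannot cross a process boundary:

import pickle
import threading

lock = threading.Lock()

try:
    pickle.dumps(lock)
except TypeError as exc:
    # Thread locks, open files, and sockets cannot be pickled, so they
    # cannot be sent to or returned from worker processes
    print(f'Cannot pickle: {exc}')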

3. Single Thread (Synchronous)

Description: The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all.

When to Use:

  • Debugging with pdb
  • Profiling with standard Python tools
  • Understanding errors in detail
  • Development and testing

Characteristics:

  • No parallelism
  • Easy debugging
  • Minimal per-task overhead
  • Deterministic execution

Example:

import dask

# Enable for debugging
dask.config.set(scheduler='synchronous')

# Now can use pdb
result = computation.compute()  # Runs in single thread

Debugging with IPython:

# In IPython/Jupyter
%pdb on

import dask

dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()  # Drops into debugger on error

Distributed Schedulers

4. Local Distributed

Description: Despite its name, this scheduler runs effectively on personal machines using the distributed scheduler infrastructure.

When to Use:

  • Need diagnostic dashboard
  • Asynchronous APIs
  • Better data locality handling than multiprocessing
  • Development before scaling to cluster
  • Want distributed features on single machine

Characteristics:

  • Provides dashboard for monitoring
  • Better memory management
  • More overhead than threads/processes
  • Can scale to cluster later

Example:

from dask.distributed import Client
import dask.dataframe as dd

# Create local cluster
client = Client()  # Automatically uses all cores

# Use distributed scheduler
ddf = dd.read_csv('data.csv')
result = ddf.groupby('category').mean().compute()

# View dashboard
print(client.dashboard_link)

# Clean up
client.close()

Configuration Options:

# Control resources
client = Client(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB'
)
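
Client() builds a LocalCluster under the hood; the same options can be passed to LocalCluster explicitly. A minimal sketch of the equivalent two-step form:

from dask.distributed import Client, LocalCluster

# Build the cluster first, then connect a client to it
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB'   # per worker
)
client = Client(cluster)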

5. Cluster Distributed

Description: For scaling across multiple machines using the distributed scheduler.

When to Use:

  • Data exceeds single machine capacity
  • Need computational power beyond one machine
  • Production deployments
  • Cluster computing environments (HPC, cloud)

Characteristics:

  • Scales to hundreds of machines
  • Requires cluster setup
  • Network communication overhead
  • Advanced features (adaptive scaling, task prioritization)

Example with Dask-Jobqueue (HPC):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Create cluster on HPC with SLURM
cluster = SLURMCluster(
    cores=24,
    memory='100GB',
    walltime='02:00:00',
    queue='regular'
)

# Scale to 10 jobs
cluster.scale(jobs=10)

# Connect client
client = Client(cluster)

# Run computation
result = computation.compute()

client.close()

Example with Dask on Kubernetes:

from dask_kubernetes import KubeCluster
from dask.distributed import Client

cluster = KubeCluster()
cluster.scale(20)  # 20 workers

client = Client(cluster)
result = computation.compute()

client.close()

Scheduler Configuration

Global Configuration

import dask

# Set the scheduler globally for the session (pick one; each call
# overrides the previous setting)
dask.config.set(scheduler='threads')      # threaded scheduler
dask.config.set(scheduler='processes')    # multiprocessing scheduler
dask.config.set(scheduler='synchronous')  # single-threaded scheduler

Context Manager

import dask

# Temporarily use different scheduler
with dask.config.set(scheduler='processes'):
    result = computation.compute()

# Back to default scheduler
result2 = computation2.compute()

Per-Compute

# Specify scheduler per compute call
result = computation.compute(scheduler='threads')
result = computation.compute(scheduler='processes')
result = computation.compute(scheduler='synchronous')

Distributed Client

from dask.distributed import Client

# Using client automatically sets distributed scheduler
client = Client()

# All computations use distributed scheduler
result = computation.compute()

client.close()

Choosing the Right Scheduler

Decision Matrix

| Workload Type              | Recommended Scheduler | Rationale                        |
|----------------------------|-----------------------|----------------------------------|
| NumPy/Pandas operations    | Threads (default)     | GIL-releasing, shared memory     |
| Pure Python objects        | Processes             | Avoids GIL contention            |
| Text/log processing        | Processes             | Python-heavy operations          |
| Debugging                  | Synchronous           | Easy debugging, deterministic    |
| Need dashboard             | Local Distributed     | Monitoring and diagnostics       |
| Multi-machine              | Cluster Distributed   | Exceeds single machine capacity  |
| Small data, quick tasks    | Threads               | Lowest overhead                  |
| Large data, single machine | Local Distributed     | Better memory management         |

Performance Considerations

Threads:

  • Overhead: ~10 µs per task
  • Best for: Numeric operations
  • Memory: Shared
  • GIL: Affected by GIL

Processes:

  • Overhead: ~10 ms per task
  • Best for: Python operations
  • Memory: Copied between processes
  • GIL: Not affected

Synchronous:

  • Overhead: ~1 µs per task
  • Best for: Debugging
  • Memory: Single process, nothing copied
  • GIL: Not relevant

Distributed:

  • Overhead: ~1 ms per task
  • Best for: Complex workflows, monitoring
  • Memory: Managed by scheduler
  • GIL: Workers can use threads or processes
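
These overheads are rough figures; a quick way to gauge them on your own machine is a micro-benchmark like the sketch below (timings are illustrative and vary by hardware):

import time
import dask

@dask.delayed
def inc(x):
    return x + 1

tasks = [inc(i) for i in range(1000)]

if __name__ == '__main__':  # guard required for the processes scheduler
    for sched in ['synchronous', 'threads', 'processes']:
        start = time.perf_counter()
        dask.compute(*tasks, scheduler=sched)
        elapsed = time.perf_counter() - start
        print(f'{sched}: {elapsed / len(tasks) * 1e6:.0f} µs/task')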

Thread Configuration for Distributed Scheduler

Setting Thread Count

from dask.distributed import Client

# Control thread/worker configuration
client = Client(
    n_workers=4,           # Number of worker processes
    threads_per_worker=2   # Threads per worker process
)

For Numeric Workloads:

  • Aim for roughly 4 threads per process
  • Balance between parallelism and overhead
  • Example: 8 cores → 2 workers with 4 threads each

For Python Workloads:

  • Use more workers with fewer threads
  • Example: 8 cores → 8 workers with 1 thread each
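
A sketch for deriving these counts from the machine's core count (a heuristic; tune for your workload):

import os
from dask.distributed import Client

cores = os.cpu_count() or 1

# Numeric, GIL-releasing workload: fewer workers, ~4 threads each
client = Client(n_workers=max(cores // 4, 1), threads_per_worker=4)

# Python-heavy workload: one thread per worker instead
# client = Client(n_workers=cores, threads_per_worker=1)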

Environment Variables

# Dask reads DASK_-prefixed environment variables; double underscores
# map to nested keys in the Dask config, for example:
export DASK_DISTRIBUTED__WORKER__MEMORY__TARGET=0.8

# Worker and thread counts are usually passed to Client/LocalCluster,
# or on the CLI: dask worker tcp://scheduler:8786 --nworkers 4 --nthreads 2

Common Patterns

Development to Production

# Development: Use local distributed for testing
from dask.distributed import Client
client = Client(processes=False)  # In-process for debugging

# Production: Scale to cluster
from dask.distributed import Client
client = Client('scheduler-address:8786')

Mixed Workloads

import dask
import dask.dataframe as dd

# Use threads for DataFrame operations
ddf = dd.read_parquet('data.parquet')
result1 = ddf.mean().compute(scheduler='threads')

# Use processes for Python code
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(parse_log).compute(scheduler='processes')

Debugging Workflow

import dask

# Step 1: Debug with synchronous scheduler
dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()

# Step 2: Test with threads
dask.config.set(scheduler='threads')
result = computation.compute()

# Step 3: Scale with distributed
from dask.distributed import Client
client = Client()
result = computation.compute()

Monitoring and Diagnostics

Dashboard Access (Distributed Only)

from dask.distributed import Client

client = Client()

# Get dashboard URL
print(client.dashboard_link)
# Opens dashboard in browser showing:
# - Task progress
# - Worker status
# - Memory usage
# - Task stream
# - Resource utilization

Performance Profiling

# Profile computation
from dask.distributed import Client

client = Client()
result = computation.compute()

# Get performance report
client.profile(filename='profile.html')
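
For a shareable HTML report covering an entire computation, dask.distributed also provides the performance_report context manager:

from dask.distributed import Client, performance_report

client = Client()

# Everything computed inside the block is captured in the report
with performance_report(filename='dask-report.html'):
    result = computation.compute()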

Resource Monitoring

# Check worker info
client.scheduler_info()

# See which workers hold which data
client.who_has()

# Memory usage on each worker (psutil is a dependency of distributed)
import psutil
client.run(lambda: psutil.virtual_memory().percent)

Advanced Configuration

Custom Executors

from concurrent.futures import ThreadPoolExecutor
import dask

# Route the threaded scheduler through a custom thread pool
with dask.config.set(pool=ThreadPoolExecutor(max_workers=4)):
    result = computation.compute(scheduler='threads')

Adaptive Scaling (Distributed)

from dask.distributed import Client

client = Client()

# Enable adaptive scaling
client.cluster.adapt(minimum=2, maximum=10)

# Cluster scales based on workload
result = computation.compute()

Worker Plugins

from dask.distributed import Client, WorkerPlugin

class CustomPlugin(WorkerPlugin):
    def setup(self, worker):
        # Initialize worker-specific resources
        worker.custom_resource = initialize_resource()

client = Client()
client.register_worker_plugin(CustomPlugin())

Troubleshooting

Slow Performance with Threads

Problem: Pure Python code runs slowly with the threaded scheduler
Solution: Switch to the processes or distributed scheduler

Memory Errors with Processes

Problem: Data too large to pickle/copy between processes
Solution: Use the threaded or distributed scheduler

Debugging Difficult

Problem: Can't use pdb with parallel schedulers
Solution: Use the synchronous scheduler for debugging

Task Overhead High

Problem: Many tiny tasks cause scheduling overhead
Solution: Use the threaded scheduler (lowest overhead) or increase chunk sizes, as in the sketch below
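
For example, with Dask Arrays the task count is controlled by chunk size; a minimal sketch (chunk sizes are illustrative, pick them to fit in memory):

import dask.array as da

# (1000, 1000) chunks -> 100 chunks of ~8 MB: 100 tasks per elementwise op
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# (5000, 5000) chunks -> 4 chunks of ~200 MB: far fewer tasks, less overhead
y = da.random.random((10000, 10000), chunks=(5000, 5000))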