# Dask Schedulers

## Overview

Dask provides multiple task schedulers, each suited to different workloads. The scheduler determines how tasks are executed: sequentially, in parallel threads, in parallel processes, or distributed across a cluster.

## Scheduler Types

### Single-Machine Schedulers

#### 1. Local Threads (Default)

**Description**: The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`.

**When to Use**:
- Numeric computations in NumPy, Pandas, scikit-learn
- Libraries that release the GIL (Global Interpreter Lock)
- Operations that benefit from shared memory access
- Default for Dask Arrays and DataFrames

**Characteristics**:
- Low overhead
- Shared memory between threads
- Best for GIL-releasing operations
- Poor for pure Python code (GIL contention)

**Example**:
```python
import dask.array as da

# Uses threads by default
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = x.mean().compute()  # Computed with threads
```

**Explicit Configuration**:
```python
import dask

# Set globally
dask.config.set(scheduler='threads')

# Or per-compute
result = x.mean().compute(scheduler='threads')
```

#### 2. Local Processes

**Description**: The multiprocessing scheduler executes computations with a local `concurrent.futures.ProcessPoolExecutor`.

**When to Use**:
- Pure Python code with GIL contention
- Text processing and Python collections
- Operations that benefit from process isolation
- CPU-bound Python code

**Characteristics**:
- Bypasses GIL limitations
- Incurs data transfer costs between processes
- Higher overhead than threads
- Ideal for linear workflows with small inputs/outputs

**Example**:
```python
import dask.bag as db

# Good for Python object processing
bag = db.read_text('data/*.txt')
result = bag.map(complex_python_function).compute(scheduler='processes')
```

**Explicit Configuration**:
```python
import dask

# Set globally
dask.config.set(scheduler='processes')

# Or per-compute
result = computation.compute(scheduler='processes')
```

**Limitations**:
- Data must be serializable (pickle)
- Overhead from process creation
- Memory overhead from data copying

#### 3. Single Thread (Synchronous)

**Description**: The single-threaded synchronous scheduler executes all computations in the local thread, with no parallelism at all.

**When to Use**:
- Debugging with pdb
- Profiling with standard Python tools
- Understanding errors in detail
- Development and testing

**Characteristics**:
- No parallelism
- Easy debugging
- Minimal overhead
- Deterministic execution

**Example**:
```python
import dask

# Enable for debugging
dask.config.set(scheduler='synchronous')

# Now pdb works as usual
result = computation.compute()  # Runs in a single thread
```

**Debugging with IPython**:
```python
# In IPython/Jupyter
%pdb on
dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()  # Drops into debugger on error
```

### Distributed Schedulers

#### 4. Local Distributed

**Description**: Despite its name, this scheduler runs well on a personal machine, bringing the distributed scheduler's infrastructure to a single computer.
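As a minimal sketch of what happens under the hood: `Client()` with no arguments creates a `LocalCluster` behind the scenes, so the explicit two-step form below is equivalent (worker counts here are illustrative):

```python
from dask.distributed import Client, LocalCluster

# Explicit form of what Client() does by default
cluster = LocalCluster(n_workers=2, threads_per_worker=2)
client = Client(cluster)

print(client)  # Summarizes workers, threads, and memory

client.close()
cluster.close()
```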
**When to Use**:
- Need the diagnostic dashboard
- Need asynchronous APIs
- Better data locality handling than multiprocessing
- Development before scaling to a cluster
- Want distributed features on a single machine

**Characteristics**:
- Provides a dashboard for monitoring
- Better memory management
- More overhead than threads/processes
- Can scale to a cluster later

**Example**:
```python
from dask.distributed import Client
import dask.dataframe as dd

# Create a local cluster
client = Client()  # Automatically uses all cores

# Use the distributed scheduler
ddf = dd.read_csv('data.csv')
result = ddf.groupby('category').mean().compute()

# View the dashboard
print(client.dashboard_link)

# Clean up
client.close()
```

**Configuration Options**:
```python
# Control resources
client = Client(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB'
)
```

#### 5. Cluster Distributed

**Description**: The same distributed scheduler, deployed across multiple machines for scaling beyond a single node.

**When to Use**:
- Data exceeds single-machine capacity
- Need computational power beyond one machine
- Production deployments
- Cluster computing environments (HPC, cloud)

**Characteristics**:
- Scales to hundreds of machines
- Requires cluster setup
- Network communication overhead
- Advanced features (adaptive scaling, task prioritization)

**Example with Dask-Jobqueue (HPC)**:
```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Create a cluster on an HPC system with SLURM
cluster = SLURMCluster(
    cores=24,
    memory='100GB',
    walltime='02:00:00',
    queue='regular'
)

# Scale to 10 jobs
cluster.scale(jobs=10)

# Connect a client
client = Client(cluster)

# Run the computation
result = computation.compute()

client.close()
```

**Example with Dask on Kubernetes**:
```python
from dask_kubernetes import KubeCluster
from dask.distributed import Client

cluster = KubeCluster()
cluster.scale(20)  # 20 workers

client = Client(cluster)
result = computation.compute()
client.close()
```

## Scheduler Configuration

### Global Configuration

```python
import dask

# Set the scheduler globally for the session
dask.config.set(scheduler='threads')
dask.config.set(scheduler='processes')
dask.config.set(scheduler='synchronous')
```

### Context Manager

```python
import dask

# Temporarily use a different scheduler
with dask.config.set(scheduler='processes'):
    result = computation.compute()

# Back to the default scheduler
result2 = computation2.compute()
```

### Per-Compute

```python
# Specify the scheduler per compute call
result = computation.compute(scheduler='threads')
result = computation.compute(scheduler='processes')
result = computation.compute(scheduler='synchronous')
```

### Distributed Client

```python
from dask.distributed import Client

# Creating a client automatically sets the distributed scheduler
client = Client()

# All computations now use the distributed scheduler
result = computation.compute()

client.close()
```

## Choosing the Right Scheduler

### Decision Matrix

| Workload Type | Recommended Scheduler | Rationale |
|--------------|----------------------|-----------|
| NumPy/Pandas operations | Threads (default) | GIL-releasing, shared memory |
| Pure Python objects | Processes | Avoids GIL contention |
| Text/log processing | Processes | Python-heavy operations |
| Debugging | Synchronous | Easy debugging, deterministic |
| Need dashboard | Local Distributed | Monitoring and diagnostics |
| Multi-machine | Cluster Distributed | Exceeds single-machine capacity |
| Small data, quick tasks | Threads | Lowest overhead |
| Large data, single machine | Local Distributed | Better memory management |
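To check these recommendations on your own hardware, a small timing sketch (array sizes are illustrative, and absolute numbers will vary by machine; for a numeric reduction like this, threads should come out ahead, matching the matrix above):

```python
import time
import dask.array as da

def main():
    x = da.random.random((4000, 4000), chunks=(1000, 1000))

    # Time the same reduction under each single-machine scheduler
    for scheduler in ['synchronous', 'threads', 'processes']:
        start = time.perf_counter()
        x.mean().compute(scheduler=scheduler)
        print(f'{scheduler}: {time.perf_counter() - start:.2f}s')

# Guard is needed for the processes scheduler on platforms that spawn workers
if __name__ == '__main__':
    main()
```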
### Performance Considerations

**Threads**:
- Overhead: ~10 µs per task
- Best for: Numeric operations
- Memory: Shared
- GIL: Affected by the GIL

**Processes**:
- Overhead: ~10 ms per task
- Best for: Python operations
- Memory: Copied between processes
- GIL: Not affected

**Synchronous**:
- Overhead: ~1 µs per task
- Best for: Debugging
- Memory: No parallelism
- GIL: Not relevant

**Distributed**:
- Overhead: ~1 ms per task
- Best for: Complex workflows, monitoring
- Memory: Managed by the scheduler
- GIL: Workers can use threads or processes

## Thread Configuration for Distributed Scheduler

### Setting Thread Count

```python
from dask.distributed import Client

# Control thread/worker configuration
client = Client(
    n_workers=4,           # Number of worker processes
    threads_per_worker=2   # Threads per worker process
)
```

### Recommended Configuration

**For Numeric Workloads**:
- Aim for roughly 4 threads per process
- Balance parallelism against per-thread overhead
- Example: 8 cores → 2 workers with 4 threads each

**For Python Workloads**:
- Use more workers with fewer threads
- Example: 8 cores → 8 workers with 1 thread each

### Environment Variables and CLI Flags

```bash
# Most dask.config values map to DASK_-prefixed environment variables
export DASK_SCHEDULER=threads

# Worker and thread counts are set when launching workers from the shell
dask worker scheduler-address:8786 --nworkers 4 --nthreads 2
```

## Common Patterns

### Development to Production

```python
# Development: use a local distributed client for testing
from dask.distributed import Client
client = Client(processes=False)  # In-process for debugging

# Production: point the client at a cluster
from dask.distributed import Client
client = Client('scheduler-address:8786')
```

### Mixed Workloads

```python
import dask
import dask.dataframe as dd

# Use threads for DataFrame operations
ddf = dd.read_parquet('data.parquet')
result1 = ddf.mean().compute(scheduler='threads')

# Use processes for Python code
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(parse_log).compute(scheduler='processes')
```

### Debugging Workflow

```python
import dask

# Step 1: Debug with the synchronous scheduler
dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()

# Step 2: Test with threads
dask.config.set(scheduler='threads')
result = computation.compute()

# Step 3: Scale with distributed
from dask.distributed import Client
client = Client()
result = computation.compute()
```

## Monitoring and Diagnostics

### Dashboard Access (Distributed Only)

```python
from dask.distributed import Client

client = Client()

# Get the dashboard URL
print(client.dashboard_link)

# The dashboard shows:
# - Task progress
# - Worker status
# - Memory usage
# - Task stream
# - Resource utilization
```

### Performance Profiling

```python
# Profile a computation
from dask.distributed import Client

client = Client()
result = computation.compute()

# Save a performance report
client.profile(filename='profile.html')
```

### Resource Monitoring

```python
import psutil

# Check scheduler and worker info
client.scheduler_info()

# See which workers hold which results
client.who_has()

# Memory usage on each worker
client.run(lambda: psutil.virtual_memory().percent)
```

## Advanced Configuration

### Custom Executors

```python
from concurrent.futures import ThreadPoolExecutor
import dask

# Use a custom thread pool for the threaded scheduler
with ThreadPoolExecutor(max_workers=4) as executor:
    with dask.config.set(pool=executor, scheduler='threads'):
        result = computation.compute()
```

### Adaptive Scaling (Distributed)

```python
from dask.distributed import Client

client = Client()

# Enable adaptive scaling
client.cluster.adapt(minimum=2, maximum=10)

# The cluster scales based on workload
result = computation.compute()
```
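Adaptive mode also works with deployment-specific clusters. As a sketch, assuming a recent `dask_jobqueue`, the `minimum_jobs`/`maximum_jobs` keywords bound the number of SLURM jobs rather than individual workers (resource values here are illustrative):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(cores=8, memory='32GB')
cluster.adapt(minimum_jobs=1, maximum_jobs=8)  # Scale whole SLURM jobs with load

client = Client(cluster)
```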
### Worker Plugins

```python
from dask.distributed import Client, WorkerPlugin

class CustomPlugin(WorkerPlugin):
    def setup(self, worker):
        # Initialize worker-specific resources
        # (initialize_resource is a placeholder for your own setup logic)
        worker.custom_resource = initialize_resource()

client = Client()
client.register_worker_plugin(CustomPlugin())
```

## Troubleshooting

### Slow Performance with Threads

**Problem**: Pure Python code runs slowly with the threaded scheduler.

**Solution**: Switch to the processes or distributed scheduler.

### Memory Errors with Processes

**Problem**: Data is too large to pickle/copy between processes.

**Solution**: Use the threaded or distributed scheduler.

### Debugging Difficult

**Problem**: Can't use pdb with parallel schedulers.

**Solution**: Use the synchronous scheduler for debugging.

### Task Overhead High

**Problem**: Many tiny tasks cause scheduling overhead.

**Solution**: Use the threaded scheduler (lowest overhead) or increase chunk sizes, as sketched below.
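For the last case, a quick sketch of reducing task count by enlarging chunks with `dask.array`'s `rechunk` (the chunk sizes are illustrative):

```python
import dask.array as da

# A 100x100 grid of tiny chunks -> 10,000 tasks per elementwise operation
x = da.random.random((10000, 10000), chunks=(100, 100))
print(x.numblocks)  # (100, 100)

# Rechunk to a 4x4 grid -> far fewer, larger tasks
y = x.rechunk((2500, 2500))
print(y.numblocks)  # (4, 4)

result = y.mean().compute()  # Much less per-task overhead
```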