# Dask Schedulers

## Overview

Dask provides multiple task schedulers, each suited to different workloads. The scheduler determines how tasks are executed: sequentially, in parallel threads, in parallel processes, or distributed across a cluster.
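
All schedulers consume the same task graphs, so a single lazy computation can be executed by any of them. A minimal sketch with `dask.delayed`:

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

# One lazy graph, executed by whichever scheduler you request
total = dask.delayed(sum)([inc(i) for i in range(5)])
print(total.compute(scheduler="threads"))      # thread pool
print(total.compute(scheduler="synchronous"))  # single thread, same result
```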

## Scheduler Types

### Single-Machine Schedulers

#### 1. Local Threads (Default)

**Description**: The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`.

**When to Use**:
- Numeric computations in NumPy, Pandas, scikit-learn
- Libraries that release the GIL (Global Interpreter Lock)
- Operations that benefit from shared memory access
- Default for Dask Arrays and DataFrames

**Characteristics**:
- Low overhead
- Shared memory between threads
- Best for GIL-releasing operations
- Poor for pure Python code (GIL contention)

**Example**:
```python
import dask.array as da

# Uses threads by default
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = x.mean().compute()  # Computed with threads
```

**Explicit Configuration**:
```python
import dask

# Set globally
dask.config.set(scheduler='threads')

# Or per-compute
result = x.mean().compute(scheduler='threads')
```

#### 2. Local Processes

**Description**: Multiprocessing scheduler that uses `concurrent.futures.ProcessPoolExecutor`.

**When to Use**:
- Pure Python code with GIL contention
- Text processing and Python collections
- Operations that benefit from process isolation
- CPU-bound Python code

**Characteristics**:
- Bypasses GIL limitations
- Incurs data transfer costs between processes
- Higher overhead than threads
- Ideal for linear workflows with small inputs/outputs

**Example**:
```python
import dask.bag as db

# Good for Python object processing
bag = db.read_text('data/*.txt')
result = bag.map(complex_python_function).compute(scheduler='processes')
```

**Explicit Configuration**:
```python
import dask

# Set globally
dask.config.set(scheduler='processes')

# Or per-compute
result = computation.compute(scheduler='processes')
```

**Limitations**:
- Data must be serializable (pickle); see the sketch below
- Overhead from process creation
- Memory overhead from data copying
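
A small sketch of the pickling constraint — `threading.Lock` stands in for any result that cannot be pickled across a process boundary:

```python
import threading
import dask

@dask.delayed
def make_lock():
    return threading.Lock()  # locks cannot be pickled

# Fails with processes: the result must cross a process boundary via pickle
try:
    make_lock().compute(scheduler='processes')
except Exception as e:
    print(type(e).__name__, e)

# Works with threads: shared memory, no serialization needed
lock = make_lock().compute(scheduler='threads')
```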

#### 3. Single Thread (Synchronous)

**Description**: The single-threaded synchronous scheduler executes all computations in the local thread, with no parallelism at all.

**When to Use**:
- Debugging with pdb
- Profiling with standard Python tools
- Understanding errors in detail
- Development and testing

**Characteristics**:
- No parallelism
- Easy debugging
- Minimal overhead
- Deterministic execution

**Example**:
```python
import dask

# Enable for debugging
dask.config.set(scheduler='synchronous')

# Now can use pdb
result = computation.compute()  # Runs in single thread
```

**Debugging with IPython**:
```python
# In IPython/Jupyter
%pdb on

dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()  # Drops into debugger on error
```

### Distributed Schedulers

#### 4. Local Distributed

**Description**: Despite its name, the distributed scheduler runs well on a personal machine, providing the full distributed infrastructure on a single host.

**When to Use**:
- Need diagnostic dashboard
- Asynchronous APIs
- Better data locality handling than multiprocessing
- Development before scaling to cluster
- Want distributed features on single machine

**Characteristics**:
- Provides dashboard for monitoring
- Better memory management
- More overhead than threads/processes
- Can scale to cluster later

**Example**:
```python
from dask.distributed import Client
import dask.dataframe as dd

# Create local cluster
client = Client()  # Automatically uses all cores

# Use distributed scheduler
ddf = dd.read_csv('data.csv')
result = ddf.groupby('category').mean().compute()

# View dashboard
print(client.dashboard_link)

# Clean up
client.close()
```

**Configuration Options**:
```python
# Control resources
client = Client(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB'
)
```

#### 5. Cluster Distributed

**Description**: For scaling across multiple machines using the distributed scheduler.

**When to Use**:
- Data exceeds single machine capacity
- Need computational power beyond one machine
- Production deployments
- Cluster computing environments (HPC, cloud)

**Characteristics**:
- Scales to hundreds of machines
- Requires cluster setup
- Network communication overhead
- Advanced features (adaptive scaling, task prioritization)

**Example with Dask-Jobqueue (HPC)**:
```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Create cluster on HPC with SLURM
cluster = SLURMCluster(
    cores=24,
    memory='100GB',
    walltime='02:00:00',
    queue='regular'
)

# Scale to 10 jobs
cluster.scale(jobs=10)

# Connect client
client = Client(cluster)

# Run computation
result = computation.compute()

client.close()
```

**Example with Dask on Kubernetes**:
```python
from dask_kubernetes import KubeCluster
from dask.distributed import Client

cluster = KubeCluster()
cluster.scale(20)  # 20 workers

client = Client(cluster)
result = computation.compute()

client.close()
```

## Scheduler Configuration

### Global Configuration

```python
import dask

# Set scheduler globally for the session (pick one)
dask.config.set(scheduler='threads')
dask.config.set(scheduler='processes')
dask.config.set(scheduler='synchronous')
```

### Context Manager

```python
import dask

# Temporarily use a different scheduler
with dask.config.set(scheduler='processes'):
    result = computation.compute()

# Back to default scheduler
result2 = computation2.compute()
```

### Per-Compute

```python
# Specify scheduler per compute call
result = computation.compute(scheduler='threads')
result = computation.compute(scheduler='processes')
result = computation.compute(scheduler='synchronous')
```

### Distributed Client

```python
from dask.distributed import Client

# Creating a client automatically makes the distributed scheduler the default
client = Client()

# All computations use distributed scheduler
result = computation.compute()

client.close()
```

## Choosing the Right Scheduler

### Decision Matrix

| Workload Type | Recommended Scheduler | Rationale |
|--------------|----------------------|-----------|
| NumPy/Pandas operations | Threads (default) | GIL-releasing, shared memory |
| Pure Python objects | Processes | Avoids GIL contention |
| Text/log processing | Processes | Python-heavy operations |
| Debugging | Synchronous | Easy debugging, deterministic |
| Need dashboard | Local Distributed | Monitoring and diagnostics |
| Multi-machine | Cluster Distributed | Exceeds single machine capacity |
| Small data, quick tasks | Threads | Lowest overhead |
| Large data, single machine | Local Distributed | Better memory management |

### Performance Considerations

**Threads**:
- Overhead: ~10 µs per task
- Best for: Numeric operations
- Memory: Shared
- GIL: Affected by GIL

**Processes**:
- Overhead: ~10 ms per task
- Best for: Python operations
- Memory: Copied between processes
- GIL: Not affected

**Synchronous**:
- Overhead: ~1 µs per task
- Best for: Debugging
- Memory: Single process, shared
- GIL: Not relevant

**Distributed**:
- Overhead: ~1 ms per task
- Best for: Complex workflows, monitoring
- Memory: Managed by scheduler
- GIL: Workers can use threads or processes
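
These figures are order-of-magnitude estimates and vary by machine. A minimal sketch for measuring the trade-off yourself on a GIL-bound task — expect processes to beat threads here, and the reverse for GIL-releasing NumPy code:

```python
import time
import dask.bag as db

def gil_bound(x):
    # Pure-Python loop: holds the GIL, so threads cannot help
    return sum(i * i for i in range(10_000)) + x

if __name__ == "__main__":  # guard needed by the processes scheduler on Windows/macOS
    bag = db.from_sequence(range(200), npartitions=8)
    for scheduler in ("synchronous", "threads", "processes"):
        start = time.perf_counter()
        bag.map(gil_bound).compute(scheduler=scheduler)
        print(f"{scheduler}: {time.perf_counter() - start:.2f}s")
```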

## Thread Configuration for Distributed Scheduler

### Setting Thread Count

```python
from dask.distributed import Client

# Control thread/worker configuration
client = Client(
    n_workers=4,           # Number of worker processes
    threads_per_worker=2   # Threads per worker process
)
```

### Recommended Configuration

**For Numeric Workloads**:
- Aim for roughly 4 threads per process
- Balance between parallelism and overhead
- Example: 8 cores → 2 workers with 4 threads each

**For Python Workloads**:
- Use more workers with fewer threads
- Example: 8 cores → 8 workers with 1 thread each (see the sketch below)
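
A minimal sketch of both layouts, assuming a hypothetical 8-core machine:

```python
from dask.distributed import Client

# Numeric workload: few processes, more threads per process
client = Client(n_workers=2, threads_per_worker=4)
client.close()

# Pure-Python workload: one thread per worker sidesteps the GIL
client = Client(n_workers=8, threads_per_worker=1)
client.close()
```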

### Environment Variables

```bash
# Dask reads configuration from DASK_-prefixed environment variables;
# double underscores denote nested config keys
export DASK_SCHEDULER=threads
export DASK_DISTRIBUTED__WORKER__MEMORY__TARGET=0.60

# The same keys can live in a config file such as ~/.config/dask/dask.yaml
```

## Common Patterns

### Development to Production

```python
# Development: Use local distributed for testing
from dask.distributed import Client
client = Client(processes=False)  # In-process for easy debugging

# Production: Connect to an existing cluster
client = Client('scheduler-address:8786')
```

### Mixed Workloads

```python
import dask.dataframe as dd
import dask.bag as db

# Use threads for DataFrame operations
ddf = dd.read_parquet('data.parquet')
result1 = ddf.mean().compute(scheduler='threads')

# Use processes for Python-heavy code
bag = db.read_text('logs/*.txt')
result2 = bag.map(parse_log).compute(scheduler='processes')
```

### Debugging Workflow

```python
import dask

# Step 1: Debug with synchronous scheduler
dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()

# Step 2: Test with threads
dask.config.set(scheduler='threads')
result = computation.compute()

# Step 3: Scale with distributed
from dask.distributed import Client
client = Client()
result = computation.compute()
```

## Monitoring and Diagnostics

### Dashboard Access (Distributed Only)

```python
from dask.distributed import Client

client = Client()

# Get dashboard URL
print(client.dashboard_link)
# The dashboard shows:
# - Task progress
# - Worker status
# - Memory usage
# - Task stream
# - Resource utilization
```

### Performance Profiling

```python
from dask.distributed import Client

client = Client()
result = computation.compute()

# Save an aggregated profile of task execution
client.profile(filename='profile.html')
```
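
For a shareable record of a whole computation, the distributed scheduler also provides a `performance_report` context manager; a minimal sketch around an illustrative array computation:

```python
from dask.distributed import Client, performance_report
import dask.array as da

client = Client()

# Everything computed inside the block is captured in one HTML report
with performance_report(filename="dask-report.html"):
    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    x.mean().compute()

client.close()
```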

### Resource Monitoring

```python
import psutil  # must be installed on the workers too

# Check scheduler and worker info
client.scheduler_info()

# See which workers hold which results
client.who_has()

# Memory usage per worker
client.run(lambda: psutil.virtual_memory().percent)
```

## Advanced Configuration

### Custom Executors

```python
from concurrent.futures import ThreadPoolExecutor
import dask

# Route the threaded scheduler through a custom thread pool
with ThreadPoolExecutor(max_workers=4) as executor:
    with dask.config.set(pool=executor):
        result = computation.compute(scheduler='threads')
```

### Adaptive Scaling (Distributed)

```python
from dask.distributed import Client

client = Client()

# Enable adaptive scaling
client.cluster.adapt(minimum=2, maximum=10)

# Cluster scales based on workload
result = computation.compute()
```

### Worker Plugins

```python
from dask.distributed import Client, WorkerPlugin

class CustomPlugin(WorkerPlugin):
    def setup(self, worker):
        # Initialize worker-specific resources;
        # initialize_resource() is a placeholder for your own setup logic
        worker.custom_resource = initialize_resource()

client = Client()
client.register_worker_plugin(CustomPlugin())
```

## Troubleshooting

### Slow Performance with Threads
**Problem**: Pure Python code is slow with the threaded scheduler
**Solution**: Switch to the processes or distributed scheduler

### Memory Errors with Processes
**Problem**: Data too large to pickle/copy between processes
**Solution**: Use the threaded or distributed scheduler

### Debugging Difficult
**Problem**: Can't use pdb with parallel schedulers
**Solution**: Use the synchronous scheduler for debugging

### Task Overhead High
**Problem**: Many tiny tasks causing overhead
**Solution**: Use the threaded scheduler (lowest overhead) or increase chunk sizes, as in the sketch below
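
For array workloads, the usual fix is fewer, larger chunks; a minimal sketch with illustrative sizes:

```python
import dask.array as da

# 20×20 = 400 chunks → hundreds of tasks per operation; overhead dominates
x = da.random.random((10000, 10000), chunks=(500, 500))

# 4×4 = 16 chunks → far fewer tasks, each doing meaningful work
y = x.rechunk((2500, 2500))
result = y.mean().compute()
```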