# Dask Best Practices

## Performance Optimization Principles

### Start with Simpler Solutions First

Before implementing parallel computing with Dask, explore these alternatives:
- Better algorithms for the specific problem
- Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
- Compiled code via Numba or Cython
- Data sampling for development and testing

These alternatives often provide better returns than distributed systems and should be exhausted before scaling to parallel computing.

### Chunk Size Strategy

**Critical Rule**: Chunks should be small enough that many fit in a worker's available memory at once.

**Recommended Target**: Size chunks so workers can hold 10 chunks per core without exceeding available memory.

**Why It Matters**:
- Chunks too large: workers run out of memory, and there are too few tasks to keep every core busy
- Chunks too small: per-task scheduling overhead outweighs the useful work in each task

**Example Calculation**:
- 8 cores with 32 GB RAM
- Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
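
The arithmetic above can be sketched as a small helper. The function name `target_chunk_bytes` and its parameters are hypothetical, and decimal gigabytes are assumed:

```python
def target_chunk_bytes(total_memory_bytes, cores, chunks_per_core=10):
    """Return a rough upper bound on chunk size, in bytes.

    Hypothetical helper: splits memory evenly across cores, then across
    the number of chunks each core should be able to hold at once.
    """
    return total_memory_bytes / cores / chunks_per_core


# 8 cores sharing 32 GB (decimal gigabytes assumed)
chunk_bytes = target_chunk_bytes(32e9, cores=8)
print(f"{chunk_bytes / 1e6:.0f} MB per chunk")  # prints "400 MB per chunk"
```

Treat the result as an upper bound rather than a goal: operations that create temporary copies need headroom beyond the chunks themselves.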

### Monitor with the Dashboard

The Dask dashboard provides essential visibility into:
- Worker states and resource utilization
- Task progress and bottlenecks
- Memory usage patterns
- Performance characteristics

Use the dashboard to see what is actually slow in a parallel workload rather than guessing at optimizations. With the distributed scheduler running locally, it is served at `http://localhost:8787/status` by default.

## Critical Pitfalls to Avoid

### 1. Don't Create Large Objects Locally Before Dask

**Wrong Approach**:
```python
import pandas as pd
import dask.dataframe as dd

# Loads entire dataset into memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)
```

**Correct Approach**:
```python
import dask.dataframe as dd

# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')
```

**Why**: Loading data with pandas or NumPy first reads the entire dataset into a single process's memory. Dask must then serialize those objects and embed them in the task graph, which defeats the purpose of loading in parallel.

**Key Principle**: Use Dask methods to load data and use Dask to control the results.

### 2. Avoid Repeated compute() Calls

**Wrong Approach**:
```python
results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is separate
    results.append(result)
```

**Correct Approach**:
```python
import dask

computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)  # Single compute for all; returns a tuple
```

**Why**: Calling compute in loops prevents Dask from:
- Parallelizing different computations
- Sharing intermediate results
- Optimizing the overall task graph
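
To make the point about shared intermediates concrete, here is a minimal sketch using `dask.delayed`. The functions `load`, `total`, and `largest` are hypothetical stand-ins for real work:

```python
import dask
from dask import delayed


@delayed
def load(n):
    # Stand-in for an expensive loading step (hypothetical)
    return list(range(n))


@delayed
def total(values):
    return sum(values)


@delayed
def largest(values):
    return max(values)


values = load(5)  # shared intermediate, referenced by both results

# One call builds a single graph, so load(5) is a single shared task
results = dask.compute(total(values), largest(values))
print(results)  # (10, 4)
```

Calling `.compute()` on each result separately would run `load` twice, once per call.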

### 3. Don't Build Excessively Large Task Graphs

**Symptoms**:
- Millions of tasks in a single computation
- Severe scheduling overhead
- Long delays before computation starts

**Solutions**:
- Increase chunk sizes to reduce number of tasks
- Use `map_partitions` or `map_blocks` to fuse operations
- Break computations into smaller pieces with intermediate persists
- Consider whether the problem truly requires distributed computing

**Example Using map_partitions**:
```python
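# Instead of a row-wise apply, which calls a Python function once per row
ddf['result'] = ddf.apply(complex_function, axis=1)  # Slow per-row calls

# Apply a vectorized function to entire partitions at once
# (here complex_function must accept a pandas DataFrame and return a Series)
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))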
```

## Infrastructure Considerations

### Scheduler Selection

**Use Threads For**:
- Numeric work with GIL-releasing libraries (NumPy, Pandas, scikit-learn)
- Operations that benefit from shared memory
- Single-machine workloads with array/dataframe operations

**Use Processes For**:
- Text processing and Python collection operations
- Pure Python code that's GIL-bound
- Operations that need process isolation

**Use Distributed Scheduler For**:
- Multi-machine clusters
- Access to the diagnostic dashboard
- Asynchronous APIs
- Better data locality handling
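
As a rough sketch, a single-machine scheduler can be chosen per call with the `scheduler` keyword of `compute()`. The `square` function is a hypothetical placeholder for real work:

```python
from dask import delayed


@delayed
def square(x):
    return x * x


total = delayed(sum)([square(i) for i in range(10)])

# Thread pool: suits GIL-releasing numeric code
threaded = total.compute(scheduler="threads")

# Single thread: no parallelism, useful as a baseline or for debugging
sequential = total.compute(scheduler="synchronous")

# scheduler="processes" selects the multiprocessing scheduler the same way;
# on platforms that spawn processes, run it under `if __name__ == "__main__":`
print(threaded, sequential)  # 285 285
```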

### Thread Configuration

**Recommendation**: Aim for roughly 4 threads per process on numeric workloads.

**Rationale**:
- Balance between parallelism and overhead
- Allows efficient use of CPU cores
- Reduces context switching costs
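
One way to apply this recommendation is sketched below. The helpers `worker_layout` and `start_client` are hypothetical, and the integer division may leave a few cores unused:

```python
def worker_layout(total_cores, threads_per_worker=4):
    """Split cores into processes of roughly `threads_per_worker` threads each.

    Hypothetical helper; returns (n_workers, threads_per_worker).
    """
    threads = min(threads_per_worker, total_cores)
    n_workers = max(1, total_cores // threads)
    return n_workers, threads


def start_client(total_cores):
    # Imported lazily so worker_layout works without dask.distributed installed
    from dask.distributed import Client

    n_workers, threads = worker_layout(total_cores)
    return Client(n_workers=n_workers, threads_per_worker=threads)


print(worker_layout(16))  # (4, 4)
```

`start_client` is not called here because it launches a local cluster; call it from a script's main entry point.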

### Memory Management

**Persist Strategically**:
```python
# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()
```

**Clear Memory When Done**:
```python
# Explicitly delete large objects
del intermediate
```

## Data Loading Best Practices

### Use Appropriate File Formats

**For Tabular Data**:
- Parquet: Columnar, compressed, fast filtering
- CSV: Only for small data or initial ingestion

**For Array Data**:
- HDF5: Good for numeric arrays
- Zarr: Cloud-native, parallel-friendly
- NetCDF: Scientific data with metadata

### Optimize Data Ingestion

**Read Multiple Files Efficiently**:
```python
# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')
```

**Specify Useful Columns Early**:
```python
# Only read needed columns
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])
```

## Common Patterns and Solutions

### Pattern: Embarrassingly Parallel Problems

For independent computations, use Futures:
```python
from dask.distributed import Client

client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)
```

### Pattern: Data Preprocessing Pipeline

Use Bags for initial ETL, then convert to structured formats:
```python
import json

import dask.bag as db

# Process raw JSON (read_text yields lines, so one JSON record per line)
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')

# Convert to DataFrame for analysis
ddf = bag.to_dataframe()
```

### Pattern: Iterative Algorithms

Persist data between iterations:
```python
data = dd.read_parquet('data.parquet')
data = data.persist()  # Keep in memory across iterations

for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist updated version
```

## Debugging Tips

### Use Single-Threaded Scheduler

For debugging with pdb or detailed error inspection:
```python
import dask

dask.config.set(scheduler='synchronous')
result = computation.compute()  # Runs in single thread for debugging
```
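
The setting can also be scoped with a context manager so that only the code under inspection runs single-threaded. In this sketch, `parse` is a hypothetical function that fails on bad input:

```python
import dask
from dask import delayed


@delayed
def parse(value):
    # Hypothetical step that fails on bad input
    return int(value)


total = delayed(sum)([parse(v) for v in ["1", "2", "x"]])

error = None
# Scope the setting so only this block runs single-threaded
with dask.config.set(scheduler="synchronous"):
    try:
        total.compute()
    except ValueError as exc:
        # The task ran in this thread, so pdb.post_mortem() can inspect parse()
        error = exc

print(type(error).__name__)  # ValueError
```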

### Check Task Graph Size

Before computing, check the number of tasks:
```python
print(len(ddf.__dask_graph__()))  # Should be reasonable, not millions
```
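
To see how chunking drives graph size, the following sketch compares two chunkings of the same array. The shapes and chunk sizes are arbitrary illustrations, and nothing is computed:

```python
import dask.array as da

# Same logical array, two chunking choices
fine = da.ones((1000, 1000), chunks=(50, 50))
coarse = da.ones((1000, 1000), chunks=(500, 500))

fine_tasks = len(fine.sum().__dask_graph__())
coarse_tasks = len(coarse.sum().__dask_graph__())

print(fine.npartitions, coarse.npartitions)  # 400 4
print(fine_tasks > coarse_tasks)             # True
```

Exact task counts vary between Dask versions, so compare them relative to each other rather than against a fixed number.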

### Validate on Small Data First

Test logic on small subset before scaling:
```python
# head() returns a pandas DataFrame drawn from the first partition
sample = ddf.head(1000)
# Validate results on the sample with plain pandas
# Then scale to the full dataset
```

## Performance Troubleshooting

### Symptom: Slow Computation Start

**Likely Cause**: Task graph is too large

**Solution**: Increase chunk sizes or use `map_partitions`

### Symptom: Memory Errors

**Likely Causes**:
- Chunks too large
- Too many intermediate results
- Memory leaks in user functions

**Solutions**:
- Decrease chunk sizes
- Use `persist()` strategically and delete when done
- Profile user functions for memory issues
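
For arrays, decreasing chunk size can be sketched with `rechunk`. The helper `max_chunk_bytes` is hypothetical and the shapes are illustrative:

```python
import math

import dask.array as da


def max_chunk_bytes(arr):
    # Bytes in the largest chunk: elements per chunk times bytes per element
    return math.prod(arr.chunksize) * arr.dtype.itemsize


x = da.zeros((4000, 4000), chunks=(4000, 4000))  # one 128 MB float64 chunk
smaller = x.rechunk((1000, 1000))                # sixteen 8 MB chunks

print(max_chunk_bytes(x), max_chunk_bytes(smaller), smaller.numblocks)
```

Where possible, choose smaller chunks when the array is first created, since `rechunk` adds its own tasks to the graph.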

### Symptom: Poor Parallelization

**Likely Causes**:
- Data dependencies preventing parallelism
- Chunks too large (not enough tasks)
- GIL contention with threads on Python code

**Solutions**:
- Restructure computation to reduce dependencies
- Increase number of partitions
- Switch to multiprocessing scheduler for Python code