# Dask Best Practices

## Performance Optimization Principles

### Start with Simpler Solutions First

Before implementing parallel computing with Dask, explore these alternatives:
- Better algorithms for the specific problem
- Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
- Compiled code via Numba or Cython
- Data sampling for development and testing

These alternatives often provide better returns than distributed systems and should be exhausted before scaling out to parallel computing.

### Chunk Size Strategy

**Critical Rule**: Chunks should be small enough that many of them fit in a worker's available memory at once.

**Recommended Target**: Size chunks so that each worker can hold roughly 10 chunks per core without exceeding its available memory.

**Why It Matters**:
- Chunks too large: memory overflow and poor parallelization
- Chunks too small: excessive scheduling overhead

**Example Calculation** (see the sketch below):
- 8 cores with 32 GB RAM
- Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
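
A minimal sketch of what that target looks like for a Dask array. The array and chunk shapes are illustrative, assuming float64 data (8 bytes per element), where a 7,000 × 7,000 chunk is roughly 400 MB:

```python
import dask.array as da

# Assumption: 8 cores, 32 GB RAM -> aim for ~400 MB per chunk.
# float64 elements are 8 bytes, so 7_000 * 7_000 * 8 bytes ~= 392 MB per chunk.
x = da.random.random((70_000, 70_000), chunks=(7_000, 7_000))

print(x.chunksize)   # (7000, 7000) -- shape of each chunk
print(x.numblocks)   # (10, 10) -- 100 chunks across the array
```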

### Monitor with the Dashboard

The Dask dashboard provides essential visibility into:
- Worker states and resource utilization
- Task progress and bottlenecks
- Memory usage patterns
- Performance characteristics

Access the dashboard to understand what is actually slow in a parallel workload rather than guessing at optimizations.
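
A minimal way to get the dashboard is to start the distributed scheduler locally; the exact address varies, but `Client.dashboard_link` reports it:

```python
from dask.distributed import Client

# Starting a local client also starts the diagnostic dashboard
client = Client()
print(client.dashboard_link)  # typically something like http://127.0.0.1:8787/status
```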

## Critical Pitfalls to Avoid

### 1. Don't Create Large Objects Locally Before Dask

**Wrong Approach**:
```python
import pandas as pd
import dask.dataframe as dd

# Loads the entire dataset into local memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)
```

**Correct Approach**:
```python
import dask.dataframe as dd

# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')
```

**Why**: Loading data with pandas or NumPy first forces the scheduler to serialize those objects and embed them in the task graph, defeating the purpose of parallel computing.

**Key Principle**: Load data with Dask's own readers and keep the results as Dask collections until you actually need them locally.
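
If some data genuinely must originate in local memory, one option with the distributed scheduler is to scatter it to the workers once and pass the resulting future around, rather than embedding the object in every task. A minimal sketch; `load_small_lookup_table`, `process_record`, and `records` are hypothetical placeholders:

```python
from dask.distributed import Client

client = Client()

lookup = load_small_lookup_table()  # hypothetical local object reused by many tasks
lookup_future = client.scatter(lookup, broadcast=True)  # ship it to the workers once

# Futures passed as arguments resolve to the scattered data on the workers
futures = [client.submit(process_record, rec, lookup_future) for rec in records]
results = client.gather(futures)
```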

### 2. Avoid Repeated compute() Calls

**Wrong Approach**:
```python
results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is a separate, blocking run
    results.append(result)
```

**Correct Approach**:
```python
import dask

computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)  # Single compute for all
```

**Why**: Calling compute in a loop prevents Dask from:
- Parallelizing the different computations
- Sharing intermediate results
- Optimizing the overall task graph

### 3. Don't Build Excessively Large Task Graphs

**Symptoms**:
- Millions of tasks in a single computation
- Severe scheduling overhead
- Long delays before computation starts

**Solutions**:
- Increase chunk sizes to reduce the number of tasks
- Use `map_partitions` or `map_blocks` to fuse operations
- Break computations into smaller pieces with intermediate persists
- Consider whether the problem truly requires distributed computing

**Example Using map_partitions**:
```python
# Instead of applying the function row by row
ddf['result'] = ddf.apply(complex_function, axis=1)  # Many tasks

# Apply to entire partitions at once
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))
```

## Infrastructure Considerations

### Scheduler Selection

**Use Threads For**:
- Numeric work with GIL-releasing libraries (NumPy, pandas, scikit-learn)
- Operations that benefit from shared memory
- Single-machine workloads with array/dataframe operations

**Use Processes For**:
- Text processing and Python collection operations
- Pure Python code that is GIL-bound
- Operations that need process isolation

**Use the Distributed Scheduler For**:
- Multi-machine clusters
- Access to the diagnostic dashboard
- Asynchronous APIs
- Better data locality handling
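
A minimal sketch of switching schedulers; the scheduler can be chosen per `compute()` call or set globally:

```python
import dask
import dask.bag as db

bag = db.from_sequence(range(1000)).map(lambda x: x ** 2)

# Per-call selection: processes suit GIL-bound pure-Python work
result = bag.compute(scheduler='processes')

# Global selection for everything that follows: threads suit NumPy/pandas-style numeric work
dask.config.set(scheduler='threads')
```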

### Thread Configuration

**Recommendation**: Aim for roughly 4 threads per process on numeric workloads.

**Rationale**:
- Balances parallelism against overhead
- Allows efficient use of CPU cores
- Reduces context-switching costs
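
On a single machine this layout can be expressed when creating the client; the worker count, thread count, and memory limit below are illustrative values for an 8-core machine:

```python
from dask.distributed import Client

# 2 worker processes x 4 threads each = 8 threads; memory_limit applies per worker
client = Client(n_workers=2, threads_per_worker=4, memory_limit='16GB')
```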

### Memory Management

**Persist Strategically**:
```python
# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()
```

**Clear Memory When Done**:
```python
# Explicitly delete large persisted objects; once no references remain,
# the scheduler can release the memory held on the workers
del intermediate
```

## Data Loading Best Practices

### Use Appropriate File Formats

**For Tabular Data**:
- Parquet: columnar, compressed, fast filtering
- CSV: only for small data or initial ingestion

**For Array Data**:
- HDF5: good for numeric arrays
- Zarr: cloud-native, parallel-friendly
- NetCDF: scientific data with metadata
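
A common one-time step is converting CSV input into Parquet so that later reads are faster; a minimal sketch with illustrative paths:

```python
import dask.dataframe as dd

# One-time conversion: ingest CSV, write Parquet for all later analysis
ddf = dd.read_csv('raw/*.csv')
ddf.to_parquet('data.parquet')
```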

### Optimize Data Ingestion

**Read Multiple Files Efficiently**:
```python
# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')
```

**Specify Useful Columns Early**:
```python
# Only read the columns you need
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])
```

## Common Patterns and Solutions

### Pattern: Embarrassingly Parallel Problems

For independent computations, use futures:
```python
from dask.distributed import Client

client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)
```

### Pattern: Data Preprocessing Pipeline

Use Bags for initial ETL, then convert to structured formats:
```python
import json

import dask.bag as db

# Process raw JSON logs
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')

# Convert to a DataFrame for analysis
ddf = bag.to_dataframe()
```

### Pattern: Iterative Algorithms

Persist data between iterations:
```python
data = dd.read_parquet('data.parquet')
data = data.persist()  # Keep in memory across iterations

for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist the updated version
```

## Debugging Tips

### Use Single-Threaded Scheduler

For debugging with pdb or detailed error inspection:
```python
import dask

dask.config.set(scheduler='synchronous')
result = computation.compute()  # Runs in a single thread for easy debugging
```

### Check Task Graph Size

Before computing, check the number of tasks:
```python
print(len(ddf.__dask_graph__()))  # Should be reasonable, not millions
```

### Validate on Small Data First

Test the logic on a small subset before scaling:
```python
# head() pulls a small pandas DataFrame from the first partition
sample = ddf.head(1000)

# Validate results on the sample, then run the same logic on the full dataset
```

## Performance Troubleshooting

### Symptom: Slow Computation Start

**Likely Cause**: The task graph is too large.

**Solution**: Increase chunk sizes or use `map_partitions` (see the sketch below).
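
A minimal sketch of consolidating work into fewer, larger partitions; the target sizes are illustrative:

```python
# Fewer, larger partitions means fewer tasks for the scheduler to manage
ddf = ddf.repartition(partition_size='256MB')

# For arrays, let Dask pick reasonable chunk sizes
x = x.rechunk('auto')
```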

### Symptom: Memory Errors

**Likely Causes**:
- Chunks too large
- Too many intermediate results
- Memory leaks in user functions

**Solutions** (see the sketch below):
- Decrease chunk sizes
- Use persist() strategically and delete results when done
- Profile user functions for memory issues
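
Chunk size can often be reduced at read time; `blocksize` is a `read_csv` option, while the file name and target size below are illustrative:

```python
import dask.dataframe as dd

# Smaller partitions at ingestion keep per-task memory low
ddf = dd.read_csv('large_file.csv', blocksize='64MB')
```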

### Symptom: Poor Parallelization

**Likely Causes**:
- Data dependencies preventing parallelism
- Chunks too large (not enough tasks)
- GIL contention when using threads on pure-Python code

**Solutions** (see the sketch below):
- Restructure the computation to reduce dependencies
- Increase the number of partitions
- Switch to the multiprocessing scheduler for pure-Python code
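
A minimal sketch of the last two points; the partition count is illustrative:

```python
# More partitions expose more parallelism
ddf = ddf.repartition(npartitions=32)

# Pure-Python, GIL-bound work often runs better on the multiprocessing scheduler
result = computation.compute(scheduler='processes')
```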