# Dask Best Practices
## Performance Optimization Principles
### Start with Simpler Solutions First
Before implementing parallel computing with Dask, explore these alternatives:
- Better algorithms for the specific problem
- Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
- Compiled code via Numba or Cython
- Data sampling for development and testing
These alternatives often provide better returns than distributed systems and should be exhausted before scaling to parallel computing.
### Chunk Size Strategy
**Critical Rule**: Chunks should be small enough that many fit in a worker's available memory at once.
**Recommended Target**: Size chunks so workers can hold 10 chunks per core without exceeding available memory.
**Why It Matters**:
- Chunks too large: workers run out of memory, and there are too few tasks to keep every core busy
- Chunks too small: scheduling overhead dominates the actual work
**Example Calculation**:
- 8 cores with 32 GB RAM
- Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
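The arithmetic above can be written as a small helper. This is a minimal sketch: `target_chunk_bytes` is a hypothetical name, not a Dask function, and it uses decimal gigabytes to match the round numbers in the example.

```python
def target_chunk_bytes(total_memory_bytes, n_cores, chunks_per_core=10):
    """Return a rough upper bound on chunk size, in bytes."""
    memory_per_core = total_memory_bytes / n_cores
    return memory_per_core / chunks_per_core


GB = 10**9  # decimal gigabyte, chosen to keep the numbers round
chunk = target_chunk_bytes(32 * GB, n_cores=8)
print(f"{chunk / 10**6:.0f} MB per chunk")  # prints "400 MB per chunk"
```

Treat the result as a ceiling rather than a goal: intermediate results also need memory, so real chunk sizes are often chosen well below it.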
### Monitor with the Dashboard
The Dask dashboard provides essential visibility into:
- Worker states and resource utilization
- Task progress and bottlenecks
- Memory usage patterns
- Performance characteristics
Access the dashboard to understand what's actually slow in parallel workloads rather than guessing at optimizations.
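One way to find the dashboard address is to ask the distributed client for it. The sketch below assumes `dask.distributed` is installed; the `processes=False` and `dashboard_address=":0"` settings are choices made for this demo only (a single in-process cluster on any free port), not recommendations.

```python
from dask.distributed import Client

# processes=False keeps scheduler and workers inside this process (demo only).
# dashboard_address=":0" asks for any free port instead of the default 8787.
client = Client(processes=False, dashboard_address=":0")

# URL to open in a browser while a computation is running.
dashboard_url = client.dashboard_link
print(dashboard_url)

client.close()
```

The full set of dashboard plots is only rendered when the optional `bokeh` package is installed.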
## Critical Pitfalls to Avoid
### 1. Don't Create Large Objects Locally Before Dask
**Wrong Approach**:
```python
import pandas as pd
import dask.dataframe as dd
# Loads entire dataset into memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)
```
**Correct Approach**:
```python
import dask.dataframe as dd
# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')
```
**Why**: Reading the data with pandas or NumPy first pulls the entire dataset into local memory, and Dask then has to serialize that object and embed it in the task graph, which defeats the purpose of parallel loading.
**Key Principle**: Load data with Dask functions, and keep results as Dask collections until they are small enough to bring back locally.
### 2. Avoid Repeated compute() Calls
**Wrong Approach**:
```python
results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is separate
    results.append(result)
```
**Correct Approach**:
```python
import dask
computations = [dask_computation(item) for item in items]  # Build lazy work first
results = dask.compute(*computations)  # Single compute for all
```
**Why**: Calling compute in loops prevents Dask from:
- Parallelizing different computations
- Sharing intermediate results
- Optimizing the overall task graph
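The sharing of intermediate results can be observed directly. In this minimal sketch, `load`, `total`, and `largest` are hypothetical stand-ins for real work, and the synchronous scheduler is used only to keep the demo deterministic.

```python
import dask
from dask import delayed

load_calls = []  # records every execution of the shared step


@delayed
def load():
    load_calls.append(1)
    return list(range(10))


@delayed
def total(values):
    return sum(values)


@delayed
def largest(values):
    return max(values)


data = load()  # one lazy task, reused by both results below

# A single compute call merges both graphs, so load() runs only once.
sum_result, max_result = dask.compute(
    total(data), largest(data), scheduler="synchronous"
)
print(sum_result, max_result, len(load_calls))
```

Calling `.compute()` on `total(data)` and `largest(data)` separately would run `load()` twice, once per call.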
### 3. Don't Build Excessively Large Task Graphs
**Symptoms**:
- Millions of tasks in a single computation
- Severe scheduling overhead
- Long delays before computation starts
**Solutions**:
- Increase chunk sizes to reduce number of tasks
- Use `map_partitions` or `map_blocks` to fuse operations
- Break computations into smaller pieces with intermediate persists
- Consider whether the problem truly requires distributed computing
**Example Using map_partitions**:
```python
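# Row-wise apply: complex_function is called once per row in Python (slow)
ddf['result'] = ddf.apply(complex_function, axis=1)
# Partition-wise: complex_function receives a whole pandas DataFrame,
# so it must be written with vectorized DataFrame operations
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))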
```
## Infrastructure Considerations
### Scheduler Selection
**Use Threads For**:
- Numeric work with GIL-releasing libraries (NumPy, pandas, scikit-learn)
- Operations that benefit from shared memory
- Single-machine workloads with array/dataframe operations
**Use Processes For**:
- Text processing and Python collection operations
- Pure Python code that's GIL-bound
- Operations that need process isolation
**Use Distributed Scheduler For**:
- Multi-machine clusters
- Workloads that need the diagnostic dashboard
- Asynchronous APIs
- Better data locality handling
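For the single-machine schedulers, the choice can be made per call or for a whole block of code. The sketch below uses a toy `square` function as a stand-in for real work.

```python
import dask
from dask import delayed


@delayed
def square(x):
    return x * x


total = delayed(sum)([square(i) for i in range(5)])

# Per-call choice: pass scheduler= to compute().
threaded = total.compute(scheduler="threads")

# Scoped choice: the setting applies to every compute inside the block.
with dask.config.set(scheduler="synchronous"):
    synchronous = total.compute()

# scheduler="processes" selects the multiprocessing scheduler the same way;
# on some platforms it has to run under `if __name__ == "__main__":`.
print(threaded, synchronous)
```

The result is the same under every scheduler; only how the tasks are executed changes.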
### Thread Configuration
**Recommendation**: Aim for roughly 4 threads per process on numeric workloads.
**Rationale**:
- Balance between parallelism and overhead
- Allows efficient use of CPU cores
- Reduces context switching costs
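As a rough illustration of the rule of thumb, the helper below splits a core count into processes of about four threads each. `worker_layout` is a hypothetical name for this sketch, not part of Dask.

```python
def worker_layout(n_cores, threads_per_worker=4):
    """Split n_cores into processes of about threads_per_worker threads each."""
    n_workers = max(1, n_cores // threads_per_worker)
    threads = max(1, n_cores // n_workers)
    return n_workers, threads


n_workers, threads = worker_layout(16)
print(n_workers, threads)  # prints "4 4"

# With dask.distributed installed, these values could be passed on as
# LocalCluster(n_workers=n_workers, threads_per_worker=threads).
```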
### Memory Management
**Persist Strategically**:
```python
# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()
```
**Clear Memory When Done**:
```python
# Drop the last reference so the persisted data can be released
del intermediate
```
## Data Loading Best Practices
### Use Appropriate File Formats
**For Tabular Data**:
- Parquet: Columnar, compressed, fast filtering
- CSV: Only for small data or initial ingestion
**For Array Data**:
- HDF5: Good for numeric arrays
- Zarr: Cloud-native, parallel-friendly
- NetCDF: Scientific data with metadata
### Optimize Data Ingestion
**Read Multiple Files Efficiently**:
```python
# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')
```
**Specify Useful Columns Early**:
```python
# Only read needed columns
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])
```
## Common Patterns and Solutions
### Pattern: Embarrassingly Parallel Problems
For independent computations, use Futures:
```python
from dask.distributed import Client
client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)
```
### Pattern: Data Preprocessing Pipeline
Use Bags for initial ETL, then convert to structured formats:
```python
import json
import dask.bag as db
# Process raw JSON
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')
# Convert to DataFrame for analysis
ddf = bag.to_dataframe()
```
### Pattern: Iterative Algorithms
Persist data between iterations:
```python
import dask.dataframe as dd

data = dd.read_parquet('data.parquet')
data = data.persist()  # Keep in memory across iterations
for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist updated version
```
## Debugging Tips
### Use Single-Threaded Scheduler
For debugging with pdb or detailed error inspection:
```python
import dask
dask.config.set(scheduler='synchronous')
result = computation.compute() # Runs in single thread for debugging
```
### Check Task Graph Size
Before computing, check the number of tasks:
```python
print(len(ddf.__dask_graph__())) # Should be reasonable, not millions
```
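The same check shows how chunk size drives graph size. This sketch uses a Dask array rather than a DataFrame; the array shape and chunk sizes are arbitrary values picked for illustration.

```python
import dask.array as da

# Same 4000 x 4000 array, split two different ways.
fine = da.ones((4000, 4000), chunks=(100, 100))      # 40 x 40 = 1600 blocks
coarse = da.ones((4000, 4000), chunks=(1000, 1000))  # 4 x 4 = 16 blocks

# Count the tasks needed for the same computation on each layout.
fine_tasks = len((fine + 1).sum().__dask_graph__())
coarse_tasks = len((coarse + 1).sum().__dask_graph__())

print(fine.npartitions, fine_tasks)
print(coarse.npartitions, coarse_tasks)
```

The exact task counts depend on the Dask version, but the finely chunked layout needs far more tasks for identical work.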
### Validate on Small Data First
Test logic on a small subset before scaling:
```python
# head() returns a small pandas DataFrame, read from the first partition
sample = ddf.head(1000)
# Validate results
# Then scale to full dataset
```
## Performance Troubleshooting
### Symptom: Slow Computation Start
**Likely Cause**: Task graph is too large
**Solution**: Increase chunk sizes or use `map_partitions`
### Symptom: Memory Errors
**Likely Causes**:
- Chunks too large
- Too many intermediate results
- Memory leaks in user functions
**Solutions**:
- Decrease chunk sizes
- Use persist() strategically and delete when done
- Profile user functions for memory issues
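To act on the first solution, it helps to measure chunk size before and after rechunking. In this sketch `chunk_megabytes` is a hypothetical helper, and the array shape and chunk sizes are illustrative.

```python
import dask.array as da

x = da.zeros((2000, 2000), chunks=(1000, 1000), dtype="float64")


def chunk_megabytes(arr):
    # Size of the largest chunk: elements per chunk times bytes per element.
    n_elements = 1
    for dim in arr.chunksize:
        n_elements *= dim
    return n_elements * arr.dtype.itemsize / 1e6


before = chunk_megabytes(x)        # 1000 * 1000 * 8 bytes = 8 MB
smaller = x.rechunk((500, 500))    # four times as many, smaller chunks
after = chunk_megabytes(smaller)   # 500 * 500 * 8 bytes = 2 MB
print(before, after)
```

Smaller chunks lower peak memory per task at the cost of more tasks, so the change should be checked against scheduling overhead.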
### Symptom: Poor Parallelization
**Likely Causes**:
- Data dependencies preventing parallelism
- Chunks too large (not enough tasks)
- GIL contention with threads on Python code
**Solutions**:
- Restructure computation to reduce dependencies
- Increase number of partitions
- Switch to multiprocessing scheduler for Python code