Dask Best Practices

Performance Optimization Principles

Start with Simpler Solutions First

Before implementing parallel computing with Dask, explore these alternatives:

  • Better algorithms for the specific problem
  • Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
  • Compiled code via Numba or Cython
  • Data sampling for development and testing

These alternatives often provide better returns than distributed systems and should be exhausted before scaling to parallel computing.
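
For example, a one-time conversion from CSV to Parquet often pays for itself immediately. A minimal sketch, with placeholder file paths:

import dask.dataframe as dd

# One-time conversion: read the CSV lazily and write a Parquet copy
# ('measurements.csv' and 'measurements.parquet' are placeholder paths)
ddf = dd.read_csv('measurements.csv')
ddf.to_parquet('measurements.parquet')

# Later runs read the compressed, columnar copy instead
ddf = dd.read_parquet('measurements.parquet')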

Chunk Size Strategy

Critical Rule: Chunks should be small enough that many fit in a worker's available memory at once.

Recommended Target: Size chunks so workers can hold 10 chunks per core without exceeding available memory.

Why It Matters:

  • Too large chunks: Memory overflow and inefficient parallelization
  • Too small chunks: Excessive scheduling overhead

Example Calculation:

  • 8 cores with 32 GB RAM
  • Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
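
A rough sketch of applying that target in code (the path is a placeholder; exact sizes depend on your data and workload):

import dask
import dask.dataframe as dd

# For arrays, set the target chunk size used when chunks='auto'
dask.config.set({"array.chunk-size": "400MiB"})

# For CSV ingestion, blocksize controls the bytes read per partition
ddf = dd.read_csv('large_file.csv', blocksize='400MB')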

Monitor with the Dashboard

The Dask dashboard provides essential visibility into:

  • Worker states and resource utilization
  • Task progress and bottlenecks
  • Memory usage patterns
  • Performance characteristics

Access the dashboard to understand what's actually slow in parallel workloads rather than guessing at optimizations.
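
With the distributed scheduler, the dashboard address is exposed on the client; a minimal sketch:

from dask.distributed import Client

client = Client()             # local cluster; dashboard starts automatically
print(client.dashboard_link)  # usually http://127.0.0.1:8787/status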

Critical Pitfalls to Avoid

1. Don't Create Large Objects Locally Before Dask

Wrong Approach:

import pandas as pd
import dask.dataframe as dd

# Loads entire dataset into memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)

Correct Approach:

import dask.dataframe as dd

# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')

Why: Loading data with pandas or NumPy first forces the scheduler to serialize and embed those objects in task graphs, defeating the purpose of parallel computing.

Key Principle: Load data with Dask methods and keep the results in Dask collections until you need a small, final result.

2. Avoid Repeated compute() Calls

Wrong Approach:

results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is separate
    results.append(result)

Correct Approach:

import dask

computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)  # Single compute for all

Why: Calling compute in loops prevents Dask from:

  • Parallelizing different computations
  • Sharing intermediate results
  • Optimizing the overall task graph

3. Don't Build Excessively Large Task Graphs

Symptoms:

  • Millions of tasks in a single computation
  • Severe scheduling overhead
  • Long delays before computation starts

Solutions:

  • Increase chunk sizes to reduce number of tasks
  • Use map_partitions or map_blocks to fuse operations
  • Break computations into smaller pieces with intermediate persists
  • Consider whether the problem truly requires distributed computing

Example Using map_partitions:

# Instead of applying function to each row
ddf['result'] = ddf.apply(complex_function, axis=1)  # Many tasks

# Apply to entire partitions at once
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))
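
The array counterpart is map_blocks, which fuses several elementwise steps into a single task per block; a minimal sketch:

import numpy as np
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Both steps run inside one task per block
def fused(block):
    return np.log1p(block) * 2.0

y = x.map_blocks(fused)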

Infrastructure Considerations

Scheduler Selection

Use Threads For:

  • Numeric work with GIL-releasing libraries (NumPy, Pandas, scikit-learn)
  • Operations that benefit from shared memory
  • Single-machine workloads with array/dataframe operations

Use Processes For:

  • Text processing and Python collection operations
  • Pure Python code that's GIL-bound
  • Operations that need process isolation

Use Distributed Scheduler For:

  • Multi-machine clusters
  • Need for diagnostic dashboard
  • Asynchronous APIs
  • Better data locality handling
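
Scheduler choice can be made per call or set globally; a minimal sketch (the Parquet path and column name are placeholders):

import dask
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.read_parquet('data.parquet')

# Per-call selection
total = ddf['col1'].sum().compute(scheduler='threads')    # thread pool
total = ddf['col1'].sum().compute(scheduler='processes')  # process pool

# Global default
dask.config.set(scheduler='threads')

# Creating a Client switches later compute() calls to the distributed scheduler
client = Client()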

Thread Configuration

Recommendation: Aim for roughly 4 threads per process on numeric workloads.

Rationale:

  • Balance between parallelism and overhead
  • Allows efficient use of CPU cores
  • Reduces context switching costs
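
As a sketch, on the 8-core, 32 GB machine from the chunk-size example this might look like two worker processes with four threads each:

from dask.distributed import Client

# 2 processes x 4 threads = 8 cores; each worker capped at half the RAM
client = Client(n_workers=2, threads_per_worker=4, memory_limit='16GB')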

Memory Management

Persist Strategically:

# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()

Clear Memory When Done:

# Explicitly delete large objects
del intermediate

Data Loading Best Practices

Use Appropriate File Formats

For Tabular Data:

  • Parquet: Columnar, compressed, fast filtering
  • CSV: Only for small data or initial ingestion

For Array Data:

  • HDF5: Good for numeric arrays
  • Zarr: Cloud-native, parallel-friendly
  • NetCDF: Scientific data with metadata
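
For example, Zarr stores round-trip cleanly through dask.array (a minimal sketch, assuming the zarr package is installed; 'array.zarr' is a placeholder path):

import dask.array as da

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

da.to_zarr(x, 'array.zarr')     # chunks are written in parallel
y = da.from_zarr('array.zarr')  # lazy, chunked read back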

Optimize Data Ingestion

Read Multiple Files Efficiently:

# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')

Specify Useful Columns Early:

# Only read needed columns
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])

Common Patterns and Solutions

Pattern: Embarrassingly Parallel Problems

For independent computations, use Futures:

from dask.distributed import Client

client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)

Pattern: Data Preprocessing Pipeline

Use Bags for initial ETL, then convert to structured formats:

import json

import dask.bag as db

# Process raw JSON
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')

# Convert to DataFrame for analysis
ddf = bag.to_dataframe()

Pattern: Iterative Algorithms

Persist data between iterations:

data = dd.read_parquet('data.parquet')
data = data.persist()  # Keep in memory across iterations

for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist updated version

Debugging Tips

Use Single-Threaded Scheduler

For debugging with pdb or detailed error inspection:

import dask

dask.config.set(scheduler='synchronous')
result = computation.compute()  # Runs in single thread for debugging

Check Task Graph Size

Before computing, check the number of tasks:

print(len(ddf.__dask_graph__()))  # Should be reasonable, not millions

Validate on Small Data First

Test logic on small subset before scaling:

# Pull a small sample into pandas and inspect it
sample = ddf.head(1000)
print(sample.dtypes)
print(sample.describe())

# Once the logic looks right, run the same pipeline on the full dataset

Performance Troubleshooting

Symptom: Slow Computation Start

Likely Cause: Task graph is too large

Solution: Increase chunk sizes or use map_partitions

Symptom: Memory Errors

Likely Causes:

  • Chunks too large
  • Too many intermediate results
  • Memory leaks in user functions

Solutions:

  • Decrease chunk sizes
  • Use persist() strategically and delete when done
  • Profile user functions for memory issues

Symptom: Poor Parallelization

Likely Causes:

  • Data dependencies preventing parallelism
  • Chunks too large (not enough tasks)
  • GIL contention with threads on Python code

Solutions:

  • Restructure computation to reduce dependencies
  • Increase number of partitions
  • Switch to multiprocessing scheduler for Python code