Dask Best Practices

Performance Optimization Principles

Start with Simpler Solutions First

Before implementing parallel computing with Dask, explore these alternatives:

  • Better algorithms for the specific problem
  • Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
  • Compiled code via Numba or Cython
  • Data sampling for development and testing

These alternatives often provide better returns than distributed systems and should be exhausted before scaling to parallel computing.
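
For example, a one-time conversion from CSV to Parquet often pays for itself immediately. A minimal sketch, with placeholder file paths:

import dask.dataframe as dd

# One-time conversion: read the CSV lazily and write a Parquet copy
# ('measurements.csv' and 'measurements.parquet' are placeholder paths)
ddf = dd.read_csv('measurements.csv')
ddf.to_parquet('measurements.parquet')

# Later runs read the compressed, columnar copy instead
ddf = dd.read_parquet('measurements.parquet')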

Chunk Size Strategy

Critical Rule: Chunks should be small enough that many fit in a worker's available memory at once.

Recommended Target: Size chunks so workers can hold 10 chunks per core without exceeding available memory.

Why It Matters:

  • Too large chunks: Memory overflow and inefficient parallelization
  • Too small chunks: Excessive scheduling overhead

Example Calculation:

  • 8 cores with 32 GB RAM
  • Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
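
A rough sketch of applying that target in code (the path is a placeholder; exact sizes depend on your data and workload):

import dask
import dask.dataframe as dd

# For arrays, set the target chunk size used when chunks='auto'
dask.config.set({"array.chunk-size": "400MiB"})

# For CSV ingestion, blocksize controls the bytes read per partition
ddf = dd.read_csv('large_file.csv', blocksize='400MB')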

Monitor with the Dashboard

The Dask dashboard provides essential visibility into:

  • Worker states and resource utilization
  • Task progress and bottlenecks
  • Memory usage patterns
  • Performance characteristics

Access the dashboard to understand what's actually slow in parallel workloads rather than guessing at optimizations.
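
With the distributed scheduler, the dashboard address is exposed on the client; a minimal sketch:

from dask.distributed import Client

client = Client()             # local cluster; dashboard starts automatically
print(client.dashboard_link)  # usually http://127.0.0.1:8787/status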

Critical Pitfalls to Avoid

1. Don't Create Large Objects Locally Before Dask

Wrong Approach:

import pandas as pd
import dask.dataframe as dd

# Loads entire dataset into memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)

Correct Approach:

import dask.dataframe as dd

# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')

Why: Loading data with pandas or NumPy first forces the scheduler to serialize and embed those objects in task graphs, defeating the purpose of parallel computing.

Key Principle: Load data with Dask methods and keep the results in Dask collections until you need a small, final result.

2. Avoid Repeated compute() Calls

Wrong Approach:

results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is separate
    results.append(result)

Correct Approach:

import dask

computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)  # Single compute for all

Why: Calling compute in loops prevents Dask from:

  • Parallelizing different computations
  • Sharing intermediate results
  • Optimizing the overall task graph

3. Don't Build Excessively Large Task Graphs

Symptoms:

  • Millions of tasks in a single computation
  • Severe scheduling overhead
  • Long delays before computation starts

Solutions:

  • Increase chunk sizes to reduce number of tasks
  • Use map_partitions or map_blocks to fuse operations
  • Break computations into smaller pieces with intermediate persists
  • Consider whether the problem truly requires distributed computing

Example Using map_partitions:

# Instead of applying function to each row
ddf['result'] = ddf.apply(complex_function, axis=1)  # Many tasks

# Apply to entire partitions at once
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))
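
The array counterpart is map_blocks, which fuses several elementwise steps into a single task per block; a minimal sketch:

import numpy as np
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Both steps run inside one task per block
def fused(block):
    return np.log1p(block) * 2.0

y = x.map_blocks(fused)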

Infrastructure Considerations

Scheduler Selection

Use Threads For:

  • Numeric work with GIL-releasing libraries (NumPy, Pandas, scikit-learn)
  • Operations that benefit from shared memory
  • Single-machine workloads with array/dataframe operations

Use Processes For:

  • Text processing and Python collection operations
  • Pure Python code that's GIL-bound
  • Operations that need process isolation

Use Distributed Scheduler For:

  • Multi-machine clusters
  • Need for diagnostic dashboard
  • Asynchronous APIs
  • Better data locality handling
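
Scheduler choice can be made per call or set globally; a minimal sketch (the Parquet path and column name are placeholders):

import dask
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.read_parquet('data.parquet')

# Per-call selection
total = ddf['col1'].sum().compute(scheduler='threads')    # thread pool
total = ddf['col1'].sum().compute(scheduler='processes')  # process pool

# Global default
dask.config.set(scheduler='threads')

# Creating a Client switches later compute() calls to the distributed scheduler
client = Client()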

Thread Configuration

Recommendation: Aim for roughly 4 threads per process on numeric workloads.

Rationale:

  • Balance between parallelism and overhead
  • Allows efficient use of CPU cores
  • Reduces context switching costs
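
As a sketch, on the 8-core, 32 GB machine from the chunk-size example this might look like two worker processes with four threads each:

from dask.distributed import Client

# 2 processes x 4 threads = 8 cores; each worker capped at half the RAM
client = Client(n_workers=2, threads_per_worker=4, memory_limit='16GB')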

Memory Management

Persist Strategically:

# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()

Clear Memory When Done:

# Explicitly delete large objects
del intermediate

Data Loading Best Practices

Use Appropriate File Formats

For Tabular Data:

  • Parquet: Columnar, compressed, fast filtering
  • CSV: Only for small data or initial ingestion

For Array Data:

  • HDF5: Good for numeric arrays
  • Zarr: Cloud-native, parallel-friendly
  • NetCDF: Scientific data with metadata
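
For example, Zarr stores round-trip cleanly through dask.array (a minimal sketch, assuming the zarr package is installed; 'array.zarr' is a placeholder path):

import dask.array as da

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

da.to_zarr(x, 'array.zarr')     # chunks are written in parallel
y = da.from_zarr('array.zarr')  # lazy, chunked read back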

Optimize Data Ingestion

Read Multiple Files Efficiently:

# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')

Specify Useful Columns Early:

# Only read needed columns
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])

Common Patterns and Solutions

Pattern: Embarrassingly Parallel Problems

For independent computations, use Futures:

from dask.distributed import Client

client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)

Pattern: Data Preprocessing Pipeline

Use Bags for initial ETL, then convert to structured formats:

import json

import dask.bag as db

# Process raw JSON
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')

# Convert to DataFrame for analysis
ddf = bag.to_dataframe()

Pattern: Iterative Algorithms

Persist data between iterations:

data = dd.read_parquet('data.parquet')
data = data.persist()  # Keep in memory across iterations

for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist updated version

Debugging Tips

Use Single-Threaded Scheduler

For debugging with pdb or detailed error inspection:

import dask

dask.config.set(scheduler='synchronous')
result = computation.compute()  # Runs in single thread for debugging

Check Task Graph Size

Before computing, check the number of tasks:

print(len(ddf.__dask_graph__()))  # Should be reasonable, not millions

Validate on Small Data First

Test logic on small subset before scaling:

# Pull a small sample into pandas and inspect it
sample = ddf.head(1000)
print(sample.dtypes)
print(sample.describe())

# Once the logic looks right, run the same pipeline on the full dataset

Performance Troubleshooting

Symptom: Slow Computation Start

Likely Cause: Task graph is too large

Solution: Increase chunk sizes or use map_partitions

Symptom: Memory Errors

Likely Causes:

  • Chunks too large
  • Too many intermediate results
  • Memory leaks in user functions

Solutions:

  • Decrease chunk sizes
  • Use persist() strategically and delete when done
  • Profile user functions for memory issues

Symptom: Poor Parallelization

Likely Causes:

  • Data dependencies preventing parallelism
  • Chunks too large (not enough tasks)
  • GIL contention with threads on Python code

Solutions:

  • Restructure computation to reduce dependencies
  • Increase number of partitions
  • Switch to multiprocessing scheduler for Python code