# Dask Best Practices

## Performance Optimization Principles

### Start with Simpler Solutions First

Before implementing parallel computing with Dask, explore these alternatives:
- Better algorithms for the specific problem
- Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
- Compiled code via Numba or Cython
- Data sampling for development and testing

These alternatives often provide better returns than distributed systems and should be exhausted before scaling out to parallel computing.

### Chunk Size Strategy

**Critical Rule**: Chunks should be small enough that many of them fit in a worker's available memory at once.

**Recommended Target**: Size chunks so that each worker can hold roughly 10 chunks per core without exceeding its available memory.

**Why It Matters**:
- Chunks too large: memory overflow and poor parallelization
- Chunks too small: excessive scheduling overhead

**Example Calculation** (see the sketch below):
- 8 cores with 32 GB RAM
- Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
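
A minimal sketch of what that target looks like for a Dask array. The array and chunk shapes are illustrative, assuming float64 data (8 bytes per element), where a 7,000 × 7,000 chunk is roughly 400 MB:

```python
import dask.array as da

# Assumption: 8 cores, 32 GB RAM -> aim for ~400 MB per chunk.
# float64 elements are 8 bytes, so 7_000 * 7_000 * 8 bytes ~= 392 MB per chunk.
x = da.random.random((70_000, 70_000), chunks=(7_000, 7_000))

print(x.chunksize)   # (7000, 7000) -- shape of each chunk
print(x.numblocks)   # (10, 10) -- 100 chunks across the array
```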

### Monitor with the Dashboard

The Dask dashboard provides essential visibility into:
- Worker states and resource utilization
- Task progress and bottlenecks
- Memory usage patterns
- Performance characteristics

Access the dashboard to understand what is actually slow in a parallel workload rather than guessing at optimizations.
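
A minimal way to get the dashboard is to start the distributed scheduler locally; the exact address varies, but `Client.dashboard_link` reports it:

```python
from dask.distributed import Client

# Starting a local client also starts the diagnostic dashboard
client = Client()
print(client.dashboard_link)  # typically something like http://127.0.0.1:8787/status
```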

## Critical Pitfalls to Avoid

### 1. Don't Create Large Objects Locally Before Dask

**Wrong Approach**:
```python
import pandas as pd
import dask.dataframe as dd

# Loads the entire dataset into local memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)
```

**Correct Approach**:
```python
import dask.dataframe as dd

# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')
```

**Why**: Loading data with pandas or NumPy first forces the scheduler to serialize those objects and embed them in the task graph, defeating the purpose of parallel computing.

**Key Principle**: Load data with Dask's own readers and keep the results as Dask collections until you actually need them locally.
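
If some data genuinely must originate in local memory, one option with the distributed scheduler is to scatter it to the workers once and pass the resulting future around, rather than embedding the object in every task. A minimal sketch; `load_small_lookup_table`, `process_record`, and `records` are hypothetical placeholders:

```python
from dask.distributed import Client

client = Client()

lookup = load_small_lookup_table()  # hypothetical local object reused by many tasks
lookup_future = client.scatter(lookup, broadcast=True)  # ship it to the workers once

# Futures passed as arguments resolve to the scattered data on the workers
futures = [client.submit(process_record, rec, lookup_future) for rec in records]
results = client.gather(futures)
```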

### 2. Avoid Repeated compute() Calls

**Wrong Approach**:
```python
results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is a separate, blocking run
    results.append(result)
```

**Correct Approach**:
```python
import dask

computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)  # Single compute for all
```

**Why**: Calling compute in a loop prevents Dask from:
- Parallelizing the different computations
- Sharing intermediate results
- Optimizing the overall task graph

### 3. Don't Build Excessively Large Task Graphs

**Symptoms**:
- Millions of tasks in a single computation
- Severe scheduling overhead
- Long delays before computation starts

**Solutions**:
- Increase chunk sizes to reduce the number of tasks
- Use `map_partitions` or `map_blocks` to fuse operations
- Break computations into smaller pieces with intermediate persists
- Consider whether the problem truly requires distributed computing

**Example Using map_partitions**:
```python
# Instead of applying the function row by row
ddf['result'] = ddf.apply(complex_function, axis=1)  # Many tasks

# Apply to entire partitions at once
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))
```

## Infrastructure Considerations

### Scheduler Selection

**Use Threads For**:
- Numeric work with GIL-releasing libraries (NumPy, pandas, scikit-learn)
- Operations that benefit from shared memory
- Single-machine workloads with array/dataframe operations

**Use Processes For**:
- Text processing and Python collection operations
- Pure Python code that is GIL-bound
- Operations that need process isolation

**Use the Distributed Scheduler For**:
- Multi-machine clusters
- Access to the diagnostic dashboard
- Asynchronous APIs
- Better data locality handling
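
A minimal sketch of switching schedulers; the scheduler can be chosen per `compute()` call or set globally:

```python
import dask
import dask.bag as db

bag = db.from_sequence(range(1000)).map(lambda x: x ** 2)

# Per-call selection: processes suit GIL-bound pure-Python work
result = bag.compute(scheduler='processes')

# Global selection for everything that follows: threads suit NumPy/pandas-style numeric work
dask.config.set(scheduler='threads')
```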

### Thread Configuration

**Recommendation**: Aim for roughly 4 threads per process on numeric workloads.

**Rationale**:
- Balances parallelism against overhead
- Allows efficient use of CPU cores
- Reduces context-switching costs
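
On a single machine this layout can be expressed when creating the client; the worker count, thread count, and memory limit below are illustrative values for an 8-core machine:

```python
from dask.distributed import Client

# 2 worker processes x 4 threads each = 8 threads; memory_limit applies per worker
client = Client(n_workers=2, threads_per_worker=4, memory_limit='16GB')
```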

### Memory Management

**Persist Strategically**:
```python
# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()
```

**Clear Memory When Done**:
```python
# Explicitly delete large persisted objects; once no references remain,
# the scheduler can release the memory held on the workers
del intermediate
```

## Data Loading Best Practices

### Use Appropriate File Formats

**For Tabular Data**:
- Parquet: columnar, compressed, fast filtering
- CSV: only for small data or initial ingestion

**For Array Data**:
- HDF5: good for numeric arrays
- Zarr: cloud-native, parallel-friendly
- NetCDF: scientific data with metadata
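
A common one-time step is converting CSV input into Parquet so that later reads are faster; a minimal sketch with illustrative paths:

```python
import dask.dataframe as dd

# One-time conversion: ingest CSV, write Parquet for all later analysis
ddf = dd.read_csv('raw/*.csv')
ddf.to_parquet('data.parquet')
```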

### Optimize Data Ingestion

**Read Multiple Files Efficiently**:
```python
# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')
```

**Specify Useful Columns Early**:
```python
# Only read the columns you need
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])
```

## Common Patterns and Solutions

### Pattern: Embarrassingly Parallel Problems

For independent computations, use futures:
```python
from dask.distributed import Client

client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)
```

### Pattern: Data Preprocessing Pipeline

Use Bags for initial ETL, then convert to structured formats:
```python
import json

import dask.bag as db

# Process raw JSON logs
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')

# Convert to a DataFrame for analysis
ddf = bag.to_dataframe()
```

### Pattern: Iterative Algorithms

Persist data between iterations:
```python
data = dd.read_parquet('data.parquet')
data = data.persist()  # Keep in memory across iterations

for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist the updated version
```

## Debugging Tips

### Use Single-Threaded Scheduler

For debugging with pdb or detailed error inspection:
```python
import dask

dask.config.set(scheduler='synchronous')
result = computation.compute()  # Runs in a single thread for easy debugging
```

### Check Task Graph Size

Before computing, check the number of tasks:
```python
print(len(ddf.__dask_graph__()))  # Should be reasonable, not millions
```

### Validate on Small Data First

Test the logic on a small subset before scaling:
```python
# head() pulls a small pandas DataFrame from the first partition
sample = ddf.head(1000)

# Validate results on the sample, then run the same logic on the full dataset
```

## Performance Troubleshooting

### Symptom: Slow Computation Start

**Likely Cause**: The task graph is too large.

**Solution**: Increase chunk sizes or use `map_partitions` (see the sketch below).
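
A minimal sketch of consolidating work into fewer, larger partitions; the target sizes are illustrative:

```python
# Fewer, larger partitions means fewer tasks for the scheduler to manage
ddf = ddf.repartition(partition_size='256MB')

# For arrays, let Dask pick reasonable chunk sizes
x = x.rechunk('auto')
```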

### Symptom: Memory Errors

**Likely Causes**:
- Chunks too large
- Too many intermediate results
- Memory leaks in user functions

**Solutions** (see the sketch below):
- Decrease chunk sizes
- Use persist() strategically and delete results when done
- Profile user functions for memory issues
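
Chunk size can often be reduced at read time; `blocksize` is a `read_csv` option, while the file name and target size below are illustrative:

```python
import dask.dataframe as dd

# Smaller partitions at ingestion keep per-task memory low
ddf = dd.read_csv('large_file.csv', blocksize='64MB')
```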

### Symptom: Poor Parallelization

**Likely Causes**:
- Data dependencies preventing parallelism
- Chunks too large (not enough tasks)
- GIL contention when using threads on pure-Python code

**Solutions** (see the sketch below):
- Restructure the computation to reduce dependencies
- Increase the number of partitions
- Switch to the multiprocessing scheduler for pure-Python code
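
A minimal sketch of the last two points; the partition count is illustrative:

```python
# More partitions expose more parallelism
ddf = ddf.repartition(npartitions=32)

# Pure-Python, GIL-bound work often runs better on the multiprocessing scheduler
result = computation.compute(scheduler='processes')
```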