# Dask Arrays
## Overview
Dask Array implements NumPy's ndarray interface using blocked algorithms. It coordinates many NumPy arrays arranged into a grid to enable computation on datasets larger than available memory, utilizing parallelism across multiple cores.
## Core Concept
A Dask Array is divided into chunks (blocks):
- Each chunk is a regular NumPy array
- Operations are applied to each chunk in parallel
- Results are combined automatically
- Enables out-of-core computation (data larger than RAM)
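For example, a 20,000 x 20,000 array split into 5,000 x 5,000 blocks forms a 4 x 4 grid of NumPy arrays that can be inspected directly (a minimal sketch with synthetic data):
```python
import dask.array as da

# 20,000 x 20,000 array split into 5,000 x 5,000 NumPy blocks
x = da.ones((20000, 20000), chunks=(5000, 5000))

print(x.numblocks)                     # (4, 4) -- the grid of blocks
print(x.chunks)                        # ((5000, 5000, 5000, 5000), (5000, 5000, 5000, 5000))
print(type(x.blocks[0, 0].compute()))  # <class 'numpy.ndarray'>
```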
## Key Capabilities
### What Dask Arrays Support
**Mathematical Operations**:
- Arithmetic operations (+, -, *, /)
- Scalar functions (exponentials, logarithms, trigonometric)
- Element-wise operations
**Reductions**:
- `sum()`, `mean()`, `std()`, `var()`
- Reductions along specified axes
- `min()`, `max()`, `argmin()`, `argmax()`
**Linear Algebra**:
- Tensor contractions
- Dot products and matrix multiplication
- Some decompositions (SVD, QR)
**Data Manipulation**:
- Transposition
- Slicing (standard and fancy indexing)
- Reshaping
- Concatenation and stacking
**Array Protocols**:
- Universal functions (ufuncs)
- NumPy protocols for interoperability
## When to Use Dask Arrays
**Use Dask Arrays When**:
- Arrays exceed available RAM
- Computation can be parallelized across chunks
- Working with NumPy-style numerical operations
- Need to scale NumPy code to larger datasets
**Stick with NumPy When**:
- Arrays fit comfortably in memory
- Operations require global views of data
- Using specialized functions not available in Dask
- Performance is adequate with NumPy alone
## Important Limitations
Dask Arrays intentionally don't implement certain NumPy features:
**Not Implemented**:
- Most `np.linalg` functions (only basic operations available)
- Operations difficult to parallelize (like full sorting)
- Memory-inefficient operations (converting to lists, iterating via loops)
- Many specialized functions (driven by community needs)
**Workarounds**: For unsupported operations, consider using `map_blocks` with custom NumPy code.
## Creating Dask Arrays
### From NumPy Arrays
```python
import dask.array as da
import numpy as np
# Create from NumPy array with specified chunks
x = np.arange(10000)
dx = da.from_array(x, chunks=1000) # Creates 10 chunks of 1000 elements each
```
### Random Arrays
```python
# Create random array with specified chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# Other random functions
x = da.random.normal(10, 0.1, size=(10000, 10000), chunks=(1000, 1000))
```
### Zeros, Ones, and Empty
```python
# Create arrays filled with constants
zeros = da.zeros((10000, 10000), chunks=(1000, 1000))
ones = da.ones((10000, 10000), chunks=(1000, 1000))
empty = da.empty((10000, 10000), chunks=(1000, 1000))
```
### From Functions
```python
import dask
import dask.array as da
import numpy as np

# Build each block lazily, then assemble the grid of blocks into one array
def create_block(block_id):
    return np.random.random((1000, 1000)) * block_id[0]

blocks = [
    [da.from_delayed(dask.delayed(create_block)((i, j)),
                     shape=(1000, 1000), dtype=float)
     for j in range(10)]
    for i in range(10)
]
x = da.block(blocks)  # 10000 x 10000 array
```
### From Disk
```python
# Load from HDF5
import h5py
f = h5py.File('myfile.hdf5', mode='r')
x = da.from_array(f['/data'], chunks=(1000, 1000))
# Load from Zarr
import zarr
z = zarr.open('myfile.zarr', mode='r')
x = da.from_array(z, chunks=(1000, 1000))
```
## Common Operations
### Arithmetic Operations
```python
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = da.random.random((10000, 10000), chunks=(1000, 1000))
# Element-wise operations (lazy)
z = x + y
z = x * y
z = da.exp(x)
z = da.log(y)
# Compute result
result = z.compute()
```
### Reductions
```python
# Reductions along axes
total = x.sum().compute()
mean = x.mean().compute()
std = x.std().compute()
# Reduction along specific axis
row_means = x.mean(axis=1).compute()
col_sums = x.sum(axis=0).compute()
```
### Slicing and Indexing
```python
# Standard slicing (returns Dask Array)
subset = x[1000:5000, 2000:8000]
# Fancy indexing
indices = [0, 5, 10, 15]
selected = x[indices, :]
# Boolean indexing
mask = x > 0.5
filtered = x[mask]
```
### Matrix Operations
```python
# Matrix multiplication
A = da.random.random((10000, 5000), chunks=(1000, 1000))
B = da.random.random((5000, 8000), chunks=(1000, 1000))
C = da.matmul(A, B)
result = C.compute()
# Dot product
dot_product = da.dot(A, B)
# Transpose
AT = A.T
```
### Linear Algebra
```python
import dask

# SVD (Singular Value Decomposition)
U, s, Vt = da.linalg.svd(A)
U_computed, s_computed, Vt_computed = dask.compute(U, s, Vt)
# QR decomposition
Q, R = da.linalg.qr(A)
Q_computed, R_computed = dask.compute(Q, R)
# Note: Only some linalg operations are available
```
### Reshaping and Manipulation
```python
# Reshape
x = da.random.random((10000, 10000), chunks=(1000, 1000))
reshaped = x.reshape(5000, 20000)
# Transpose
transposed = x.T
# Concatenate
x1 = da.random.random((5000, 10000), chunks=(1000, 1000))
x2 = da.random.random((5000, 10000), chunks=(1000, 1000))
combined = da.concatenate([x1, x2], axis=0)
# Stack
stacked = da.stack([x1, x2], axis=0)
```
## Chunking Strategy
Chunking is critical for Dask Array performance.
### Chunk Size Guidelines
**Good Chunk Sizes**:
- Each chunk: ~10-100 MB (compressed)
- ~1 million elements per chunk for numeric data
- Balance between parallelism and overhead
**Example Calculation**:
```python
# For float64 data (8 bytes per element)
# Target 100 MB chunks: 100 MB / 8 bytes = 12.5M elements
# For 2D array (10000, 10000):
x = da.random.random((10000, 10000), chunks=(1000, 1000)) # ~8 MB per chunk
```
### Viewing Chunk Structure
```python
# Check chunks
print(x.chunks) # ((1000, 1000, ...), (1000, 1000, ...))
# Number of chunks
print(x.npartitions)
# Chunk sizes in bytes
print(x.nbytes / x.npartitions)
```
### Rechunking
```python
# Change chunk sizes
x = da.random.random((10000, 10000), chunks=(500, 500))
x_rechunked = x.rechunk((2000, 2000))
# Rechunk specific dimension
x_rechunked = x.rechunk({0: 2000, 1: 'auto'})
```
## Custom Operations with map_blocks
For operations not available in Dask, use `map_blocks`:
```python
import dask.array as da
import numpy as np
def custom_function(block):
    # Apply a custom NumPy operation to each block
    return np.fft.fft2(block)

x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = da.map_blocks(custom_function, x, dtype=complex)  # FFT output is complex
# Compute
output = result.compute()
```
### map_blocks with Different Output Shape
```python
def reduction_function(block):
    # Returns a scalar for each block
    return np.array([block.mean()])

result = da.map_blocks(
    reduction_function,
    x,
    dtype='float64',
    drop_axis=[0, 1],  # Output has no axes from input
    new_axis=0,        # Output has a new axis
    chunks=(1,)        # One element per block
)
```
## Lazy Evaluation and Computation
### Lazy Operations
```python
# All operations are lazy (instant, no computation)
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + 100
z = y.mean(axis=0)
result = z * 2
# Nothing computed yet, just task graph built
```
### Triggering Computation
```python
import dask

# Compute single result
final = result.compute()
# Compute multiple results efficiently
result1, result2 = dask.compute(operation1, operation2)
```
### Persist in Memory
```python
# Keep intermediate results in memory
x_cached = x.persist()
# Reuse cached results
y1 = (x_cached + 10).compute()
y2 = (x_cached * 2).compute()
```
## Saving Results
### To NumPy
```python
# Convert to NumPy (loads all in memory)
numpy_array = dask_array.compute()
```
### To Disk
```python
# Save to HDF5
import h5py
with h5py.File('output.hdf5', mode='w') as f:
    dset = f.create_dataset('/data', shape=x.shape, dtype=x.dtype)
    da.store(x, dset)

# Save to Zarr
import zarr
z = zarr.open('output.zarr', mode='w', shape=x.shape,
              dtype=x.dtype, chunks=x.chunksize)
da.store(x, z)
```
## Performance Considerations
### Efficient Operations
- Element-wise operations: Very efficient
- Reductions with parallelizable operations: Efficient
- Slicing along chunk boundaries: Efficient
- Matrix operations with good chunk alignment: Efficient
### Expensive Operations
- Slicing across many chunks: Requires data movement
- Operations requiring global sorting: Not well supported
- Extremely irregular access patterns: Poor performance
- Operations with poor chunk alignment: Requires rechunking
### Optimization Tips
**1. Choose Good Chunk Sizes**
```python
# Aim for balanced chunks
# Good: ~100 MB per chunk (float64: 8 bytes x 10000 x 1250 = 100 MB)
x = da.random.random((100000, 10000), chunks=(10000, 1250))
```
**2. Align Chunks for Operations**
```python
# Make sure chunks align for operations
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = da.random.random((10000, 10000), chunks=(1000, 1000)) # Aligned
z = x + y # Efficient
```
**3. Use Appropriate Scheduler**
```python
# Arrays work well with threaded scheduler (default)
# Shared memory access is efficient
result = x.compute() # Uses threads by default
```
**4. Minimize Data Transfer**
```python
# Better: Compute on each chunk, then transfer results
means = x.mean(axis=1).compute() # Transfers less data
# Worse: Transfer all data then compute
x_numpy = x.compute()
means = x_numpy.mean(axis=1) # Transfers more data
```
## Common Patterns
### Image Processing
```python
import dask.array as da
# Load large image stack
images = da.from_zarr('images.zarr')
# Apply filtering
def apply_gaussian(block):
    from scipy.ndimage import gaussian_filter
    return gaussian_filter(block, sigma=2)
filtered = da.map_blocks(apply_gaussian, images, dtype=images.dtype)
# Compute statistics
mean_intensity = filtered.mean().compute()
```
### Scientific Computing
```python
# Large-scale numerical simulation
x = da.random.random((100000, 100000), chunks=(10000, 10000))
# Apply iterative computation
for i in range(num_iterations):
    x = da.exp(-x) * da.sin(x)
    x = x.persist()  # Keep in memory for next iteration
# Final result
result = x.compute()
```
### Data Analysis
```python
# Load large dataset
data = da.from_zarr('measurements.zarr')
# Compute statistics
mean = data.mean(axis=0)
std = data.std(axis=0)
normalized = (data - mean) / std
# Save normalized data
da.to_zarr(normalized, 'normalized.zarr')
```
## Integration with Other Tools
### XArray
```python
import xarray as xr
import dask.array as da
# XArray wraps Dask arrays with labeled dimensions
data = da.random.random((1000, 2000, 3000), chunks=(100, 200, 300))
dataset = xr.DataArray(
    data,
    dims=['time', 'y', 'x'],
    coords={'time': range(1000), 'y': range(2000), 'x': range(3000)}
)
```
### Scikit-learn (via Dask-ML)
```python
# Some scikit-learn compatible operations
from dask_ml.preprocessing import StandardScaler
X = da.random.random((10000, 100), chunks=(1000, 100))
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
## Debugging Tips
### Visualize Task Graph
```python
# Visualize computation graph (for small arrays)
x = da.random.random((100, 100), chunks=(10, 10))
y = x + 1
y.visualize(filename='graph.png')
```
### Check Array Properties
```python
# Inspect before computing
print(f"Shape: {x.shape}")
print(f"Dtype: {x.dtype}")
print(f"Chunks: {x.chunks}")
print(f"Number of tasks: {len(x.__dask_graph__())}")
```
### Test on Small Arrays First
```python
# Test logic on small array
small_x = da.random.random((100, 100), chunks=(50, 50))
result_small = computation(small_x).compute()
# Validate, then scale
large_x = da.random.random((100000, 100000), chunks=(10000, 10000))
result_large = computation(large_x).compute()
```

# Dask Bags
## Overview
Dask Bag implements functional operations including `map`, `filter`, `fold`, and `groupby` on generic Python objects. It processes data in parallel while maintaining a small memory footprint through Python iterators. Bags function as "a parallel version of PyToolz or a Pythonic version of the PySpark RDD."
## Core Concept
A Dask Bag is a collection of Python objects distributed across partitions:
- Each partition contains generic Python objects
- Operations use functional programming patterns
- Processing uses streaming/iterators for memory efficiency
- Ideal for unstructured or semi-structured data
## Key Capabilities
### Functional Operations
- `map`: Transform each element
- `filter`: Select elements based on condition
- `fold`: Reduce elements with combining function
- `groupby`: Group elements by key
- `pluck`: Extract fields from records
- `flatten`: Flatten nested structures
### Use Cases
- Text processing and log analysis
- JSON record processing
- ETL on unstructured data
- Data cleaning before structured analysis
## When to Use Dask Bags
**Use Bags When**:
- Working with general Python objects requiring flexible computation
- Data doesn't fit structured array or tabular formats
- Processing text, JSON, or custom Python objects
- Initial data cleaning and ETL is needed
- Memory-efficient streaming is important
**Use Other Collections When**:
- Data is structured (use DataFrames instead)
- Numeric computing (use Arrays instead)
- Operations require complex groupby or shuffles (use DataFrames)
**Key Recommendation**: Use Bag to clean and process data, then transform it into an array or DataFrame before embarking on more complex operations that require shuffle steps.
## Important Limitations
Bags sacrifice performance for generality:
- Rely on multiprocessing scheduling (not threads)
- Remain immutable (create new bags for changes)
- Operate slower than array/DataFrame equivalents
- Handle `groupby` inefficiently (use `foldby` when possible)
- Operations requiring substantial inter-worker communication are slow
## Creating Bags
### From Sequences
```python
import dask.bag as db
# From Python list
bag = db.from_sequence([1, 2, 3, 4, 5], partition_size=2)
# From range
bag = db.from_sequence(range(10000), partition_size=1000)
```
### From Text Files
```python
# Single file
bag = db.read_text('data.txt')
# Multiple files with glob
bag = db.read_text('data/*.txt')
# With encoding
bag = db.read_text('data/*.txt', encoding='utf-8')
# Custom line processing
bag = db.read_text('logs/*.log', blocksize='64MB')
```
### From Delayed Objects
```python
import dask
@dask.delayed
def load_data(filename):
    with open(filename) as f:
        return [line.strip() for line in f]
files = ['file1.txt', 'file2.txt', 'file3.txt']
partitions = [load_data(f) for f in files]
bag = db.from_delayed(partitions)
```
### From Custom Sources
```python
# From any iterable-producing function
import glob
import json

def read_json_files():
    for filename in glob.glob('data/*.json'):
        with open(filename) as f:
            yield json.load(f)

# Create bag from generator (materialized into partitions)
bag = db.from_sequence(read_json_files(), partition_size=10)
```
## Common Operations
### Map (Transform)
```python
import dask.bag as db
bag = db.read_text('data/*.json')
# Parse JSON
import json
parsed = bag.map(json.loads)
# Extract field
values = parsed.map(lambda x: x['value'])
# Complex transformation
def process_record(record):
    return {
        'id': record['id'],
        'value': record['value'] * 2,
        'category': record.get('category', 'unknown')
    }
processed = parsed.map(process_record)
```
### Filter
```python
# Filter by condition
valid = parsed.filter(lambda x: x['status'] == 'valid')
# Multiple conditions
filtered = parsed.filter(lambda x: x['value'] > 100 and x['year'] == 2024)
# Filter with custom function
def is_valid_record(record):
    return record.get('status') == 'valid' and record.get('value') is not None
valid_records = parsed.filter(is_valid_record)
```
### Pluck (Extract Fields)
```python
# Extract single field
ids = parsed.pluck('id')
# Extract multiple fields (creates tuples)
key_pairs = parsed.pluck(['id', 'value'])
```
### Flatten
```python
# Flatten nested lists
nested = db.from_sequence([[1, 2], [3, 4], [5, 6]])
flat = nested.flatten() # [1, 2, 3, 4, 5, 6]
# Flatten after map
bag = db.read_text('data/*.txt')
words = bag.map(str.split).flatten() # All words from all files
```
### GroupBy (Expensive)
```python
# Group by key (requires shuffle)
grouped = parsed.groupby(lambda x: x['category'])
# Aggregate after grouping
counts = grouped.map(lambda key_items: (key_items[0], len(list(key_items[1]))))
result = counts.compute()
```
### FoldBy (Preferred for Aggregations)
```python
# FoldBy is more efficient than groupby for aggregations
def add(acc, item):
    return acc + item['value']

def combine(acc1, acc2):
    return acc1 + acc2
# Sum values by category
sums = parsed.foldby(
    key='category',
    binop=add,
    initial=0,
    combine=combine
)
result = sums.compute()
```
### Reductions
```python
# Count elements
count = bag.count().compute()
# Get all distinct values (requires memory)
distinct = bag.distinct().compute()
# Take first n elements
first_ten = bag.take(10)
# Fold/reduce
total = bag.fold(
    lambda acc, x: acc + x['value'],
    initial=0,
    combine=lambda a, b: a + b
).compute()
```
## Converting to Other Collections
### To DataFrame
```python
import dask.bag as db
import dask.dataframe as dd
# Bag of dictionaries
bag = db.read_text('data/*.json').map(json.loads)
# Convert to DataFrame
ddf = bag.to_dataframe()
# With explicit columns
ddf = bag.to_dataframe(meta={'id': int, 'value': float, 'category': str})
```
### To List/Compute
```python
# Compute to Python list (loads all in memory)
result = bag.compute()
# Take sample
sample = bag.take(100)
```
## Common Patterns
### JSON Processing
```python
import dask.bag as db
import json
# Read and parse JSON files
bag = db.read_text('logs/*.json')
parsed = bag.map(json.loads)
# Filter valid records
valid = parsed.filter(lambda x: x.get('status') == 'success')
# Extract relevant fields
processed = valid.map(lambda x: {
    'user_id': x['user']['id'],
    'timestamp': x['timestamp'],
    'value': x['metrics']['value']
})
# Convert to DataFrame for analysis
ddf = processed.to_dataframe()
# Analyze
summary = ddf.groupby('user_id')['value'].mean().compute()
```
### Log Analysis
```python
# Read log files
logs = db.read_text('logs/*.log')
# Parse log lines
def parse_log_line(line):
    parts = line.split(' ')
    return {
        'timestamp': parts[0],
        'level': parts[1],
        'message': ' '.join(parts[2:])
    }
parsed_logs = logs.map(parse_log_line)
# Filter errors
errors = parsed_logs.filter(lambda x: x['level'] == 'ERROR')
# Count by message pattern
error_counts = errors.foldby(
    key='message',
    binop=lambda acc, x: acc + 1,
    initial=0,
    combine=lambda a, b: a + b
)
result = error_counts.compute()
```
### Text Processing
```python
# Read text files
text = db.read_text('documents/*.txt')
# Split into words
words = text.map(str.lower).map(str.split).flatten()
# Count word frequencies
def increment(acc, word):
    return acc + 1

def combine_counts(a, b):
    return a + b

word_counts = words.foldby(
    key=lambda word: word,
    binop=increment,
    initial=0,
    combine=combine_counts
)
# Get top words
top_words = word_counts.compute()
sorted_words = sorted(top_words, key=lambda x: x[1], reverse=True)[:100]
```
### Data Cleaning Pipeline
```python
import dask.bag as db
import json
# Read raw data
raw = db.read_text('raw_data/*.json').map(json.loads)
# Validation function
def is_valid(record):
    required_fields = ['id', 'timestamp', 'value']
    return all(field in record for field in required_fields)
# Cleaning function
def clean_record(record):
    return {
        'id': int(record['id']),
        'timestamp': record['timestamp'],
        'value': float(record['value']),
        'category': record.get('category', 'unknown'),
        'tags': record.get('tags', [])
    }
# Pipeline
cleaned = (raw
    .filter(is_valid)
    .map(clean_record)
    .filter(lambda x: x['value'] > 0)
)
# Convert to DataFrame
ddf = cleaned.to_dataframe()
# Save cleaned data
ddf.to_parquet('cleaned_data/')
```
## Performance Considerations
### Efficient Operations
- Map, filter, pluck: Very efficient (streaming)
- Flatten: Efficient
- FoldBy with good key distribution: Reasonable
- Take and head: Efficient (only processes needed partitions)
### Expensive Operations
- GroupBy: Requires shuffle, can be slow
- Distinct: Requires collecting all unique values
- Operations requiring full data materialization
### Optimization Tips
**1. Use FoldBy Instead of GroupBy**
```python
# Better: Use foldby for aggregations
result = bag.foldby(key='category', binop=add, initial=0,
                    combine=lambda a, b: a + b)
# Worse: GroupBy then reduce (requires a full shuffle)
result = bag.groupby(lambda x: x['category']).map(
    lambda kv: (kv[0], sum(item['value'] for item in kv[1]))
)
```
**2. Convert to DataFrame Early**
```python
# For structured operations, convert to DataFrame
bag = db.read_text('data/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
ddf = bag.to_dataframe() # Now use efficient DataFrame operations
```
**3. Control Partition Size**
```python
# Balance between too many and too few partitions
bag = db.read_text('data/*.txt', blocksize='64MB') # Reasonable partition size
```
**4. Use Lazy Evaluation**
```python
# Chain operations before computing
result = (bag
    .map(process1)
    .filter(condition)
    .map(process2)
    .compute()  # Single compute at the end
)
```
## Debugging Tips
### Inspect Partitions
```python
# Get number of partitions
print(bag.npartitions)
# Take sample
sample = bag.take(10)
print(sample)
```
### Validate on Small Data
```python
# Test logic on small subset
small_bag = db.from_sequence(sample_data, partition_size=10)
result = process_pipeline(small_bag).compute()
# Validate results, then scale
```
### Check Intermediate Results
```python
# Compute intermediate steps to debug
step1 = bag.map(parse).take(5)
print("After parsing:", step1)
step2 = bag.map(parse).filter(validate).take(5)
print("After filtering:", step2)
```
## Memory Management
Bags are designed for memory-efficient processing:
```python
# Streaming processing - doesn't load all in memory
bag = db.read_text('huge_file.txt') # Lazy
processed = bag.map(process_line) # Still lazy
result = processed.compute() # Processes in chunks
```
For very large results, avoid computing to memory:
```python
# Don't compute huge results to memory
# result = bag.compute() # Could overflow memory
# Instead, convert and save to disk
ddf = bag.to_dataframe()
ddf.to_parquet('output/')
```

# Dask Best Practices
## Performance Optimization Principles
### Start with Simpler Solutions First
Before implementing parallel computing with Dask, explore these alternatives:
- Better algorithms for the specific problem
- Efficient file formats (Parquet, HDF5, Zarr instead of CSV)
- Compiled code via Numba or Cython
- Data sampling for development and testing
These alternatives often provide better returns than distributed systems and should be exhausted before scaling to parallel computing.
### Chunk Size Strategy
**Critical Rule**: Chunks should be small enough that many fit in a worker's available memory at once.
**Recommended Target**: Size chunks so workers can hold 10 chunks per core without exceeding available memory.
**Why It Matters**:
- Too large chunks: Memory overflow and inefficient parallelization
- Too small chunks: Excessive scheduling overhead
**Example Calculation**:
- 8 cores with 32 GB RAM
- Target: ~400 MB per chunk (32 GB / 8 cores / 10 chunks)
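A quick sketch of that arithmetic for float64 data (shapes are illustrative):
```python
import dask.array as da

# 32 GB / 8 cores / 10 chunks per core ≈ 400 MB per chunk
# float64 uses 8 bytes per element, so ~50M elements per chunk,
# e.g. blocks of 5,000 x 10,000
x = da.random.random((100_000, 100_000), chunks=(5_000, 10_000))
print(x.chunksize)                     # (5000, 10000)
print(5_000 * 10_000 * 8 / 1e6, "MB")  # 400.0 MB per chunk
```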
### Monitor with the Dashboard
The Dask dashboard provides essential visibility into:
- Worker states and resource utilization
- Task progress and bottlenecks
- Memory usage patterns
- Performance characteristics
Access the dashboard to understand what's actually slow in parallel workloads rather than guessing at optimizations.
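The simplest way to get a dashboard is through a local distributed `Client`:
```python
from dask.distributed import Client

client = Client()             # Local cluster; the dashboard starts automatically
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status
```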
## Critical Pitfalls to Avoid
### 1. Don't Create Large Objects Locally Before Dask
**Wrong Approach**:
```python
import pandas as pd
import dask.dataframe as dd
# Loads entire dataset into memory first
df = pd.read_csv('large_file.csv')
ddf = dd.from_pandas(df, npartitions=10)
```
**Correct Approach**:
```python
import dask.dataframe as dd
# Let Dask handle the loading
ddf = dd.read_csv('large_file.csv')
```
**Why**: Loading data with pandas or NumPy first forces the scheduler to serialize and embed those objects in task graphs, defeating the purpose of parallel computing.
**Key Principle**: Use Dask methods to load data and use Dask to control the results.
### 2. Avoid Repeated compute() Calls
**Wrong Approach**:
```python
results = []
for item in items:
    result = dask_computation(item).compute()  # Each compute is separate
    results.append(result)
```
**Correct Approach**:
```python
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations) # Single compute for all
```
**Why**: Calling compute in loops prevents Dask from:
- Parallelizing different computations
- Sharing intermediate results
- Optimizing the overall task graph
### 3. Don't Build Excessively Large Task Graphs
**Symptoms**:
- Millions of tasks in a single computation
- Severe scheduling overhead
- Long delays before computation starts
**Solutions**:
- Increase chunk sizes to reduce number of tasks
- Use `map_partitions` or `map_blocks` to fuse operations
- Break computations into smaller pieces with intermediate persists
- Consider whether the problem truly requires distributed computing
**Example Using map_partitions**:
```python
# Instead of applying function to each row
ddf['result'] = ddf.apply(complex_function, axis=1) # Many tasks
# Apply to entire partitions at once
ddf = ddf.map_partitions(lambda df: df.assign(result=complex_function(df)))
```
## Infrastructure Considerations
### Scheduler Selection
**Use Threads For**:
- Numeric work with GIL-releasing libraries (NumPy, Pandas, scikit-learn)
- Operations that benefit from shared memory
- Single-machine workloads with array/dataframe operations
**Use Processes For**:
- Text processing and Python collection operations
- Pure Python code that's GIL-bound
- Operations that need process isolation
**Use Distributed Scheduler For**:
- Multi-machine clusters
- Need for diagnostic dashboard
- Asynchronous APIs
- Better data locality handling
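Schedulers can be selected globally or per call; a short sketch (file paths and column names are illustrative):
```python
import dask.bag as db
import dask.dataframe as dd

# Numeric, GIL-releasing work: threads
ddf = dd.read_parquet('data.parquet')
total = ddf['value'].sum().compute(scheduler='threads')

# Python-heavy work: processes
bag = db.read_text('logs/*.txt')
lines = bag.map(str.strip).compute(scheduler='processes')
```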
### Thread Configuration
**Recommendation**: Aim for roughly 4 threads per process on numeric workloads.
**Rationale**:
- Balance between parallelism and overhead
- Allows efficient use of CPU cores
- Reduces context switching costs
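With the distributed scheduler this maps directly onto the `Client` constructor; a sketch for an 8-core machine:
```python
from dask.distributed import Client

# 8 cores -> 2 worker processes x 4 threads each for numeric workloads
client = Client(n_workers=2, threads_per_worker=4)
```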
### Memory Management
**Persist Strategically**:
```python
# Persist intermediate results that are reused
intermediate = expensive_computation(data).persist()
result1 = intermediate.operation1().compute()
result2 = intermediate.operation2().compute()
```
**Clear Memory When Done**:
```python
# Explicitly delete large objects
del intermediate
```
## Data Loading Best Practices
### Use Appropriate File Formats
**For Tabular Data**:
- Parquet: Columnar, compressed, fast filtering
- CSV: Only for small data or initial ingestion
**For Array Data**:
- HDF5: Good for numeric arrays
- Zarr: Cloud-native, parallel-friendly
- NetCDF: Scientific data with metadata
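A one-time conversion from CSV to Parquet usually pays for itself quickly; a sketch with illustrative paths:
```python
import dask.dataframe as dd

# Ingest CSV once, then do all further work against Parquet
ddf = dd.read_csv('raw/*.csv')
ddf.to_parquet('data.parquet')
```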
### Optimize Data Ingestion
**Read Multiple Files Efficiently**:
```python
# Use glob patterns to read multiple files in parallel
ddf = dd.read_parquet('data/year=2024/month=*/day=*.parquet')
```
**Specify Useful Columns Early**:
```python
# Only read needed columns
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2', 'col3'])
```
## Common Patterns and Solutions
### Pattern: Embarrassingly Parallel Problems
For independent computations, use Futures:
```python
from dask.distributed import Client
client = Client()
futures = [client.submit(func, arg) for arg in args]
results = client.gather(futures)
```
### Pattern: Data Preprocessing Pipeline
Use Bags for initial ETL, then convert to structured formats:
```python
import dask.bag as db
# Process raw JSON
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'success')
# Convert to DataFrame for analysis
ddf = bag.to_dataframe()
```
### Pattern: Iterative Algorithms
Persist data between iterations:
```python
data = dd.read_parquet('data.parquet')
data = data.persist() # Keep in memory across iterations
for iteration in range(num_iterations):
    data = update_function(data)
    data = data.persist()  # Persist updated version
```
## Debugging Tips
### Use Single-Threaded Scheduler
For debugging with pdb or detailed error inspection:
```python
import dask
dask.config.set(scheduler='synchronous')
result = computation.compute() # Runs in single thread for debugging
```
### Check Task Graph Size
Before computing, check the number of tasks:
```python
print(len(ddf.__dask_graph__())) # Should be reasonable, not millions
```
### Validate on Small Data First
Test logic on small subset before scaling:
```python
# Test on first partition
sample = ddf.head(1000)
# Validate results
# Then scale to full dataset
```
## Performance Troubleshooting
### Symptom: Slow Computation Start
**Likely Cause**: Task graph is too large
**Solution**: Increase chunk sizes or use map_partitions
### Symptom: Memory Errors
**Likely Causes**:
- Chunks too large
- Too many intermediate results
- Memory leaks in user functions
**Solutions**:
- Decrease chunk sizes
- Use persist() strategically and delete when done
- Profile user functions for memory issues
### Symptom: Poor Parallelization
**Likely Causes**:
- Data dependencies preventing parallelism
- Chunks too large (not enough tasks)
- GIL contention with threads on Python code
**Solutions**:
- Restructure computation to reduce dependencies
- Increase number of partitions
- Switch to multiprocessing scheduler for Python code

# Dask DataFrames
## Overview
Dask DataFrames enable parallel processing of large tabular data by distributing work across multiple pandas DataFrames. As described in the documentation, "Dask DataFrames are a collection of many pandas DataFrames" with identical APIs, making the transition from pandas straightforward.
## Core Concept
A Dask DataFrame is divided into multiple pandas DataFrames (partitions) along the index:
- Each partition is a regular pandas DataFrame
- Operations are applied to each partition in parallel
- Results are combined automatically
## Key Capabilities
### Scale
- Process 100 GiB on a laptop
- Process 100 TiB on a cluster
- Handle datasets exceeding available RAM
### Compatibility
- Implements most of the pandas API
- Easy transition from pandas code
- Works with familiar operations
## When to Use Dask DataFrames
**Use Dask When**:
- Dataset exceeds available RAM
- Computations require significant time and pandas optimization hasn't helped
- Need to scale from prototype (pandas) to production (larger data)
- Working with multiple files that should be processed together
**Stick with Pandas When**:
- Data fits comfortably in memory
- Computations complete in sub-second time
- Simple operations without custom `.apply()` functions
- Iterative development and exploration
## Reading Data
Dask mirrors pandas reading syntax with added support for multiple files:
### Single File
```python
import dask.dataframe as dd
# Read single file
ddf = dd.read_csv('data.csv')
ddf = dd.read_parquet('data.parquet')
```
### Multiple Files
```python
# Read multiple files using glob patterns
ddf = dd.read_csv('data/*.csv')
ddf = dd.read_parquet('s3://mybucket/data/*.parquet')
# Read with path structure
ddf = dd.read_parquet('data/year=*/month=*/day=*.parquet')
```
### Optimizations
```python
# Specify columns to read (reduces memory)
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2'])
# Control partitioning
ddf = dd.read_csv('data.csv', blocksize='64MB') # Creates 64MB partitions
```
## Common Operations
All operations are lazy until `.compute()` is called.
### Filtering
```python
# Same as pandas
filtered = ddf[ddf['column'] > 100]
filtered = ddf.query('column > 100')
```
### Column Operations
```python
# Add columns
ddf['new_column'] = ddf['col1'] + ddf['col2']
# Select columns
subset = ddf[['col1', 'col2', 'col3']]
# Drop columns
ddf = ddf.drop(columns=['unnecessary_col'])
```
### Aggregations
```python
# Standard aggregations work as expected
mean = ddf['column'].mean().compute()
sum_total = ddf['column'].sum().compute()
counts = ddf['category'].value_counts().compute()
```
### GroupBy
```python
# GroupBy operations (may require shuffle)
grouped = ddf.groupby('category')['value'].mean().compute()
# Multiple aggregations
agg_result = ddf.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'amount': 'sum'
}).compute()
```
### Joins and Merges
```python
# Merge DataFrames
merged = dd.merge(ddf1, ddf2, on='key', how='left')
# Join on index
joined = ddf1.join(ddf2, on='key')
```
### Sorting
```python
# Sorting (expensive operation, requires data movement)
sorted_ddf = ddf.sort_values('column')
result = sorted_ddf.compute()
```
## Custom Operations
### Apply Functions
**To Partitions (Efficient)**:
```python
# Apply function to entire partitions
def custom_partition_function(partition_df):
    # partition_df is a pandas DataFrame
    return partition_df.assign(new_col=partition_df['col1'] * 2)
ddf = ddf.map_partitions(custom_partition_function)
```
**To Rows (Less Efficient)**:
```python
# Apply to each row (creates many tasks)
ddf['result'] = ddf.apply(lambda row: custom_function(row), axis=1, meta=('result', 'float'))
```
**Note**: Always prefer `map_partitions` over row-wise `apply` for better performance.
### Meta Parameter
When Dask can't infer output structure, specify the `meta` parameter:
```python
# For apply operations
ddf['new'] = ddf.apply(func, axis=1, meta=('new', 'float64'))
# For map_partitions
import pandas as pd

ddf = ddf.map_partitions(func, meta=pd.DataFrame({
    'col1': pd.Series(dtype='float64'),
    'col2': pd.Series(dtype='int64')
}))
```
## Lazy Evaluation and Computation
### Lazy Operations
```python
# These operations are lazy (instant, no computation)
filtered = ddf[ddf['value'] > 100]
aggregated = filtered.groupby('category').mean()
final = aggregated[aggregated['value'] < 500]
# Nothing has computed yet
```
### Triggering Computation
```python
# Compute single result
result = final.compute()
# Compute multiple results efficiently
import dask

result1, result2, result3 = dask.compute(
    operation1,
    operation2,
    operation3
)
```
### Persist in Memory
```python
# Keep results in distributed memory for reuse
ddf_cached = ddf.persist()
# Now multiple operations on ddf_cached won't recompute
result1 = ddf_cached.mean().compute()
result2 = ddf_cached.sum().compute()
```
## Index Management
### Setting Index
```python
# Set index (required for efficient joins and certain operations)
ddf = ddf.set_index('timestamp', sorted=True)
```
### Index Properties
- Sorted index enables efficient filtering and joins
- Index determines partitioning
- Some operations perform better with appropriate index
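For example, with a sorted timestamp index, row selection only touches the partitions that overlap the requested range (column name and dates are illustrative):
```python
ddf = ddf.set_index('timestamp', sorted=True)

# Only partitions overlapping this range are read
january = ddf.loc['2024-01-01':'2024-01-31'].compute()
```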
## Writing Results
### To Files
```python
# Write to multiple files (one per partition)
ddf.to_parquet('output/data.parquet')
ddf.to_csv('output/data-*.csv')
# Write to single file (forces computation and concatenation)
ddf.compute().to_csv('output/single_file.csv')
```
### To Memory (Pandas)
```python
# Convert to pandas (loads all data in memory)
pdf = ddf.compute()
```
## Performance Considerations
### Efficient Operations
- Column selection and filtering: Very efficient
- Simple aggregations (sum, mean, count): Efficient
- Row-wise operations on partitions: Efficient with `map_partitions`
### Expensive Operations
- Sorting: Requires data shuffle across workers
- GroupBy with many groups: May require shuffle
- Complex joins: Depends on data distribution
- Row-wise apply: Creates many tasks
### Optimization Tips
**1. Select Columns Early**
```python
# Better: Read only needed columns
ddf = dd.read_parquet('data.parquet', columns=['col1', 'col2'])
```
**2. Filter Before GroupBy**
```python
# Better: Reduce data before expensive operations
result = ddf[ddf['year'] == 2024].groupby('category').sum().compute()
```
**3. Use Efficient File Formats**
```python
# Use Parquet instead of CSV for better performance
ddf.to_parquet('data.parquet') # Faster, smaller, columnar
```
**4. Repartition Appropriately**
```python
# If partitions are too small
ddf = ddf.repartition(npartitions=10)
# If partitions are too large
ddf = ddf.repartition(partition_size='100MB')
```
## Common Patterns
### ETL Pipeline
```python
import dask.dataframe as dd
# Read data
ddf = dd.read_csv('raw_data/*.csv')
# Transform
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])
# Aggregate
summary = ddf.groupby('category').agg({
    'amount': ['sum', 'mean'],
    'quantity': 'count'
})
# Write results
summary.to_parquet('output/summary.parquet')
```
### Time Series Analysis
```python
# Read time series data
ddf = dd.read_parquet('timeseries/*.parquet')
# Set timestamp index
ddf = ddf.set_index('timestamp', sorted=True)
# Resample (if available in Dask version)
hourly = ddf.resample('1H').mean()
# Compute statistics
result = hourly.compute()
```
### Combining Multiple Files
```python
# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')
# Process combined data
result = ddf.groupby('category')['value'].sum().compute()
```
## Limitations and Differences from Pandas
### Not All Pandas Features Available
Some pandas operations are not implemented in Dask:
- Some string methods
- Certain window functions
- Some specialized statistical functions
### Partitioning Matters
- Operations within partitions are efficient
- Cross-partition operations may be expensive
- Index-based operations benefit from sorted index
### Lazy Evaluation
- Operations don't execute until `.compute()`
- Need to be aware of computation triggers
- Can't inspect intermediate results without computing
## Debugging Tips
### Inspect Partitions
```python
# Get number of partitions
print(ddf.npartitions)
# Compute single partition
first_partition = ddf.get_partition(0).compute()
# View first few rows (computes first partition)
print(ddf.head())
```
### Validate Operations on Small Data
```python
# Test on small sample first
sample = ddf.head(1000)
# Validate logic works
# Then scale to full dataset
result = ddf.compute()
```
### Check Dtypes
```python
# Verify data types are correct
print(ddf.dtypes)
```

# Dask Futures
## Overview
Dask futures extend Python's `concurrent.futures` interface, enabling immediate (non-lazy) task execution. Unlike delayed computations (used in DataFrames, Arrays, and Bags), futures provide more flexibility in situations where computations may evolve over time or require dynamic workflow construction.
## Core Concept
Futures represent real-time task execution:
- Tasks execute immediately when submitted (not lazy)
- Each future represents a remote computation result
- Automatic dependency tracking between futures
- Enables dynamic, evolving workflows
- Direct control over task scheduling and data placement
## Key Capabilities
### Real-Time Execution
- Tasks run immediately when submitted
- No need for explicit `.compute()` call
- Get results with `.result()` method
### Automatic Dependency Management
When you submit tasks whose inputs are other futures, Dask automatically tracks the dependencies. Once all input futures have completed, their results are moved to a single worker where the dependent task runs.
### Dynamic Workflows
Build computations that evolve based on intermediate results:
- Submit new tasks based on previous results
- Conditional execution paths
- Iterative algorithms with varying structure
## When to Use Futures
**Use Futures When**:
- Building dynamic, evolving workflows
- Need immediate task execution (not lazy)
- Computations depend on runtime conditions
- Require fine control over task placement
- Implementing custom parallel algorithms
- Need stateful computations (with actors)
**Use Other Collections When**:
- Static, predefined computation graphs (use delayed, DataFrames, Arrays)
- Simple data parallelism on large collections (use Bags, DataFrames)
- Standard array/dataframe operations suffice
## Setting Up Client
Futures require a distributed client:
```python
from dask.distributed import Client
# Local cluster (on single machine)
client = Client()
# Or specify resources
client = Client(n_workers=4, threads_per_worker=2)
# Or connect to existing cluster
client = Client('scheduler-address:8786')
```
## Submitting Tasks
### Basic Submit
```python
from dask.distributed import Client
client = Client()
# Submit single task
def add(x, y):
    return x + y
future = client.submit(add, 1, 2)
# Get result
result = future.result() # Blocks until complete
print(result) # 3
```
### Multiple Tasks
```python
# Submit multiple independent tasks
futures = []
for i in range(10):
    future = client.submit(add, i, i)
    futures.append(future)
# Gather results
results = client.gather(futures) # Efficient parallel gathering
```
### Map Over Inputs
```python
# Apply function to multiple inputs
def square(x):
    return x ** 2
# Submit batch of tasks
futures = client.map(square, range(100))
# Gather results
results = client.gather(futures)
```
**Note**: Each task carries ~1ms overhead, making `map` less suitable for millions of tiny tasks. For massive datasets, use Bags or DataFrames instead.
## Working with Futures
### Check Status
```python
future = client.submit(expensive_function, arg)
# Check if complete
print(future.done()) # False or True
# Check status
print(future.status) # 'pending', 'running', 'finished', or 'error'
```
### Non-Blocking Result Retrieval
```python
# Non-blocking check
if future.done():
    result = future.result()
else:
    print("Still computing...")
# Or use callbacks
def handle_result(future):
print(f"Result: {future.result()}")
future.add_done_callback(handle_result)
```
### Error Handling
```python
def might_fail(x):
    if x < 0:
        raise ValueError("Negative value")
    return x ** 2

future = client.submit(might_fail, -5)
try:
    result = future.result()
except ValueError as e:
    print(f"Task failed: {e}")
```
## Task Dependencies
### Automatic Dependency Tracking
```python
# Submit task
future1 = client.submit(add, 1, 2)
# Use future as input (creates dependency)
future2 = client.submit(add, future1, 10) # Depends on future1
# Chain dependencies
future3 = client.submit(add, future2, 100) # Depends on future2
# Get final result
result = future3.result() # 113
```
### Complex Dependencies
```python
# Multiple dependencies
a = client.submit(func1, x)
b = client.submit(func2, y)
c = client.submit(func3, a, b) # Depends on both a and b
result = c.result()
```
## Data Movement Optimization
### Scatter Data
Pre-scatter important data to avoid repeated transfers:
```python
# Upload data to cluster once
large_dataset = client.scatter(big_data) # Returns future
# Use scattered data in multiple tasks
futures = [client.submit(process, large_dataset, i) for i in range(100)]
# Each task uses the same scattered data without re-transfer
results = client.gather(futures)
```
### Efficient Gathering
Use `client.gather()` for concurrent result collection:
```python
# Better: Gather all at once (parallel)
results = client.gather(futures)
# Worse: Sequential result retrieval
results = [f.result() for f in futures]
```
## Fire-and-Forget
For side-effect tasks without needing the result:
```python
from dask.distributed import fire_and_forget
def log_to_database(data):
    # Write to database, no return value needed
    database.write(data)
# Submit without keeping reference
future = client.submit(log_to_database, data)
fire_and_forget(future)
# Dask won't abandon this computation even without active future reference
```
## Performance Characteristics
### Task Overhead
- ~1ms overhead per task
- Good for: Thousands of tasks
- Not suitable for: Millions of tiny tasks
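When there are far more items than the per-task overhead allows, one workaround is to submit batches rather than individual items; a sketch in which `client`, `items`, and `process_item` are placeholders:
```python
def process_batch(batch):
    # One task handles many items, amortizing the ~1ms per-task overhead
    return [process_item(item) for item in batch]

batch_size = 1000
batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
futures = client.map(process_batch, batches)
results = [r for batch in client.gather(futures) for r in batch]
```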
### Worker-to-Worker Communication
- Direct worker-to-worker data transfer
- Roundtrip latency: ~1ms
- Efficient for task dependencies
### Memory Management
Dask tracks active futures locally. When a future is garbage collected by your local Python session, Dask will feel free to delete that data.
**Keep References**:
```python
# Keep reference to prevent deletion
important_result = client.submit(expensive_calc, data)
# Use result multiple times
future1 = client.submit(process1, important_result)
future2 = client.submit(process2, important_result)
```
## Advanced Coordination
### Distributed Primitives
**Queues**:
```python
from dask.distributed import Queue
queue = Queue()
def producer():
    for i in range(10):
        queue.put(i)

def consumer():
    results = []
    for _ in range(10):
        results.append(queue.get())
    return results

# Submit tasks (keep a reference to each future)
producer_future = client.submit(producer)
result_future = client.submit(consumer)
results = result_future.result()
```
**Locks**:
```python
from dask.distributed import Lock
lock = Lock()
def critical_section():
    with lock:
        # Only one task executes this at a time
        shared_resource.update()
```
**Events**:
```python
import time
from dask.distributed import Event

event = Event()

def waiter():
    event.wait()  # Blocks until event is set
    return "Event occurred"

def setter():
    time.sleep(5)
    event.set()
# Start both tasks
wait_future = client.submit(waiter)
set_future = client.submit(setter)
result = wait_future.result() # Waits for setter to complete
```
**Variables**:
```python
from dask.distributed import Variable
var = Variable('my-var')
# Set value
var.set(42)
# Get value from tasks
def reader():
    return var.get()
future = client.submit(reader)
print(future.result()) # 42
```
## Actors
For stateful, rapidly-changing workflows, actors enable worker-to-worker roundtrip latency around 1ms while bypassing scheduler coordination.
### Creating Actors
```python
from dask.distributed import Client
client = Client()
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

    def get_count(self):
        return self.count
# Create actor on worker
counter = client.submit(Counter, actor=True).result()
# Call methods
future1 = counter.increment()
future2 = counter.increment()
result = counter.get_count().result()
print(result) # 2
```
### Actor Use Cases
- Stateful services (databases, caches)
- Rapidly changing state
- Complex coordination patterns
- Real-time streaming applications
## Common Patterns
### Embarrassingly Parallel Tasks
```python
from dask.distributed import Client
client = Client()
def process_item(item):
    # Independent computation
    return expensive_computation(item)
# Process many items in parallel
items = range(1000)
futures = client.map(process_item, items)
# Gather all results
results = client.gather(futures)
```
### Dynamic Task Submission
```python
from dask.distributed import get_client

def recursive_compute(data, depth):
    if depth == 0:
        return process(data)
    # Tasks that submit more tasks fetch the client from the worker
    client = get_client()
    # Split and recurse
    left, right = split(data)
    left_future = client.submit(recursive_compute, left, depth - 1)
    right_future = client.submit(recursive_compute, right, depth - 1)
    # Combine results
    return combine(left_future.result(), right_future.result())

# Start computation
result_future = client.submit(recursive_compute, initial_data, 5)
result = result_future.result()
```
### Parameter Sweep
```python
from itertools import product
def run_simulation(param1, param2, param3):
    # Run simulation with parameters
    return simulate(param1, param2, param3)
# Generate parameter combinations
params = product(range(10), range(10), range(10))
# Submit all combinations
futures = [client.submit(run_simulation, p1, p2, p3) for p1, p2, p3 in params]
# Gather results as they complete
from dask.distributed import as_completed
for future in as_completed(futures):
    result = future.result()
    process_result(result)
```
### Pipeline with Dependencies
```python
# Stage 1: Load data
load_futures = [client.submit(load_data, file) for file in files]
# Stage 2: Process (depends on stage 1)
process_futures = [client.submit(process, f) for f in load_futures]
# Stage 3: Aggregate (depends on stage 2)
agg_future = client.submit(aggregate, process_futures)
# Get final result
result = agg_future.result()
```
### Iterative Algorithm
```python
# Initialize
state = client.scatter(initial_state)
# Iterate
for iteration in range(num_iterations):
    # Compute update based on current state
    state = client.submit(update_function, state)
    # Check convergence
    converged = client.submit(check_convergence, state)
    if converged.result():
        break
# Get final state
final_state = state.result()
```
## Best Practices
### 1. Pre-scatter Large Data
```python
# Upload once, use many times
large_data = client.scatter(big_dataset)
futures = [client.submit(process, large_data, i) for i in range(100)]
```
### 2. Use Gather for Bulk Retrieval
```python
# Efficient: Parallel gathering
results = client.gather(futures)
# Inefficient: Sequential
results = [f.result() for f in futures]
```
### 3. Manage Memory with References
```python
# Keep important futures
important = client.submit(expensive_calc, data)
# Use multiple times
f1 = client.submit(use_result, important)
f2 = client.submit(use_result, important)
# Clean up when done
del important
```
### 4. Handle Errors Appropriately
```python
from dask.distributed import as_completed

futures = client.map(might_fail, inputs)
# Check for errors
results = []
errors = []
for future in as_completed(futures):
    try:
        results.append(future.result())
    except Exception as e:
        errors.append(e)
```
### 5. Use as_completed for Progressive Processing
```python
from dask.distributed import as_completed
futures = client.map(process, items)
# Process results as they arrive
for future in as_completed(futures):
    result = future.result()
    handle_result(result)
```
## Debugging Tips
### Monitor Dashboard
View the Dask dashboard to see:
- Task progress
- Worker utilization
- Memory usage
- Task dependencies
### Check Task Status
```python
# Inspect future
print(future.status)
print(future.done())
# Get traceback on error
try:
    future.result()
except Exception:
    print(future.traceback())
```
### Profile Tasks
```python
# Get performance data
client.profile(filename='profile.html')
```

# Dask Schedulers
## Overview
Dask provides multiple task schedulers, each suited to different workloads. The scheduler determines how tasks are executed: sequentially, in parallel threads, in parallel processes, or distributed across a cluster.
## Scheduler Types
### Single-Machine Schedulers
#### 1. Local Threads (Default)
**Description**: The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`.
**When to Use**:
- Numeric computations in NumPy, Pandas, scikit-learn
- Libraries that release the GIL (Global Interpreter Lock)
- Operations benefit from shared memory access
- Default for Dask Arrays and DataFrames
**Characteristics**:
- Low overhead
- Shared memory between threads
- Best for GIL-releasing operations
- Poor for pure Python code (GIL contention)
**Example**:
```python
import dask.array as da
# Uses threads by default
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = x.mean().compute() # Computed with threads
```
**Explicit Configuration**:
```python
import dask
# Set globally
dask.config.set(scheduler='threads')
# Or per-compute
result = x.mean().compute(scheduler='threads')
```
#### 2. Local Processes
**Description**: Multiprocessing scheduler that uses `concurrent.futures.ProcessPoolExecutor`.
**When to Use**:
- Pure Python code with GIL contention
- Text processing and Python collections
- Operations that benefit from process isolation
- CPU-bound Python code
**Characteristics**:
- Bypasses GIL limitations
- Incurs data transfer costs between processes
- Higher overhead than threads
- Ideal for linear workflows with small inputs/outputs
**Example**:
```python
import dask.bag as db
# Good for Python object processing
bag = db.read_text('data/*.txt')
result = bag.map(complex_python_function).compute(scheduler='processes')
```
**Explicit Configuration**:
```python
import dask
# Set globally
dask.config.set(scheduler='processes')
# Or per-compute
result = computation.compute(scheduler='processes')
```
**Limitations**:
- Data must be serializable (pickle)
- Overhead from process creation
- Memory overhead from data copying
#### 3. Single Thread (Synchronous)
**Description**: The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all.
**When to Use**:
- Debugging with pdb
- Profiling with standard Python tools
- Understanding errors in detail
- Development and testing
**Characteristics**:
- No parallelism
- Easy debugging
- No overhead
- Deterministic execution
**Example**:
```python
import dask
# Enable for debugging
dask.config.set(scheduler='synchronous')
# Now can use pdb
result = computation.compute() # Runs in single thread
```
**Debugging with IPython**:
```python
# In IPython/Jupyter
%pdb on
dask.config.set(scheduler='synchronous')
result = problematic_computation.compute() # Drops into debugger on error
```
### Distributed Schedulers
#### 4. Local Distributed
**Description**: Despite its name, this scheduler runs effectively on personal machines using the distributed scheduler infrastructure.
**When to Use**:
- Need diagnostic dashboard
- Asynchronous APIs
- Better data locality handling than multiprocessing
- Development before scaling to cluster
- Want distributed features on single machine
**Characteristics**:
- Provides dashboard for monitoring
- Better memory management
- More overhead than threads/processes
- Can scale to cluster later
**Example**:
```python
from dask.distributed import Client
import dask.dataframe as dd
# Create local cluster
client = Client() # Automatically uses all cores
# Use distributed scheduler
ddf = dd.read_csv('data.csv')
result = ddf.groupby('category').mean().compute()
# View dashboard
print(client.dashboard_link)
# Clean up
client.close()
```
**Configuration Options**:
```python
# Control resources
client = Client(
    n_workers=4,
    threads_per_worker=2,
    memory_limit='4GB'
)
```
#### 5. Cluster Distributed
**Description**: For scaling across multiple machines using the distributed scheduler.
**When to Use**:
- Data exceeds single machine capacity
- Need computational power beyond one machine
- Production deployments
- Cluster computing environments (HPC, cloud)
**Characteristics**:
- Scales to hundreds of machines
- Requires cluster setup
- Network communication overhead
- Advanced features (adaptive scaling, task prioritization)
**Example with Dask-Jobqueue (HPC)**:
```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
# Create cluster on HPC with SLURM
cluster = SLURMCluster(
    cores=24,
    memory='100GB',
    walltime='02:00:00',
    queue='regular'
)
# Scale to 10 jobs
cluster.scale(jobs=10)
# Connect client
client = Client(cluster)
# Run computation
result = computation.compute()
client.close()
```
**Example with Dask on Kubernetes**:
```python
from dask_kubernetes import KubeCluster
from dask.distributed import Client
cluster = KubeCluster()
cluster.scale(20) # 20 workers
client = Client(cluster)
result = computation.compute()
client.close()
```
## Scheduler Configuration
### Global Configuration
```python
import dask
# Set scheduler globally for session
dask.config.set(scheduler='threads')
dask.config.set(scheduler='processes')
dask.config.set(scheduler='synchronous')
```
### Context Manager
```python
import dask
# Temporarily use different scheduler
with dask.config.set(scheduler='processes'):
    result = computation.compute()
# Back to default scheduler
result2 = computation2.compute()
```
### Per-Compute
```python
# Specify scheduler per compute call
result = computation.compute(scheduler='threads')
result = computation.compute(scheduler='processes')
result = computation.compute(scheduler='synchronous')
```
### Distributed Client
```python
from dask.distributed import Client
# Using client automatically sets distributed scheduler
client = Client()
# All computations use distributed scheduler
result = computation.compute()
client.close()
```
## Choosing the Right Scheduler
### Decision Matrix
| Workload Type | Recommended Scheduler | Rationale |
|--------------|----------------------|-----------|
| NumPy/Pandas operations | Threads (default) | GIL-releasing, shared memory |
| Pure Python objects | Processes | Avoids GIL contention |
| Text/log processing | Processes | Python-heavy operations |
| Debugging | Synchronous | Easy debugging, deterministic |
| Need dashboard | Local Distributed | Monitoring and diagnostics |
| Multi-machine | Cluster Distributed | Exceeds single machine capacity |
| Small data, quick tasks | Threads | Lowest overhead |
| Large data, single machine | Local Distributed | Better memory management |
### Performance Considerations
**Threads**:
- Overhead: ~10 µs per task
- Best for: Numeric operations
- Memory: Shared
- GIL: Affected by GIL
**Processes**:
- Overhead: ~10 ms per task
- Best for: Python operations
- Memory: Copied between processes
- GIL: Not affected
**Synchronous**:
- Overhead: ~1 µs per task
- Best for: Debugging
- Memory: No parallelism
- GIL: Not relevant
**Distributed**:
- Overhead: ~1 ms per task
- Best for: Complex workflows, monitoring
- Memory: Managed by scheduler
- GIL: Workers can use threads or processes
## Thread Configuration for Distributed Scheduler
### Setting Thread Count
```python
from dask.distributed import Client
# Control thread/worker configuration
client = Client(
    n_workers=4,          # Number of worker processes
    threads_per_worker=2  # Threads per worker process
)
```
### Recommended Configuration
**For Numeric Workloads**:
- Aim for roughly 4 threads per process
- Balance between parallelism and overhead
- Example: 8 cores → 2 workers with 4 threads each
**For Python Workloads**:
- Use more workers with fewer threads
- Example: 8 cores → 8 workers with 1 thread each
### Command-Line Configuration
```bash
# Worker and thread counts can also be set when starting workers from the CLI
dask worker scheduler-address:8786 --nworkers 4 --nthreads 2
```
## Common Patterns
### Development to Production
```python
# Development: Use local distributed for testing
from dask.distributed import Client
client = Client(processes=False) # In-process for debugging
# Production: Scale to cluster
from dask.distributed import Client
client = Client('scheduler-address:8786')
```
### Mixed Workloads
```python
import dask
import dask.dataframe as dd
# Use threads for DataFrame operations
ddf = dd.read_parquet('data.parquet')
result1 = ddf.mean().compute(scheduler='threads')
# Use processes for Python code
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(parse_log).compute(scheduler='processes')
```
### Debugging Workflow
```python
import dask
# Step 1: Debug with synchronous scheduler
dask.config.set(scheduler='synchronous')
result = problematic_computation.compute()
# Step 2: Test with threads
dask.config.set(scheduler='threads')
result = computation.compute()
# Step 3: Scale with distributed
from dask.distributed import Client
client = Client()
result = computation.compute()
```
## Monitoring and Diagnostics
### Dashboard Access (Distributed Only)
```python
from dask.distributed import Client
client = Client()
# Get dashboard URL
print(client.dashboard_link)
# Opens dashboard in browser showing:
# - Task progress
# - Worker status
# - Memory usage
# - Task stream
# - Resource utilization
```
### Performance Profiling
```python
# Profile computation
from dask.distributed import Client
client = Client()
result = computation.compute()
# Get performance report
client.profile(filename='profile.html')
```
### Resource Monitoring
```python
# Check worker info
client.scheduler_info()
# Get current tasks
client.who_has()
# Memory usage on each worker (requires psutil)
import psutil
client.run(lambda: psutil.virtual_memory().percent)
```
## Advanced Configuration
### Custom Executors
```python
from concurrent.futures import ThreadPoolExecutor
import dask
# Use custom thread pool
with ThreadPoolExecutor(max_workers=4) as executor:
    dask.config.set(pool=executor)
    result = computation.compute(scheduler='threads')
```
### Adaptive Scaling (Distributed)
```python
from dask.distributed import Client
client = Client()
# Enable adaptive scaling
client.cluster.adapt(minimum=2, maximum=10)
# Cluster scales based on workload
result = computation.compute()
```
### Worker Plugins
```python
from dask.distributed import Client, WorkerPlugin
class CustomPlugin(WorkerPlugin):
    def setup(self, worker):
        # Initialize worker-specific resources
        worker.custom_resource = initialize_resource()
client = Client()
client.register_worker_plugin(CustomPlugin())
```
## Troubleshooting
### Slow Performance with Threads
**Problem**: Pure Python code slow with threaded scheduler
**Solution**: Switch to processes or distributed scheduler
### Memory Errors with Processes
**Problem**: Data too large to pickle/copy between processes
**Solution**: Use threaded or distributed scheduler
### Debugging Difficult
**Problem**: Can't use pdb with parallel schedulers
**Solution**: Use synchronous scheduler for debugging
### Task Overhead High
**Problem**: Many tiny tasks causing overhead
**Solution**: Use threaded scheduler (lowest overhead) or increase chunk sizes