skills/dask/references/futures.md
# Dask Futures

## Overview

Dask futures extend Python's `concurrent.futures` interface and execute tasks immediately rather than lazily. Unlike Dask's lazy collections (delayed, DataFrames, Arrays, and Bags), futures suit computations that evolve over time or whose workflow must be built dynamically.

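
Because Dask futures follow the `concurrent.futures` interface, the standard-library version is a reasonable way to see the submit/result pattern in isolation. The sketch below uses `ThreadPoolExecutor` from the standard library, not Dask, and the `add` function is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def add(x, y):
    return x + y


with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules the call immediately and returns a future
    future = executor.submit(add, 1, 2)
    total = future.result()  # blocks until the task finishes

    # Several futures can be consumed in completion order
    futures = [executor.submit(add, i, i) for i in range(5)]
    doubled = sorted(f.result() for f in as_completed(futures))

print(total, doubled)
```

With Dask, `client.submit` and `future.result()` play the same roles on a cluster instead of a local thread pool.
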
## Core Concept

Futures represent real-time task execution:
- Tasks execute immediately when submitted (not lazy)
- Each future represents a remote computation result
- Automatic dependency tracking between futures
- Enables dynamic, evolving workflows
- Direct control over task scheduling and data placement

## Key Capabilities

### Real-Time Execution
- Tasks run immediately when submitted
- No need for an explicit `.compute()` call
- Get results with the `.result()` method

### Automatic Dependency Management
When you submit a task whose inputs are futures, Dask tracks the dependencies automatically. Once all input futures have completed, their results are moved onto a single worker (if necessary) and the dependent task runs there.

### Dynamic Workflows
Build computations that evolve based on intermediate results:
- Submit new tasks based on previous results
- Conditional execution paths
- Iterative algorithms with varying structure

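
As a rough illustration of a workflow that evolves with intermediate results, the sketch below keeps submitting refinement steps until the result stops changing. The names `refine` and `refine_until` are hypothetical. It is written against the `concurrent.futures`-style `submit`/`result` calls, so it runs here with a standard-library `ThreadPoolExecutor`; passing a Dask `Client` instead should work the same way, since it offers the same two calls.

```python
from concurrent.futures import ThreadPoolExecutor


def refine(estimate, target):
    """One Newton step toward sqrt(target) (illustrative task)."""
    return 0.5 * (estimate + target / estimate)


def refine_until(executor, target, tolerance=1e-9, max_steps=50):
    """Submit refinement tasks one at a time, deciding at runtime whether to continue."""
    estimate, steps = target, 0
    while steps < max_steps:
        future = executor.submit(refine, estimate, target)
        new_estimate = future.result()  # inspect the intermediate result
        steps += 1
        converged = abs(new_estimate - estimate) < tolerance
        estimate = new_estimate
        if converged:  # stop submitting once the result stops changing
            break
    return estimate, steps


with ThreadPoolExecutor(max_workers=1) as executor:
    root, steps = refine_until(executor, 2.0)

print(root, steps)
```

The number of tasks is not known in advance: each submission depends on the result of the previous one, which is the situation a static task graph handles poorly.
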
## When to Use Futures

**Use Futures When**:
- Building dynamic, evolving workflows
- Need immediate task execution (not lazy)
- Computations depend on runtime conditions
- Require fine control over task placement
- Implementing custom parallel algorithms
- Need stateful computations (with actors)

**Use Other Collections When**:
- Static, predefined computation graphs (use delayed, DataFrames, Arrays)
- Simple data parallelism on large collections (use Bags, DataFrames)
- Standard array/dataframe operations suffice

## Setting Up Client

Futures require a distributed client. Pick one of the following:

```python
from dask.distributed import Client

# Local cluster (on a single machine)
client = Client()

# Or specify resources
client = Client(n_workers=4, threads_per_worker=2)

# Or connect to an existing cluster
client = Client('scheduler-address:8786')
```

## Submitting Tasks

### Basic Submit
```python
from dask.distributed import Client

client = Client()

# Submit single task
def add(x, y):
    return x + y

future = client.submit(add, 1, 2)

# Get result
result = future.result()  # Blocks until complete
print(result)  # 3
```

### Multiple Tasks
```python
# Submit multiple independent tasks
futures = []
for i in range(10):
    future = client.submit(add, i, i)
    futures.append(future)

# Gather results
results = client.gather(futures)  # Efficient parallel gathering
```

### Map Over Inputs
```python
# Apply function to multiple inputs
def square(x):
    return x ** 2

# Submit batch of tasks
futures = client.map(square, range(100))

# Gather results
results = client.gather(futures)
```

**Note**: Each task carries roughly 1 ms of overhead, so `map` is a poor fit for millions of tiny tasks. For very large datasets, use Bags or DataFrames instead.

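
One common way to reduce per-task overhead is to group many small inputs into batches and submit one task per batch. The sketch below shows the idea; `chunked`, `square_batch`, and `run_batched` are hypothetical helper names, and the default batch size of 1,000 is an arbitrary assumption. The Dask calls are confined to `run_batched`, which expects an already connected `Client`.

```python
def chunked(items, size):
    """Split a sequence into consecutive lists of at most `size` items."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


def square_batch(batch):
    """Process a whole batch inside a single task."""
    return [x ** 2 for x in batch]


def run_batched(client, items, size=1000):
    """Submit one task per batch instead of one task per item.

    `client` is assumed to be a dask.distributed.Client.
    """
    futures = client.map(square_batch, chunked(items, size))
    batches = client.gather(futures)
    # Flatten the per-batch results back into a single list
    return [value for batch in batches for value in batch]
```

With 1,000,000 inputs and `size=1000`, `run_batched(client, range(1_000_000))` submits 1,000 tasks rather than 1,000,000.
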
## Working with Futures

### Check Status
```python
# expensive_function and arg are placeholders for your own code
future = client.submit(expensive_function, arg)

# Check if complete
print(future.done())  # False or True

# Check status
print(future.status)  # 'pending', 'finished', 'error', 'cancelled', or 'lost'
```

### Non-Blocking Result Retrieval
```python
# Non-blocking check
if future.done():
    result = future.result()
else:
    print("Still computing...")

# Or use callbacks
def handle_result(future):
    print(f"Result: {future.result()}")

future.add_done_callback(handle_result)
```

### Error Handling
```python
def might_fail(x):
    if x < 0:
        raise ValueError("Negative value")
    return x ** 2

future = client.submit(might_fail, -5)

try:
    result = future.result()
except ValueError as e:
    print(f"Task failed: {e}")
```

## Task Dependencies

### Automatic Dependency Tracking
```python
# Submit task
future1 = client.submit(add, 1, 2)

# Use future as input (creates dependency)
future2 = client.submit(add, future1, 10)  # Depends on future1

# Chain dependencies
future3 = client.submit(add, future2, 100)  # Depends on future2

# Get final result
result = future3.result()  # 113
```

### Complex Dependencies
```python
# Multiple dependencies
a = client.submit(func1, x)
b = client.submit(func2, y)
c = client.submit(func3, a, b)  # Depends on both a and b

result = c.result()
```

## Data Movement Optimization

### Scatter Data
Pre-scatter important data to avoid repeated transfers:

```python
# Upload data to cluster once
large_dataset = client.scatter(big_data)  # Returns future

# Use scattered data in multiple tasks
futures = [client.submit(process, large_dataset, i) for i in range(100)]

# Each task reuses the scattered data; it is not re-uploaded from the client
results = client.gather(futures)
```

### Efficient Gathering
Use `client.gather()` for concurrent result collection:

```python
# Better: Gather all at once (parallel)
results = client.gather(futures)

# Worse: Sequential result retrieval
results = [f.result() for f in futures]
```

## Fire-and-Forget

For tasks that run only for their side effects, where the result is not needed:

```python
from dask.distributed import fire_and_forget

def log_to_database(data):
    # Write to database, no return value needed
    # (database is a placeholder for your own connection object)
    database.write(data)

# Submit without keeping reference
future = client.submit(log_to_database, data)
fire_and_forget(future)

# Dask won't abandon this computation even without an active future reference
```

## Performance Characteristics

### Task Overhead
- ~1ms overhead per task
- Good for: Thousands of tasks
- Not suitable for: Millions of tiny tasks

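
To make the trade-off concrete, the arithmetic below estimates total overhead from the ~1 ms per-task figure above. The function names and the treatment of overhead as a fixed cost per task are simplifying assumptions, not a model of the Dask scheduler.

```python
PER_TASK_OVERHEAD_S = 0.001  # ~1 ms per task, the figure quoted above


def overhead_seconds(n_tasks, per_task=PER_TASK_OVERHEAD_S):
    """Rough total overhead for n_tasks tasks."""
    return n_tasks * per_task


def overhead_fraction(task_seconds, per_task=PER_TASK_OVERHEAD_S):
    """Share of each task's wall time that is overhead."""
    return per_task / (task_seconds + per_task)


print(overhead_seconds(10_000))     # thousands of tasks: about 10 s in total
print(overhead_seconds(5_000_000))  # millions of tasks: about 5000 s in total
print(overhead_fraction(0.0001))    # 0.1 ms tasks: overhead dominates
print(overhead_fraction(1.0))       # 1 s tasks: overhead is negligible
```

Under these assumptions, tasks that run for much less than a millisecond spend most of their time on overhead, which is why many tiny tasks are better grouped or handled by a collection.
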
### Worker-to-Worker Communication
- Direct worker-to-worker data transfer
- Roundtrip latency: ~1ms
- Efficient for task dependencies

### Memory Management
Dask tracks active futures locally. When a future is garbage collected by your local Python session, Dask is free to delete the corresponding data from the cluster.

**Keep References**:
```python
# Keep reference to prevent deletion
important_result = client.submit(expensive_calc, data)

# Use result multiple times
future1 = client.submit(process1, important_result)
future2 = client.submit(process2, important_result)
```

## Advanced Coordination

### Distributed Primitives

**Queues**:
```python
from dask.distributed import Queue

queue = Queue()

def producer():
    for i in range(10):
        queue.put(i)

def consumer():
    results = []
    for _ in range(10):
        results.append(queue.get())
    return results

# Submit tasks
client.submit(producer)
result_future = client.submit(consumer)
results = result_future.result()
```

**Locks**:
```python
from dask.distributed import Lock

lock = Lock()

def critical_section():
    with lock:
        # Only one task executes this at a time
        # (shared_resource is a placeholder for your own object)
        shared_resource.update()
```

**Events**:
```python
import time

from dask.distributed import Event

event = Event()

def waiter():
    event.wait()  # Blocks until event is set
    return "Event occurred"

def setter():
    time.sleep(5)
    event.set()

# Start both tasks
wait_future = client.submit(waiter)
set_future = client.submit(setter)

result = wait_future.result()  # Returns once setter has set the event
```

**Variables**:
```python
from dask.distributed import Variable

var = Variable('my-var')

# Set value
var.set(42)

# Get value from tasks
def reader():
    return var.get()

future = client.submit(reader)
print(future.result())  # 42
```

## Actors

For stateful, rapidly changing workflows, actors keep state on a single worker and let workers call them directly, with roundtrip latency around 1ms and without going through the scheduler.

### Creating Actors
```python
from dask.distributed import Client

client = Client()

class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

    def get_count(self):
        return self.count

# Create actor on worker
counter = client.submit(Counter, actor=True).result()

# Call methods (each call returns a future-like object)
future1 = counter.increment()
future2 = counter.increment()
result = counter.get_count().result()
print(result)  # 2
```

### Actor Use Cases
- Stateful services (databases, caches)
- Rapidly changing state
- Complex coordination patterns
- Real-time streaming applications

## Common Patterns

### Embarrassingly Parallel Tasks
```python
from dask.distributed import Client

client = Client()

def process_item(item):
    # Independent computation
    return expensive_computation(item)

# Process many items in parallel
items = range(1000)
futures = client.map(process_item, items)

# Gather all results
results = client.gather(futures)
```

### Dynamic Task Submission
```python
from dask.distributed import worker_client

def recursive_compute(data, depth):
    if depth == 0:
        return process(data)

    # Split and recurse
    left, right = split(data)

    # worker_client() gives the running task its own client and secedes from
    # the worker's thread pool, so waiting on child tasks cannot deadlock
    with worker_client() as client:
        left_future = client.submit(recursive_compute, left, depth - 1)
        right_future = client.submit(recursive_compute, right, depth - 1)
        left_result, right_result = client.gather([left_future, right_future])

    # Combine results
    return combine(left_result, right_result)

# Start computation
result_future = client.submit(recursive_compute, initial_data, 5)
result = result_future.result()
```

### Parameter Sweep
```python
from itertools import product

from dask.distributed import as_completed

def run_simulation(param1, param2, param3):
    # Run simulation with parameters
    return simulate(param1, param2, param3)

# Generate parameter combinations
params = product(range(10), range(10), range(10))

# Submit all combinations
futures = [client.submit(run_simulation, p1, p2, p3) for p1, p2, p3 in params]

# Handle results as they complete
for future in as_completed(futures):
    result = future.result()
    process_result(result)
```

### Pipeline with Dependencies
```python
# Stage 1: Load data
load_futures = [client.submit(load_data, file) for file in files]

# Stage 2: Process (depends on stage 1)
process_futures = [client.submit(process, f) for f in load_futures]

# Stage 3: Aggregate (depends on stage 2)
agg_future = client.submit(aggregate, process_futures)

# Get final result
result = agg_future.result()
```

### Iterative Algorithm
```python
# Initialize
state = client.scatter(initial_state)

# Iterate
for iteration in range(num_iterations):
    # Compute update based on current state
    state = client.submit(update_function, state)

    # Check convergence
    converged = client.submit(check_convergence, state)
    if converged.result():
        break

# Get final state
final_state = state.result()
```

## Best Practices

### 1. Pre-scatter Large Data
```python
# Upload once, use many times
large_data = client.scatter(big_dataset)
futures = [client.submit(process, large_data, i) for i in range(100)]
```

### 2. Use Gather for Bulk Retrieval
```python
# Efficient: Parallel gathering
results = client.gather(futures)

# Inefficient: Sequential
results = [f.result() for f in futures]
```

### 3. Manage Memory with References
```python
# Keep important futures
important = client.submit(expensive_calc, data)

# Use multiple times
f1 = client.submit(use_result, important)
f2 = client.submit(use_result, important)

# Clean up when done
del important
```

### 4. Handle Errors Appropriately
```python
from dask.distributed import as_completed

futures = client.map(might_fail, inputs)

# Collect successes and failures separately
results = []
errors = []
for future in as_completed(futures):
    try:
        results.append(future.result())
    except Exception as e:
        errors.append(e)
```

### 5. Use as_completed for Progressive Processing
```python
from dask.distributed import as_completed

futures = client.map(process, items)

# Process results as they arrive
for future in as_completed(futures):
    result = future.result()
    handle_result(result)
```

## Debugging Tips

### Monitor Dashboard
View the Dask dashboard to see:
- Task progress
- Worker utilization
- Memory usage
- Task dependencies

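
To find the dashboard for a running client, one option is to read the client's `dashboard_link` attribute. The helper below is a minimal sketch; `describe_dashboard` is a hypothetical name, and `client` is assumed to be an already connected `dask.distributed.Client`.

```python
def describe_dashboard(client):
    """Return a short message with the dashboard URL for an existing Client."""
    # dashboard_link is the Client attribute that holds the dashboard URL
    return f"Dask dashboard: {client.dashboard_link}"
```

Calling `print(describe_dashboard(client))` gives a URL to open in a browser; the exact address depends on how the cluster was started.
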
### Check Task Status
```python
import traceback

# Inspect future
print(future.status)
print(future.done())

# Get traceback on error
try:
    future.result()
except Exception:
    # future.traceback() returns a traceback object; format it for display
    print("".join(traceback.format_tb(future.traceback())))
```

### Profile Tasks
```python
# Get performance data
client.profile(filename='profile.html')
```