# Dask Bags

## Overview

Dask Bag implements functional operations including `map`, `filter`, `fold`, and `groupby` on generic Python objects. It processes data in parallel while maintaining a small memory footprint through Python iterators. Bags function as "a parallel version of PyToolz or a Pythonic version of the PySpark RDD."

## Core Concept

A Dask Bag is a collection of Python objects distributed across partitions:

- Each partition contains generic Python objects
- Operations use functional programming patterns
- Processing uses streaming/iterators for memory efficiency
- Ideal for unstructured or semi-structured data
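
To make partitioning and lazy evaluation concrete, here is a minimal sketch; the records and partition count are invented for illustration:

```python
import dask.bag as db

# An illustrative bag of dicts split across 2 partitions
records = [{"name": "alice", "score": 10}, {"name": "bob", "score": 7},
           {"name": "carol", "score": 3}, {"name": "dan", "score": 12}]
bag = db.from_sequence(records, npartitions=2)

print(bag.npartitions)  # 2

# Operations build a lazy task graph; nothing runs until compute()
high = bag.filter(lambda r: r["score"] >= 10).map(lambda r: r["name"])
print(high.compute())  # ['alice', 'dan']
```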

## Key Capabilities

### Functional Operations

- `map`: Transform each element
- `filter`: Select elements based on condition
- `fold`: Reduce elements with combining function
- `groupby`: Group elements by key
- `pluck`: Extract fields from records
- `flatten`: Flatten nested structures
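
These primitives compose into small pipelines. A sketch chaining a few of them on invented order records:

```python
import dask.bag as db

# Invented example records
orders = db.from_sequence([
    {"id": 1, "items": ["apple", "pear"], "total": 12.5},
    {"id": 2, "items": ["fig"], "total": 4.0},
    {"id": 3, "items": ["apple", "kiwi", "plum"], "total": 20.0},
])

# filter -> pluck -> flatten: all items from orders over 10
items = (orders
         .filter(lambda o: o["total"] > 10)
         .pluck("items")
         .flatten())
print(items.compute())  # ['apple', 'pear', 'apple', 'kiwi', 'plum']

# fold: sum all order totals
total = orders.fold(lambda acc, o: acc + o["total"],
                    combine=lambda a, b: a + b, initial=0)
print(total.compute())  # 36.5
```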

### Use Cases

- Text processing and log analysis
- JSON record processing
- ETL on unstructured data
- Data cleaning before structured analysis

## When to Use Dask Bags

**Use Bags When**:

- Working with general Python objects requiring flexible computation
- Data doesn't fit structured array or tabular formats
- Processing text, JSON, or custom Python objects
- Initial data cleaning and ETL is needed
- Memory-efficient streaming is important

**Use Other Collections When**:

- Data is structured (use DataFrames instead)
- Numeric computing (use Arrays instead)
- Operations require complex groupby or shuffles (use DataFrames)

**Key Recommendation**: Use Bag to clean and process data, then transform it into an array or DataFrame before embarking on more complex operations that require shuffle steps.
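
A minimal sketch of that hand-off, with placeholder paths and field names (the Data Cleaning Pipeline pattern later in this document shows the full version):

```python
import json
import dask.bag as db

# Clean messy records with Bag, then hand off to a DataFrame
records = db.read_text("raw/*.json").map(json.loads)        # placeholder path
cleaned = records.filter(lambda r: "value" in r)

ddf = cleaned.to_dataframe()                                 # structured from here on
result = ddf.groupby("category")["value"].mean().compute()  # shuffle-heavy work in DataFrame land
```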

## Important Limitations

Bags sacrifice performance for generality:

- Rely on the multiprocessing scheduler by default, not threads (see the scheduler sketch after this list)
- Remain immutable (create new bags for changes)
- Operate slower than array/DataFrame equivalents
- Handle `groupby` inefficiently (use `foldby` when possible)
- Operations requiring substantial inter-worker communication are slow
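
Because the default scheduler is process-based, you may want to choose a different scheduler for a particular workload. A sketch of the standard options, assuming no distributed cluster is in play:

```python
import dask
import dask.bag as db

bag = db.from_sequence(range(1000), npartitions=4)
squares = bag.map(lambda x: x * x)

# Default for Bag: the multiprocessing scheduler
result = squares.compute()

# Force threads (useful when functions release the GIL or are I/O-bound)
result = squares.compute(scheduler="threads")

# Single-threaded scheduler, handy for debugging with pdb
result = squares.compute(scheduler="synchronous")

# Or set a default globally
dask.config.set(scheduler="processes")
```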

## Creating Bags

### From Sequences

```python
import dask.bag as db

# From Python list
bag = db.from_sequence([1, 2, 3, 4, 5], partition_size=2)

# From range
bag = db.from_sequence(range(10000), partition_size=1000)
```

### From Text Files

```python
# Single file
bag = db.read_text('data.txt')

# Multiple files with glob
bag = db.read_text('data/*.txt')

# With encoding
bag = db.read_text('data/*.txt', encoding='utf-8')

# Control partition size with blocksize
bag = db.read_text('logs/*.log', blocksize='64MB')
```

### From Delayed Objects

```python
import dask
import dask.bag as db

@dask.delayed
def load_data(filename):
    with open(filename) as f:
        return [line.strip() for line in f]

files = ['file1.txt', 'file2.txt', 'file3.txt']
partitions = [load_data(f) for f in files]
bag = db.from_delayed(partitions)
```

### From Custom Sources

```python
import glob
import json

# From any iterable-producing function
def read_json_files():
    for filename in glob.glob('data/*.json'):
        with open(filename) as f:
            yield json.load(f)

# Create bag from generator
# (from_sequence consumes the generator locally, then partitions the results)
bag = db.from_sequence(read_json_files(), partition_size=10)
```

## Common Operations

### Map (Transform)

```python
import json

import dask.bag as db

bag = db.read_text('data/*.json')

# Parse JSON
parsed = bag.map(json.loads)

# Extract field
values = parsed.map(lambda x: x['value'])

# Complex transformation
def process_record(record):
    return {
        'id': record['id'],
        'value': record['value'] * 2,
        'category': record.get('category', 'unknown')
    }

processed = parsed.map(process_record)
```

### Filter

```python
# Filter by condition
valid = parsed.filter(lambda x: x['status'] == 'valid')

# Multiple conditions
filtered = parsed.filter(lambda x: x['value'] > 100 and x['year'] == 2024)

# Filter with custom function
def is_valid_record(record):
    return record.get('status') == 'valid' and record.get('value') is not None

valid_records = parsed.filter(is_valid_record)
```

### Pluck (Extract Fields)

```python
# Extract single field
ids = parsed.pluck('id')

# Extract multiple fields (creates tuples)
key_pairs = parsed.pluck(['id', 'value'])
```

### Flatten

```python
# Flatten nested lists
nested = db.from_sequence([[1, 2], [3, 4], [5, 6]])
flat = nested.flatten()  # [1, 2, 3, 4, 5, 6]

# Flatten after map
bag = db.read_text('data/*.txt')
words = bag.map(str.split).flatten()  # All words from all files
```

### GroupBy (Expensive)

```python
# Group by key (requires shuffle)
grouped = parsed.groupby(lambda x: x['category'])

# Aggregate after grouping: groupby yields (key, list-of-items) pairs
counts = grouped.map(lambda key_items: (key_items[0], len(list(key_items[1]))))
result = counts.compute()
```

### FoldBy (Preferred for Aggregations)

```python
# FoldBy is more efficient than groupby for aggregations.
# binop folds one item at a time into a per-group accumulator;
# combine merges accumulators produced on different partitions.
def add(acc, item):
    return acc + item['value']

def combine(acc1, acc2):
    return acc1 + acc2

# Sum values by category
sums = parsed.foldby(
    key='category',
    binop=add,
    initial=0,
    combine=combine
)

result = sums.compute()
```

### Reductions

```python
# Count elements
count = bag.count().compute()

# Get all distinct values (requires memory)
distinct = bag.distinct().compute()

# Take first n elements
first_ten = bag.take(10)

# Fold/reduce
total = bag.fold(
    lambda acc, x: acc + x['value'],
    initial=0,
    combine=lambda a, b: a + b
).compute()
```

## Converting to Other Collections

### To DataFrame

```python
import json

import dask.bag as db
import dask.dataframe as dd

# Bag of dictionaries
bag = db.read_text('data/*.json').map(json.loads)

# Convert to DataFrame
ddf = bag.to_dataframe()

# With explicit column dtypes (meta)
ddf = bag.to_dataframe(meta={'id': int, 'value': float, 'category': str})
```

### To List/Compute

```python
# Compute to Python list (loads all in memory)
result = bag.compute()

# Take sample
sample = bag.take(100)
```

## Common Patterns

### JSON Processing

```python
import json

import dask.bag as db

# Read and parse JSON files
bag = db.read_text('logs/*.json')
parsed = bag.map(json.loads)

# Filter valid records
valid = parsed.filter(lambda x: x.get('status') == 'success')

# Extract relevant fields
processed = valid.map(lambda x: {
    'user_id': x['user']['id'],
    'timestamp': x['timestamp'],
    'value': x['metrics']['value']
})

# Convert to DataFrame for analysis
ddf = processed.to_dataframe()

# Analyze
summary = ddf.groupby('user_id')['value'].mean().compute()
```

### Log Analysis

```python
# Read log files
logs = db.read_text('logs/*.log')

# Parse log lines
def parse_log_line(line):
    parts = line.split(' ')
    return {
        'timestamp': parts[0],
        'level': parts[1],
        'message': ' '.join(parts[2:])
    }

parsed_logs = logs.map(parse_log_line)

# Filter errors
errors = parsed_logs.filter(lambda x: x['level'] == 'ERROR')

# Count occurrences of each message
error_counts = errors.foldby(
    key='message',
    binop=lambda acc, x: acc + 1,
    initial=0,
    combine=lambda a, b: a + b
)

result = error_counts.compute()
```

### Text Processing

```python
# Read text files
text = db.read_text('documents/*.txt')

# Split into words
words = text.map(str.lower).map(str.split).flatten()

# Count word frequencies
def increment(acc, word):
    return acc + 1

def combine_counts(a, b):
    return a + b

word_counts = words.foldby(
    key=lambda word: word,
    binop=increment,
    initial=0,
    combine=combine_counts
)

# Get top words
top_words = word_counts.compute()
sorted_words = sorted(top_words, key=lambda x: x[1], reverse=True)[:100]
```

### Data Cleaning Pipeline

```python
import json

import dask.bag as db

# Read raw data
raw = db.read_text('raw_data/*.json').map(json.loads)

# Validation function
def is_valid(record):
    required_fields = ['id', 'timestamp', 'value']
    return all(field in record for field in required_fields)

# Cleaning function
def clean_record(record):
    return {
        'id': int(record['id']),
        'timestamp': record['timestamp'],
        'value': float(record['value']),
        'category': record.get('category', 'unknown'),
        'tags': record.get('tags', [])
    }

# Pipeline
cleaned = (raw
    .filter(is_valid)
    .map(clean_record)
    .filter(lambda x: x['value'] > 0)
)

# Convert to DataFrame
ddf = cleaned.to_dataframe()

# Save cleaned data
ddf.to_parquet('cleaned_data/')
```

## Performance Considerations

### Efficient Operations

- Map, filter, pluck: Very efficient (streaming)
- Flatten: Efficient
- FoldBy with good key distribution: Reasonable
- Take and head: Efficient (only processes needed partitions)

### Expensive Operations

- GroupBy: Requires shuffle, can be slow
- Distinct: Requires collecting all unique values
- Operations requiring full data materialization

### Optimization Tips

**1. Use FoldBy Instead of GroupBy**

```python
# Better: use foldby for aggregations (binop adds a record's value to the
# running total; combine merges partial totals from different partitions)
result = bag.foldby(key='category',
                    binop=lambda acc, x: acc + x['value'],
                    initial=0,
                    combine=lambda a, b: a + b)

# Worse: groupby then reduce (forces a full shuffle)
result = bag.groupby(lambda x: x['category']).map(
    lambda kv: (kv[0], sum(item['value'] for item in kv[1])))
```

**2. Convert to DataFrame Early**

```python
# For structured operations, convert to DataFrame
bag = db.read_text('data/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
ddf = bag.to_dataframe()  # Now use efficient DataFrame operations
```

**3. Control Partition Size**

```python
# Balance between too many and too few partitions
bag = db.read_text('data/*.txt', blocksize='64MB')  # Reasonable partition size
```

**4. Use Lazy Evaluation**

```python
# Chain operations before computing
# (process1, condition, process2 are placeholder functions)
result = (bag
    .map(process1)
    .filter(condition)
    .map(process2)
    .compute()  # Single compute at the end
)
```

## Debugging Tips

### Inspect Partitions

```python
# Get number of partitions
print(bag.npartitions)

# Take sample
sample = bag.take(10)
print(sample)
```

### Validate on Small Data

```python
# Test logic on a small subset
# (sample_data and process_pipeline are placeholders for your own data and pipeline)
small_bag = db.from_sequence(sample_data, partition_size=10)
result = process_pipeline(small_bag).compute()
# Validate results, then scale
```

### Check Intermediate Results

```python
# Compute intermediate steps to debug
step1 = bag.map(parse).take(5)
print("After parsing:", step1)

step2 = bag.map(parse).filter(validate).take(5)
print("After filtering:", step2)
```

## Memory Management

Bags are designed for memory-efficient processing:

```python
# Streaming processing - doesn't load all in memory
bag = db.read_text('huge_file.txt')  # Lazy
processed = bag.map(process_line)    # Still lazy
result = processed.compute()         # Processes in chunks
```

For very large results, avoid computing to memory:

```python
# Don't compute huge results to memory
# result = bag.compute()  # Could overflow memory

# Instead, convert and save to disk
ddf = bag.to_dataframe()
ddf.to_parquet('output/')
```