Files
gh-k-dense-ai-claude-scient…/skills/dask/references/bags.md
2025-11-30 08:30:10 +08:00

469 lines
11 KiB
Markdown

# Dask Bags
## Overview
Dask Bag implements functional operations including `map`, `filter`, `fold`, and `groupby` on generic Python objects. It processes data in parallel while maintaining a small memory footprint through Python iterators. Bags function as "a parallel version of PyToolz or a Pythonic version of the PySpark RDD."
## Core Concept
A Dask Bag is a collection of Python objects distributed across partitions:
- Each partition contains generic Python objects
- Operations use functional programming patterns
- Processing uses streaming/iterators for memory efficiency
- Ideal for unstructured or semi-structured data
## Key Capabilities
### Functional Operations
- `map`: Transform each element
- `filter`: Select elements based on condition
- `fold`: Reduce elements with combining function
- `groupby`: Group elements by key
- `pluck`: Extract fields from records
- `flatten`: Flatten nested structures
### Use Cases
- Text processing and log analysis
- JSON record processing
- ETL on unstructured data
- Data cleaning before structured analysis
## When to Use Dask Bags
**Use Bags When**:
- Working with general Python objects requiring flexible computation
- Data doesn't fit structured array or tabular formats
- Processing text, JSON, or custom Python objects
- Initial data cleaning and ETL is needed
- Memory-efficient streaming is important
**Use Other Collections When**:
- Data is structured (use DataFrames instead)
- Numeric computing (use Arrays instead)
- Operations require complex groupby or shuffles (use DataFrames)
**Key Recommendation**: Use Bag to clean and process data, then transform it into an array or DataFrame before embarking on more complex operations that require shuffle steps.
## Important Limitations
Bags sacrifice performance for generality:
- Rely on multiprocessing scheduling (not threads)
- Remain immutable (create new bags for changes)
- Operate slower than array/DataFrame equivalents
- Handle `groupby` inefficiently (use `foldby` when possible)
- Operations requiring substantial inter-worker communication are slow
## Creating Bags
### From Sequences
```python
import dask.bag as db
# From Python list
bag = db.from_sequence([1, 2, 3, 4, 5], partition_size=2)
# From range
bag = db.from_sequence(range(10000), partition_size=1000)
```
### From Text Files
```python
# Single file
bag = db.read_text('data.txt')
# Multiple files with glob
bag = db.read_text('data/*.txt')
# With encoding
bag = db.read_text('data/*.txt', encoding='utf-8')
# Custom line processing
bag = db.read_text('logs/*.log', blocksize='64MB')
```
### From Delayed Objects
```python
import dask
@dask.delayed
def load_data(filename):
with open(filename) as f:
return [line.strip() for line in f]
files = ['file1.txt', 'file2.txt', 'file3.txt']
partitions = [load_data(f) for f in files]
bag = db.from_delayed(partitions)
```
### From Custom Sources
```python
# From any iterable-producing function
def read_json_files():
import json
for filename in glob.glob('data/*.json'):
with open(filename) as f:
yield json.load(f)
# Create bag from generator
bag = db.from_sequence(read_json_files(), partition_size=10)
```
## Common Operations
### Map (Transform)
```python
import dask.bag as db
bag = db.read_text('data/*.json')
# Parse JSON
import json
parsed = bag.map(json.loads)
# Extract field
values = parsed.map(lambda x: x['value'])
# Complex transformation
def process_record(record):
return {
'id': record['id'],
'value': record['value'] * 2,
'category': record.get('category', 'unknown')
}
processed = parsed.map(process_record)
```
### Filter
```python
# Filter by condition
valid = parsed.filter(lambda x: x['status'] == 'valid')
# Multiple conditions
filtered = parsed.filter(lambda x: x['value'] > 100 and x['year'] == 2024)
# Filter with custom function
def is_valid_record(record):
return record.get('status') == 'valid' and record.get('value') is not None
valid_records = parsed.filter(is_valid_record)
```
### Pluck (Extract Fields)
```python
# Extract single field
ids = parsed.pluck('id')
# Extract multiple fields (creates tuples)
key_pairs = parsed.pluck(['id', 'value'])
```
### Flatten
```python
# Flatten nested lists
nested = db.from_sequence([[1, 2], [3, 4], [5, 6]])
flat = nested.flatten() # [1, 2, 3, 4, 5, 6]
# Flatten after map
bag = db.read_text('data/*.txt')
words = bag.map(str.split).flatten() # All words from all files
```
### GroupBy (Expensive)
```python
# Group by key (requires shuffle)
grouped = parsed.groupby(lambda x: x['category'])
# Aggregate after grouping
counts = grouped.map(lambda key_items: (key_items[0], len(list(key_items[1]))))
result = counts.compute()
```
### FoldBy (Preferred for Aggregations)
```python
# FoldBy is more efficient than groupby for aggregations
def add(acc, item):
return acc + item['value']
def combine(acc1, acc2):
return acc1 + acc2
# Sum values by category
sums = parsed.foldby(
key='category',
binop=add,
initial=0,
combine=combine
)
result = sums.compute()
```
### Reductions
```python
# Count elements
count = bag.count().compute()
# Get all distinct values (requires memory)
distinct = bag.distinct().compute()
# Take first n elements
first_ten = bag.take(10)
# Fold/reduce
total = bag.fold(
lambda acc, x: acc + x['value'],
initial=0,
combine=lambda a, b: a + b
).compute()
```
## Converting to Other Collections
### To DataFrame
```python
import dask.bag as db
import dask.dataframe as dd
# Bag of dictionaries
bag = db.read_text('data/*.json').map(json.loads)
# Convert to DataFrame
ddf = bag.to_dataframe()
# With explicit columns
ddf = bag.to_dataframe(meta={'id': int, 'value': float, 'category': str})
```
### To List/Compute
```python
# Compute to Python list (loads all in memory)
result = bag.compute()
# Take sample
sample = bag.take(100)
```
## Common Patterns
### JSON Processing
```python
import dask.bag as db
import json
# Read and parse JSON files
bag = db.read_text('logs/*.json')
parsed = bag.map(json.loads)
# Filter valid records
valid = parsed.filter(lambda x: x.get('status') == 'success')
# Extract relevant fields
processed = valid.map(lambda x: {
'user_id': x['user']['id'],
'timestamp': x['timestamp'],
'value': x['metrics']['value']
})
# Convert to DataFrame for analysis
ddf = processed.to_dataframe()
# Analyze
summary = ddf.groupby('user_id')['value'].mean().compute()
```
### Log Analysis
```python
# Read log files
logs = db.read_text('logs/*.log')
# Parse log lines
def parse_log_line(line):
parts = line.split(' ')
return {
'timestamp': parts[0],
'level': parts[1],
'message': ' '.join(parts[2:])
}
parsed_logs = logs.map(parse_log_line)
# Filter errors
errors = parsed_logs.filter(lambda x: x['level'] == 'ERROR')
# Count by message pattern
error_counts = errors.foldby(
key='message',
binop=lambda acc, x: acc + 1,
initial=0,
combine=lambda a, b: a + b
)
result = error_counts.compute()
```
### Text Processing
```python
# Read text files
text = db.read_text('documents/*.txt')
# Split into words
words = text.map(str.lower).map(str.split).flatten()
# Count word frequencies
def increment(acc, word):
return acc + 1
def combine_counts(a, b):
return a + b
word_counts = words.foldby(
key=lambda word: word,
binop=increment,
initial=0,
combine=combine_counts
)
# Get top words
top_words = word_counts.compute()
sorted_words = sorted(top_words, key=lambda x: x[1], reverse=True)[:100]
```
### Data Cleaning Pipeline
```python
import dask.bag as db
import json
# Read raw data
raw = db.read_text('raw_data/*.json').map(json.loads)
# Validation function
def is_valid(record):
required_fields = ['id', 'timestamp', 'value']
return all(field in record for field in required_fields)
# Cleaning function
def clean_record(record):
return {
'id': int(record['id']),
'timestamp': record['timestamp'],
'value': float(record['value']),
'category': record.get('category', 'unknown'),
'tags': record.get('tags', [])
}
# Pipeline
cleaned = (raw
.filter(is_valid)
.map(clean_record)
.filter(lambda x: x['value'] > 0)
)
# Convert to DataFrame
ddf = cleaned.to_dataframe()
# Save cleaned data
ddf.to_parquet('cleaned_data/')
```
## Performance Considerations
### Efficient Operations
- Map, filter, pluck: Very efficient (streaming)
- Flatten: Efficient
- FoldBy with good key distribution: Reasonable
- Take and head: Efficient (only processes needed partitions)
### Expensive Operations
- GroupBy: Requires shuffle, can be slow
- Distinct: Requires collecting all unique values
- Operations requiring full data materialization
### Optimization Tips
**1. Use FoldBy Instead of GroupBy**
```python
# Better: Use foldby for aggregations
result = bag.foldby(key='category', binop=add, initial=0, combine=sum)
# Worse: GroupBy then reduce
result = bag.groupby('category').map(lambda x: (x[0], sum(x[1])))
```
**2. Convert to DataFrame Early**
```python
# For structured operations, convert to DataFrame
bag = db.read_text('data/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
ddf = bag.to_dataframe() # Now use efficient DataFrame operations
```
**3. Control Partition Size**
```python
# Balance between too many and too few partitions
bag = db.read_text('data/*.txt', blocksize='64MB') # Reasonable partition size
```
**4. Use Lazy Evaluation**
```python
# Chain operations before computing
result = (bag
.map(process1)
.filter(condition)
.map(process2)
.compute() # Single compute at the end
)
```
## Debugging Tips
### Inspect Partitions
```python
# Get number of partitions
print(bag.npartitions)
# Take sample
sample = bag.take(10)
print(sample)
```
### Validate on Small Data
```python
# Test logic on small subset
small_bag = db.from_sequence(sample_data, partition_size=10)
result = process_pipeline(small_bag).compute()
# Validate results, then scale
```
### Check Intermediate Results
```python
# Compute intermediate steps to debug
step1 = bag.map(parse).take(5)
print("After parsing:", step1)
step2 = bag.map(parse).filter(validate).take(5)
print("After filtering:", step2)
```
## Memory Management
Bags are designed for memory-efficient processing:
```python
# Streaming processing - doesn't load all in memory
bag = db.read_text('huge_file.txt') # Lazy
processed = bag.map(process_line) # Still lazy
result = processed.compute() # Processes in chunks
```
For very large results, avoid computing to memory:
```python
# Don't compute huge results to memory
# result = bag.compute() # Could overflow memory
# Instead, convert and save to disk
ddf = bag.to_dataframe()
ddf.to_parquet('output/')
```