# Data Processing and Manipulation

This reference covers filtering, selections, virtual columns, expressions, aggregations, groupby operations, and data transformations in Vaex.

## Filtering and Selections

Vaex uses boolean expressions to filter data efficiently without copying:

### Basic Filtering

```python
# Simple filter
df_filtered = df[df.age > 25]

# Multiple conditions
df_filtered = df[(df.age > 25) & (df.salary > 50000)]
df_filtered = df[(df.category == 'A') | (df.category == 'B')]

# Negation
df_filtered = df[~(df.age < 18)]
```
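Filtering is lazy: a filtered DataFrame keeps a reference to the original data plus a boolean mask, so no rows are copied. A minimal sketch of what this implies, using hypothetical data; `extract()` is the Vaex method that drops the filtered-out rows so that later passes skip them entirely:

```python
import numpy as np
import vaex

# Hypothetical example data
df = vaex.from_arrays(age=np.array([15, 22, 35, 41]))

df_adults = df[df.age > 25]  # lazy: no data is copied
print(len(df_adults))        # 2 -- computed from the mask

# If many operations follow, extract() consolidates the filter
# so subsequent passes no longer evaluate the mask
df_adults = df_adults.extract()
```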
### Selection Objects

Vaex can maintain multiple named selections simultaneously:

```python
# Create named selection
df.select(df.age > 30, name='adults')
df.select(df.salary > 100000, name='high_earners')

# Use selection in operations
mean_age_adults = df.mean(df.age, selection='adults')
count_high_earners = df.count(selection='high_earners')

# Combine selections
df.select((df.age > 30) & (df.salary > 100000), name='adult_high_earners')

# Check whether a named selection exists
print(df.has_selection('adults'))

# Clear a selection
df.select_nothing(name='adults')
```
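Because selections are just named masks over the same data, a single aggregation pass can evaluate several of them at once by passing a list to `selection=`, with `None` standing for the unfiltered data. A short sketch, assuming the `adults` and `high_earners` selections defined above:

```python
# One pass over the data, three results:
# [overall mean, mean over 'adults', mean over 'high_earners']
means = df.mean(df.age, selection=[None, 'adults', 'high_earners'])
print(means)
```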
### Advanced Filtering

```python
# String matching
df_filtered = df[df.name.str.contains('John')]
df_filtered = df[df.name.str.startswith('A')]
df_filtered = df[df.email.str.endswith('@gmail.com')]

# Null/missing value filtering
df_filtered = df[df.age.isna()]   # Keep only missing
df_filtered = df[df.age.notna()]  # Remove missing

# Value membership
df_filtered = df[df.category.isin(['A', 'B', 'C'])]

# Range filtering
df_filtered = df[(df.age >= 25) & (df.age <= 65)]
```
## Virtual Columns and Expressions

Virtual columns are computed on-the-fly with zero memory overhead:

### Creating Virtual Columns

```python
# Arithmetic operations
df['total'] = df.price * df.quantity
df['price_squared'] = df.price ** 2

# Mathematical functions
df['log_price'] = df.price.log()
df['sqrt_value'] = df.value.sqrt()
df['abs_diff'] = (df.x - df.y).abs()

# Conditional logic
df['is_adult'] = df.age >= 18
df['category'] = (df.score > 80).where('A', 'B')  # If-then-else
```
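A virtual column stores only its expression, not data. If one becomes a hotspot because it is re-evaluated in many passes, `materialize()` trades memory for speed by evaluating it once into a real in-memory array. A sketch, assuming the `total` column defined above:

```python
# The virtual column is stored as an expression string, not data
print(df['total'].expression)  # e.g. '(price * quantity)'

# Evaluate once and keep the result as a real column
df = df.materialize('total')
```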
### Expression Methods

```python
# Mathematical
df.x.abs()    # Absolute value
df.x.sqrt()   # Square root
df.x.log()    # Natural log
df.x.log10()  # Base-10 log
df.x.exp()    # Exponential

# Trigonometric
df.angle.sin()
df.angle.cos()
df.angle.tan()
df.angle.arcsin()

# Rounding
df.x.round(2)  # Round to 2 decimals
df.x.floor()   # Round down
df.x.ceil()    # Round up

# Type conversion
df.x.astype('int64')
df.x.astype('float32')
df.x.astype('str')
```
### Conditional Expressions

```python
# where() method: condition.where(true_value, false_value)
df['status'] = (df.age >= 18).where('adult', 'minor')

# Multiple conditions with nested where
df['grade'] = (df.score >= 90).where('A',
              (df.score >= 80).where('B',
              (df.score >= 70).where('C', 'F')))

# For many ordered thresholds, digitize() returns bin indices
# (see "Binning and Discretization" below), which can then be
# mapped to labels with nested where() as above
bins = [0, 18, 65, 100]
df['age_bin'] = df.age.digitize(bins)
```
## String Operations

Access string methods via the `.str` accessor:

### Basic String Methods

```python
# Case conversion
df['upper_name'] = df.name.str.upper()
df['lower_name'] = df.name.str.lower()
df['title_name'] = df.name.str.title()

# Trimming
df['trimmed'] = df.text.str.strip()
df['ltrimmed'] = df.text.str.lstrip()
df['rtrimmed'] = df.text.str.rstrip()

# Searching
df['has_john'] = df.name.str.contains('John')
df['starts_with_a'] = df.name.str.startswith('A')
df['ends_with_com'] = df.email.str.endswith('.com')

# Slicing
df['first_char'] = df.name.str.slice(0, 1)
df['last_three'] = df.name.str.slice(-3, None)

# Length
df['name_length'] = df.name.str.len()
```
### Advanced String Operations

```python
# Replacing
df['clean_text'] = df.text.str.replace('bad', 'good')

# Splitting (returns first part)
df['first_name'] = df.full_name.str.split(' ')[0]

# Concatenation
df['full_name'] = df.first_name + ' ' + df.last_name

# Padding
df['padded'] = df.code.str.pad(10, side='left', fillchar='0')  # Zero-padding
```
## DateTime Operations

Access datetime methods via the `.dt` accessor:

### DateTime Properties

```python
import pandas as pd

# Parsing strings to datetime
df['date_parsed'] = df.date_string.astype('datetime64')

# Extracting components
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
df['minute'] = df.timestamp.dt.minute
df['second'] = df.timestamp.dt.second

# Day of week
df['weekday'] = df.timestamp.dt.dayofweek  # 0=Monday
df['day_name'] = df.timestamp.dt.day_name  # 'Monday', 'Tuesday', ...

# Date arithmetic
df['tomorrow'] = df.date + pd.Timedelta(days=1)
df['next_week'] = df.date + pd.Timedelta(weeks=1)
```
## Aggregations

Vaex performs aggregations efficiently across billions of rows:

### Basic Aggregations

```python
# Single column
mean_age = df.age.mean()
std_age = df.age.std()
min_age = df.age.min()
max_age = df.age.max()
sum_sales = df.sales.sum()
count_rows = df.count()

# With selections
mean_adult_age = df.age.mean(selection='adults')

# Multiple at once with delay: each call returns a promise,
# and df.execute() computes them all in a single pass
mean_promise = df.age.mean(delay=True)
std_promise = df.age.std(delay=True)
df.execute()
mean, std = mean_promise.get(), std_promise.get()
```
### Available Aggregation Functions

```python
# Central tendency
df.x.mean()
df.x.median_approx()  # Approximate median (fast)

# Dispersion
df.x.std()     # Standard deviation
df.x.var()     # Variance
df.x.min()
df.x.max()
df.x.minmax()  # Both min and max in one pass

# Count
df.count()    # Total rows
df.x.count()  # Non-missing values

# Sum and product
df.x.sum()
df.x.prod()

# Percentiles (approximate, computed from a binned histogram)
df.percentile_approx(df.x, 50)  # Median
df.percentile_approx(df.x, 25)  # First quartile

# Correlation
df.correlation(df.x, df.y)
df.covar(df.x, df.y)

# Higher moments
df.x.kurtosis()
df.x.skew()

# Unique values
df.x.nunique()  # Count unique
df.x.unique()   # Get unique values (returns array)
```
## GroupBy Operations

Group data and compute aggregations per group:

### Basic GroupBy

```python
# Single column groupby
grouped = df.groupby('category')

# Aggregation
result = grouped.agg({'sales': 'sum'})
result = grouped.agg({'sales': 'sum', 'quantity': 'mean'})

# Multiple aggregations on same column
result = grouped.agg({
    'sales': ['sum', 'mean', 'std'],
    'quantity': 'sum'
})
```
### Advanced GroupBy

```python
# Multiple grouping columns
result = df.groupby(['category', 'region']).agg({
    'sales': 'sum',
    'quantity': 'mean'
})

# Custom aggregations via vaex.agg aggregator objects,
# which can be combined arithmetically
result = df.groupby('category').agg({
    'sales_range': vaex.agg.max('sales') - vaex.agg.min('sales')
})

# Available aggregation functions (as strings):
# 'sum', 'mean', 'std', 'min', 'max', 'count', 'first', 'last'
```
### GroupBy with Binning

```python
# Bin a continuous variable first, then aggregate per bin
df['value_bin'] = df.value.digitize([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
result = df.groupby('value_bin').agg({
    'sales': 'sum'
})

# Datetime binning
result = df.groupby(df.timestamp.dt.year).agg({
    'sales': 'sum'
})
```
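For time-series resampling, Vaex also provides `vaex.BinnerTime`, which bins a datetime expression into regular intervals for use as a groupby key. A sketch, assuming a datetime column `timestamp`:

```python
# Weekly bins over the timestamp column
result = df.groupby(vaex.BinnerTime.per_week(df.timestamp)).agg({
    'sales': 'sum'
})
```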
## Binning and Discretization

Create bins from continuous variables:

### Simple Binning

```python
# Create bins
df['age_bin'] = df.age.digitize([18, 30, 50, 65, 100])

# Labeled bins
bins = [0, 18, 30, 50, 65, 100]
labels = ['child', 'young_adult', 'adult', 'middle_age', 'senior']
df['age_group'] = df.age.digitize(bins)
# Note: apply the labels using where() or a mapping
```
### Statistical Binning

```python
import numpy as np

# Equal-width bins
edges = np.linspace(df.value.min(), df.value.max(), 11)  # 10 bins
df['value_bin'] = df.value.digitize(edges)

# Quantile-based bins (approximate)
quantiles = [df.percentile_approx(df.value, p) for p in (25, 50, 75)]
df['value_quartile'] = df.value.digitize(quantiles)
```
## Multi-dimensional Aggregations

Compute statistics on grids:

```python
# 2D histogram/heatmap data
counts = df.count(binby=[df.x, df.y], limits=[[0, 10], [0, 10]], shape=(100, 100))

# Mean on a grid
mean_z = df.mean(df.z, binby=[df.x, df.y], limits=[[0, 10], [0, 10]], shape=(50, 50))

# Multiple statistics on a grid, computed in one pass
mean_promise = df.mean(df.z, binby=[df.x, df.y], shape=(50, 50), delay=True)
count_promise = df.count(binby=[df.x, df.y], shape=(50, 50), delay=True)
df.execute()
mean_grid, count_grid = mean_promise.get(), count_promise.get()
```
## Handling Missing Data

Work with missing, null, and NaN values:

### Detecting Missing Data

```python
# Check for missing
df['age_missing'] = df.age.isna()
df['age_present'] = df.age.notna()

# Count missing
missing_count = df.age.isna().sum()
missing_pct = df.age.isna().mean() * 100
```
### Handling Missing Data

```python
# Filter out missing
df_clean = df[df.age.notna()]

# Fill missing with a value
df['age_filled'] = df.age.fillna(0)
df['age_filled'] = df.age.fillna(df.age.mean())

# Fill only one kind of missing value
df['age_nan_filled'] = df.age.fillnan(0)     # only NaN
df['age_na_filled'] = df.age.fillmissing(0)  # only masked/NA values
```
### Missing Data Types in Vaex

Vaex distinguishes between:

- **NaN** - IEEE floating point Not-a-Number
- **NA** - a null value (numpy masked arrays or Arrow nulls)
- **Missing** - General term for absent data (NaN or NA)

```python
# Check the representation
df.is_masked('column_name')  # True if the column uses a masked (NA) representation

# Convert to the masked (NA) representation
df['col_masked'] = df.col.as_masked()
```
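The distinction matters when counting: `isna()` treats both kinds as missing, while `isnan()` and `ismissing()` test each kind separately. A small sketch with hypothetical data:

```python
import numpy as np
import vaex

# One NaN float and one masked (NA) value
x = np.ma.MaskedArray([1.0, np.nan, 3.0], mask=[False, False, True])
df = vaex.from_arrays(x=x)

print(df.x.isnan().sum())      # 1 -- only the NaN
print(df.x.ismissing().sum())  # 1 -- only the masked value
print(df.x.isna().sum())       # 2 -- both
```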
## Sorting

```python
# Sort by single column
df_sorted = df.sort('age')
df_sorted = df.sort('age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort(['category', 'age'])

# Note: sorting materializes an index array, which is expensive
# on very large datasets -- consider whether sorting is necessary
```
## Joining DataFrames

Combine DataFrames based on keys:

```python
# Left join (the default in Vaex)
df_joined = df1.join(df2, on='key_column')

# Inner join
df_joined = df1.join(df2, on='key_column', how='inner')

# Join on different column names
df_joined = df1.join(
    df2,
    left_on='id',
    right_on='user_id',
    how='left'
)

# Joining on multiple keys: combine them into a single key column
df1['join_key'] = df1.key1.astype('str') + '_' + df1.key2.astype('str')
df2['join_key'] = df2.key1.astype('str') + '_' + df2.key2.astype('str')
df_joined = df1.join(df2, on='join_key')
```
## Adding and Removing Columns

### Adding Columns

```python
import numpy as np
import vaex

# Virtual column (no memory)
df['new_col'] = df.x + df.y

# From external array (must match length)
new_data = np.random.rand(len(df))
df['random'] = new_data

# Constant value (as a virtual column)
df['constant'] = vaex.vconstant(42, len(df))
```

### Removing Columns

```python
# Drop single column
df = df.drop('column_name')

# Drop multiple columns
df = df.drop(['col1', 'col2', 'col3'])

# Select specific columns (drop others)
df = df[['col1', 'col2', 'col3']]
```
### Renaming Columns

```python
# Rename single column (in place)
df.rename('old_name', 'new_name')

# Rename multiple columns
for old, new in {'old_name1': 'new_name1',
                 'old_name2': 'new_name2'}.items():
    df.rename(old, new)
```
## Common Patterns

### Pattern: Complex Feature Engineering

```python
# Multiple derived features
df['log_price'] = df.price.log()
df['price_per_unit'] = df.price / df.quantity
df['is_discount'] = df.discount > 0
df['price_category'] = (df.price > 100).where('expensive', 'affordable')
df['revenue'] = df.price * df.quantity * (1 - df.discount)
```

### Pattern: Text Cleaning

```python
# Clean and standardize text
df['email_clean'] = df.email.str.lower().str.strip()
df['has_valid_email'] = df.email_clean.str.contains('@')
df['domain'] = df.email_clean.str.split('@')[1]
```

### Pattern: Time-based Analysis

```python
# Extract temporal features
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day_of_week'] = df.timestamp.dt.dayofweek
df['is_weekend'] = df.day_of_week >= 5
df['quarter'] = ((df.month - 1) // 3) + 1
```
### Pattern: Grouped Statistics

```python
# Compute statistics by group
monthly_sales = df.groupby(df.timestamp.dt.month).agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum'
})

# Multiple grouping levels
category_region_sales = df.groupby(['category', 'region']).agg({
    'sales': 'sum',
    'profit': 'mean'
})
```
## Performance Tips

1. **Use virtual columns** - They're computed on-the-fly with no memory cost
2. **Batch operations with delay=True** - Compute multiple aggregations in a single pass over the data (see the sketch after this list)
3. **Avoid `.values` or `.to_pandas_df()`** - Keep operations lazy when possible
4. **Use selections** - Multiple named selections are more efficient than creating new DataFrames
5. **Leverage expressions** - They enable query optimization
6. **Minimize sorting** - Sorting is expensive on large datasets
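A minimal sketch of tip 2, assuming a DataFrame `df` with numeric columns `x` and `y`: each delayed call returns a promise, and `df.execute()` resolves all of them while reading the data only once.

```python
# Queue several aggregations without computing anything yet
sum_x = df.x.sum(delay=True)
mean_y = df.y.mean(delay=True)
minmax_x = df.x.minmax(delay=True)

# One pass over the data computes all three
df.execute()

print(sum_x.get(), mean_y.get(), minmax_x.get())
```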
## Related Resources

- For DataFrame creation: See `core_dataframes.md`
- For performance optimization: See `performance.md`
- For visualization: See `visualization.md`
- For ML pipelines: See `machine_learning.md`