Data Processing and Manipulation
This reference covers filtering, selections, virtual columns, expressions, aggregations, groupby operations, and data transformations in Vaex.
Filtering and Selections
Vaex uses boolean expressions to filter data efficiently without copying:
Basic Filtering
# Simple filter
df_filtered = df[df.age > 25]
# Multiple conditions
df_filtered = df[(df.age > 25) & (df.salary > 50000)]
df_filtered = df[(df.category == 'A') | (df.category == 'B')]
# Negation
df_filtered = df[~(df.age < 18)]
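A minimal runnable sketch (using the built-in demo dataset from vaex.example(), assumed here purely for illustration) shows that filtering tracks a row mask rather than copying rows:
import vaex

df = vaex.example()          # small built-in demo dataset
df_pos = df[df.x > 0]        # no data is copied; a boolean mask is tracked
print(len(df), len(df_pos))  # row counts before and after filtering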
Selection Objects
Vaex can maintain multiple named selections simultaneously:
# Create named selection
df.select(df.age > 30, name='adults')
df.select(df.salary > 100000, name='high_earners')
# Use selection in operations
mean_age_adults = df.mean(df.age, selection='adults')
count_high_earners = df.count(selection='high_earners')
# Combine selections
df.select((df.age > 30) & (df.salary > 100000), name='adult_high_earners')
# List all selections
print(df.selection_names())
# Clear a selection
df.select_nothing('adults')
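As a runnable sketch (again on the vaex.example() demo dataset), named selections let one DataFrame serve several filtered statistics without creating new DataFrames:
import vaex

df = vaex.example()
df.select(df.x > 0, name='right')
df.select(df.y > 0, name='top')
print(df.mean(df.z, selection='right'))  # mean of z where x > 0
print(df.count(selection='top'))         # row count where y > 0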
Advanced Filtering
# String matching
df_filtered = df[df.name.str.contains('John')]
df_filtered = df[df.name.str.startswith('A')]
df_filtered = df[df.email.str.endswith('@gmail.com')]
# Null/missing value filtering
df_filtered = df[df.age.isna()]   # Keep rows where age is missing
df_filtered = df[df.age.notna()]  # Keep rows where age is present
# Value membership
df_filtered = df[df.category.isin(['A', 'B', 'C'])]
# Range filtering
df_filtered = df[(df.age >= 25) & (df.age <= 65)]
Virtual Columns and Expressions
Virtual columns are computed on-the-fly with zero memory overhead:
Creating Virtual Columns
# Arithmetic operations
df['total'] = df.price * df.quantity
df['price_squared'] = df.price ** 2
# Mathematical functions
df['log_price'] = df.price.log()
df['sqrt_value'] = df.value.sqrt()
df['abs_diff'] = (df.x - df.y).abs()
# Conditional logic
df['is_adult'] = df.age >= 18
df['category'] = df.func.where(df.score > 80, 'A', 'B')  # If-then-else
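To confirm that a virtual column stores an expression rather than an array, inspect df.virtual_columns (a sketch on the vaex.example() dataset):
import vaex

df = vaex.example()
df['r'] = (df.x**2 + df.y**2)**0.5  # stored as an expression, not data
print(df.virtual_columns)           # maps 'r' to its expression string
print(df.r.mean())                  # evaluated on the fly when needed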
Expression Methods
# Mathematical
df.x.abs() # Absolute value
df.x.sqrt() # Square root
df.x.log() # Natural log
df.x.log10() # Base-10 log
df.x.exp() # Exponential
# Trigonometric
df.angle.sin()
df.angle.cos()
df.angle.tan()
df.angle.arcsin()
# Rounding
df.x.round(2) # Round to 2 decimals
df.x.floor() # Round down
df.x.ceil() # Round up
# Type conversion
df.x.astype('int64')
df.x.astype('float32')
df.x.astype('str')
Conditional Expressions
# df.func.where(condition, value_if_true, value_if_false)
df['status'] = df.func.where(df.age >= 18, 'adult', 'minor')
# Multiple conditions with nested where
df['grade'] = df.func.where(df.score >= 90, 'A',
              df.func.where(df.score >= 80, 'B',
              df.func.where(df.score >= 70, 'C', 'F')))
# For many-way binning, digitize is usually simpler
bins = [0, 18, 65, 100]
df['age_group_idx'] = df.age.digitize(bins)  # bin indices; see "Binning and Discretization" below
String Operations
Access string methods via the .str accessor:
Basic String Methods
# Case conversion
df['upper_name'] = df.name.str.upper()
df['lower_name'] = df.name.str.lower()
df['title_name'] = df.name.str.title()
# Trimming
df['trimmed'] = df.text.str.strip()
df['ltrimmed'] = df.text.str.lstrip()
df['rtrimmed'] = df.text.str.rstrip()
# Searching
df['has_john'] = df.name.str.contains('John')
df['starts_with_a'] = df.name.str.startswith('A')
df['ends_with_com'] = df.email.str.endswith('.com')
# Slicing
df['first_char'] = df.name.str.slice(0, 1)
df['last_three'] = df.name.str.slice(-3, None)
# Length
df['name_length'] = df.name.str.len()
Advanced String Operations
# Replacing
df['clean_text'] = df.text.str.replace('bad', 'good')
# Splitting returns a list column; to keep just the first part,
# a regex replace avoids list-element access
df['first_name'] = df.full_name.str.replace(r' .*$', '', regex=True)
# Concatenation
df['full_name'] = df.first_name + ' ' + df.last_name
# Padding
df['padded'] = df.code.str.pad(10, fillchar='0', side='left')  # Zero-padding
DateTime Operations
Access datetime methods via the .dt accessor:
DateTime Properties
# Parsing strings to datetime
df['date_parsed'] = df.date_string.astype('datetime64')
# Extracting components
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
df['minute'] = df.timestamp.dt.minute
df['second'] = df.timestamp.dt.second
# Day of week
df['weekday'] = df.timestamp.dt.dayofweek # 0=Monday
df['day_name'] = df.timestamp.dt.day_name # 'Monday', 'Tuesday', ...
# Date arithmetic (NumPy timedeltas work on datetime columns)
import numpy as np
df['tomorrow'] = df.date + np.timedelta64(1, 'D')
df['next_week'] = df.date + np.timedelta64(1, 'W')
Aggregations
Vaex performs aggregations efficiently across billions of rows:
Basic Aggregations
# Single column
mean_age = df.age.mean()
std_age = df.age.std()
min_age = df.age.min()
max_age = df.age.max()
sum_sales = df.sales.sum()
count_rows = df.count()
# With selections
mean_adult_age = df.age.mean(selection='adults')
# Multiple at once with delay (computed in a single pass)
mean = df.age.mean(delay=True)
std = df.age.std(delay=True)
df.execute()
print(mean.get(), std.get())
Available Aggregation Functions
# Central tendency
df.x.mean()
df.x.median_approx() # Approximate median (fast)
# Dispersion
df.x.std() # Standard deviation
df.x.var() # Variance
df.x.min()
df.x.max()
df.x.minmax() # Both min and max
# Count
df.count() # Total rows
df.x.count() # Non-missing values
# Sum and product
df.x.sum()
df.x.prod()
# Percentiles (approximate)
df.percentile_approx(df.x, 50)  # Median
df.percentile_approx(df.x, 25)  # First quartile
df.percentile_approx(df.x, 75)  # Third quartile
# Correlation
df.correlation(df.x, df.y)
df.covar(df.x, df.y)
# Higher moments
df.x.kurtosis()
df.x.skew()
# Unique values
df.x.nunique() # Count unique
df.x.unique() # Get unique values (returns array)
GroupBy Operations
Group data and compute aggregations per group:
Basic GroupBy
# Single column groupby
grouped = df.groupby('category')
# Aggregation
result = grouped.agg({'sales': 'sum'})
result = grouped.agg({'sales': 'sum', 'quantity': 'mean'})
# Multiple aggregations on same column
result = grouped.agg({
'sales': ['sum', 'mean', 'std'],
'quantity': 'sum'
})
Advanced GroupBy
# Multiple grouping columns
result = df.groupby(['category', 'region']).agg({
'sales': 'sum',
'quantity': 'mean'
})
# Aggregations take strings or vaex.agg objects; arbitrary Python
# callables (lambdas) are not supported on lazy data
result = df.groupby('category').agg({
    'sales_min': vaex.agg.min('sales'),
    'sales_max': vaex.agg.max('sales'),
})
result['sales_range'] = result.sales_max - result.sales_min
# Available aggregation functions
# 'sum', 'mean', 'std', 'min', 'max', 'count', 'first', 'last'
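A self-contained groupby sketch with toy data (the column names are illustrative):
import numpy as np
import vaex

df = vaex.from_arrays(category=np.array(['a', 'a', 'b', 'b']),
                      sales=np.array([1.0, 2.0, 5.0, 7.0]))
result = df.groupby('category').agg({'sales': 'sum'})
print(result)  # one row per category with summed sales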
GroupBy with Binning
# Bin a continuous variable via a virtual column, then aggregate
df['value_bin'] = (df.value // 10) * 10  # 10-wide bins
result = df.groupby('value_bin').agg({
    'sales': 'sum'
})
# Datetime binning
result = df.groupby(df.timestamp.dt.year).agg({
'sales': 'sum'
})
Binning and Discretization
Create bins from continuous variables:
Simple Binning
# Create bins
df['age_bin'] = df.age.digitize([18, 30, 50, 65, 100])
# Labeled bins
bins = [0, 18, 30, 50, 65, 100]
labels = ['child', 'young_adult', 'adult', 'middle_age', 'senior']
df['age_group'] = df.age.digitize(bins)
# Note: digitize returns bin indices; attach labels with map(), as sketched below
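One way to attach labels is Expression.map on the bin indices; the label mapping below is illustrative:
df['age_bin'] = df.age.digitize([18, 30, 50, 65])  # indices 0..4
df['age_group'] = df.age_bin.map({0: 'child', 1: 'young_adult',
                                  2: 'adult', 3: 'middle_age',
                                  4: 'senior'})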
Statistical Binning
# Equal-width bins
import numpy as np
vmin, vmax = df.value.minmax()
df['value_bin'] = df.value.digitize(np.linspace(vmin, vmax, 11))  # 10 bins
# Quantile-based bins (approximate percentiles)
quantiles = [df.percentile_approx(df.value, p) for p in (25, 50, 75)]
df['value_quartile'] = df.value.digitize(quantiles)
Multi-dimensional Aggregations
Compute statistics on grids:
# 2D histogram/heatmap data
counts = df.count(binby=[df.x, df.y], limits=[[0, 10], [0, 10]], shape=(100, 100))
# Mean on a grid
mean_z = df.mean(df.z, binby=[df.x, df.y], limits=[[0, 10], [0, 10]], shape=(50, 50))
# Multiple statistics on a grid, computed in one pass
mean_z = df.mean(df.z, binby=[df.x, df.y], shape=(50, 50), delay=True)
counts = df.count(binby=[df.x, df.y], shape=(50, 50), delay=True)
df.execute()
mean_grid, count_grid = mean_z.get(), counts.get()
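The grid results are plain NumPy arrays, so downstream use is ordinary NumPy (a sketch on the vaex.example() dataset):
import numpy as np
import vaex

df = vaex.example()
counts = df.count(binby=[df.x, df.y], shape=(100, 100))
print(counts.shape)  # (100, 100)
print(counts.sum())  # total rows that landed in the grid
print(np.unravel_index(counts.argmax(), counts.shape))  # densest cell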
Handling Missing Data
Work with missing, null, and NaN values:
Detecting Missing Data
# Check for missing
df['age_missing'] = df.age.isna()
df['age_present'] = df.age.notna()
# Count missing
missing_count = df.age.isna().sum()
missing_pct = df.age.isna().mean() * 100
Handling Missing Data
# Filter out missing
df_clean = df[df.age.notna()]
# Fill missing with value
df['age_filled'] = df.age.fillna(0)
df['age_filled'] = df.age.fillna(df.age.mean())
# Note: fillna() accepts a value only; pandas-style forward/backward
# fill (method='ffill'/'bfill') is not available on lazy expressions
Missing Data Types in Vaex
Vaex distinguishes between:
- NaN - IEEE floating point Not-a-Number
- NA - Null values (Arrow nulls or NumPy masked arrays)
- Missing - General term for absent data
# Check whether a column is masked (can hold NA values)
df.is_masked('column_name')
# Convert between types
df['col_masked'] = df.col.as_masked() # Convert to NA representation
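A sketch distinguishing the two at runtime: isnan() matches only float NaN, ismissing() only masked/null values, and isna() matches both:
import numpy as np
import vaex

x = np.ma.MaskedArray([1.0, np.nan, 3.0], mask=[False, False, True])
df = vaex.from_arrays(x=x)
print(df.x.isnan().sum())      # 1: the IEEE NaN
print(df.x.ismissing().sum())  # 1: the masked (NA) value
print(df.x.isna().sum())       # 2: both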
Sorting
# Sort by single column
df_sorted = df.sort('age')
df_sorted = df.sort('age', ascending=False)
# Sort by multiple columns
df_sorted = df.sort(['category', 'age'])
# Note: sorting materializes an index array over all rows;
# for very large datasets, consider whether sorting is necessary
Joining DataFrames
Combine DataFrames based on keys:
# Inner join
df_joined = df1.join(df2, on='key_column')
# Left join
df_joined = df1.join(df2, on='key_column', how='left')
# Join on different column names
df_joined = df1.join(
df2,
left_on='id',
right_on='user_id',
how='left'
)
# Multiple key columns
df_joined = df1.join(df2, on=['key1', 'key2'])
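A self-contained join sketch with toy data (the column names are illustrative):
import numpy as np
import vaex

users = vaex.from_arrays(user_id=np.array([1, 2, 3]),
                         name=np.array(['ann', 'bo', 'cy']))
orders = vaex.from_arrays(id=np.array([1, 1, 2]),
                          amount=np.array([10.0, 5.0, 8.0]))
joined = orders.join(users, left_on='id', right_on='user_id', how='left')
print(joined)  # each order row gains the matching user name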
Adding and Removing Columns
Adding Columns
# Virtual column (no memory)
df['new_col'] = df.x + df.y
# From external array (must match length)
import numpy as np
new_data = np.random.rand(len(df))
df['random'] = new_data
# Constant value (virtual, no per-row memory)
df['constant'] = vaex.vconstant(42, len(df))
Removing Columns
# Drop single column
df = df.drop('column_name')
# Drop multiple columns
df = df.drop(['col1', 'col2', 'col3'])
# Select specific columns (drop others)
df = df[['col1', 'col2', 'col3']]
Renaming Columns
# Rename single column (in place)
df.rename('old_name', 'new_name')
# Rename multiple columns (one call per column)
for old, new in [('old_name1', 'new_name1'),
                 ('old_name2', 'new_name2')]:
    df.rename(old, new)
Common Patterns
Pattern: Complex Feature Engineering
# Multiple derived features
df['log_price'] = df.price.log()
df['price_per_unit'] = df.price / df.quantity
df['is_discount'] = df.discount > 0
df['price_category'] = df.func.where(df.price > 100, 'expensive', 'affordable')
df['revenue'] = df.price * df.quantity * (1 - df.discount)
Pattern: Text Cleaning
# Clean and standardize text
df['email_clean'] = df.email.str.lower().str.strip()
df['has_valid_email'] = df.email_clean.str.contains('@')
df['domain'] = df.email_clean.str.replace(r'^.*@', '', regex=True)  # part after '@'
Pattern: Time-based Analysis
# Extract temporal features
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day_of_week'] = df.timestamp.dt.dayofweek
df['is_weekend'] = df.day_of_week >= 5
df['quarter'] = ((df.month - 1) // 3) + 1
Pattern: Grouped Statistics
# Compute statistics by group
monthly_sales = df.groupby(df.timestamp.dt.month).agg({
'revenue': ['sum', 'mean', 'count'],
'quantity': 'sum'
})
# Multiple grouping levels
category_region_sales = df.groupby(['category', 'region']).agg({
'sales': 'sum',
'profit': 'mean'
})
Performance Tips
- Use virtual columns - They're computed on-the-fly with no memory cost
- Batch operations with delay=True - Compute multiple aggregations in one pass (see the sketch after this list)
- Avoid .values or .to_pandas_df() - Keep operations lazy when possible
- Use selections - Multiple named selections are more efficient than creating new DataFrames
- Leverage expressions - They enable query optimization
- Minimize sorting - Sorting is expensive on large datasets
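A sketch of the delay=True tip on the vaex.example() dataset; all three statistics are computed in a single pass over the data:
import vaex

df = vaex.example()
mean_x = df.mean(df.x, delay=True)
std_x = df.std(df.x, delay=True)
total = df.count(delay=True)
df.execute()  # one pass computes all three
print(mean_x.get(), std_x.get(), total.get())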
Related Resources
- For DataFrame creation: See core_dataframes.md
- For performance optimization: See performance.md
- For visualization: See visualization.md
- For ML pipelines: See machine_learning.md