# Data Processing and Manipulation

This reference covers filtering, selections, virtual columns, expressions, aggregations, groupby operations, and data transformations in Vaex.

## Filtering and Selections

Vaex uses boolean expressions to filter data efficiently without copying:

### Basic Filtering

```python
# Simple filter
df_filtered = df[df.age > 25]

# Multiple conditions
df_filtered = df[(df.age > 25) & (df.salary > 50000)]
df_filtered = df[(df.category == 'A') | (df.category == 'B')]

# Negation
df_filtered = df[~(df.age < 18)]
```

### Selection Objects

Vaex can maintain multiple named selections simultaneously:

```python
# Create named selections
df.select(df.age > 30, name='adults')
df.select(df.salary > 100000, name='high_earners')

# Use a selection in operations
mean_age_adults = df.mean(df.age, selection='adults')
count_high_earners = df.count(selection='high_earners')

# Combine selections
df.select((df.age > 30) & (df.salary > 100000), name='adult_high_earners')

# List all selections
print(df.selection_names())

# Clear a selection
df.select_nothing(name='adults')
```

### Advanced Filtering

```python
# String matching
df_filtered = df[df.name.str.contains('John')]
df_filtered = df[df.name.str.startswith('A')]
df_filtered = df[df.email.str.endswith('@gmail.com')]

# Null/missing value filtering
df_filtered = df[df.age.isna()]   # Keep missing
df_filtered = df[df.age.notna()]  # Remove missing

# Value membership
df_filtered = df[df.category.isin(['A', 'B', 'C'])]

# Range filtering
df_filtered = df[df.age.between(25, 65)]
```

## Virtual Columns and Expressions

Virtual columns are computed on-the-fly with zero memory overhead:

### Creating Virtual Columns

```python
# Arithmetic operations
df['total'] = df.price * df.quantity
df['price_squared'] = df.price ** 2

# Mathematical functions
df['log_price'] = df.price.log()
df['sqrt_value'] = df.value.sqrt()
df['abs_diff'] = (df.x - df.y).abs()

# Conditional logic
df['is_adult'] = df.age >= 18
df['category'] = (df.score > 80).where('A', 'B')  # if-then-else
```

### Expression Methods

```python
# Mathematical
df.x.abs()    # Absolute value
df.x.sqrt()   # Square root
df.x.log()    # Natural log
df.x.log10()  # Base-10 log
df.x.exp()    # Exponential

# Trigonometric
df.angle.sin()
df.angle.cos()
df.angle.tan()
df.angle.arcsin()

# Rounding
df.x.round(2)  # Round to 2 decimals
df.x.floor()   # Round down
df.x.ceil()    # Round up

# Type conversion
df.x.astype('int64')
df.x.astype('float32')
df.x.astype('str')
```

### Conditional Expressions

```python
# where() method: condition.where(true_value, false_value)
df['status'] = (df.age >= 18).where('adult', 'minor')

# Multiple conditions with nested where
df['grade'] = (df.score >= 90).where('A',
              (df.score >= 80).where('B',
              (df.score >= 70).where('C', 'F')))

# Binning with searchsorted: bin indices, then map to labels
bins = [18, 65]
df['age_bin'] = df.age.searchsorted(bins)  # 0, 1, or 2
df['age_group'] = df.age_bin.map({0: 'minor', 1: 'adult', 2: 'senior'})
```
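The snippets above assume an existing DataFrame `df`. As a minimal sketch, a small in-memory frame can be built with `vaex.from_arrays` to experiment with the filters and virtual columns (the column names and values here are illustrative only):

```python
import numpy as np
import vaex

# Hypothetical demo data matching the column names used above
df = vaex.from_arrays(
    age=np.array([15, 32, 47, 68]),
    salary=np.array([0.0, 60_000.0, 120_000.0, 45_000.0]),
    category=np.array(['A', 'B', 'A', 'C']),
)

# Two rows satisfy both conditions
print(df[(df.age > 25) & (df.salary > 50000)])
```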
## String Operations

Access string methods via the `.str` accessor:

### Basic String Methods

```python
# Case conversion
df['upper_name'] = df.name.str.upper()
df['lower_name'] = df.name.str.lower()
df['title_name'] = df.name.str.title()

# Trimming
df['trimmed'] = df.text.str.strip()
df['ltrimmed'] = df.text.str.lstrip()
df['rtrimmed'] = df.text.str.rstrip()

# Searching
df['has_john'] = df.name.str.contains('John')
df['starts_with_a'] = df.name.str.startswith('A')
df['ends_with_com'] = df.email.str.endswith('.com')

# Slicing
df['first_char'] = df.name.str.slice(0, 1)
df['last_three'] = df.name.str.slice(-3, None)

# Length
df['name_length'] = df.name.str.len()
```

### Advanced String Operations

```python
# Replacing
df['clean_text'] = df.text.str.replace('bad', 'good')

# Splitting (take the first part)
df['first_name'] = df.full_name.str.split(' ')[0]

# Concatenation
df['full_name'] = df.first_name + ' ' + df.last_name

# Padding
df['padded'] = df.code.str.pad(10, side='left', fillchar='0')  # zero-padding
```

## DateTime Operations

Access datetime methods via the `.dt` accessor:

### DateTime Properties

```python
import pandas as pd  # for Timedelta arithmetic below

# Parsing strings to datetime
df['date_parsed'] = df.date_string.astype('datetime64')

# Extracting components
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
df['minute'] = df.timestamp.dt.minute
df['second'] = df.timestamp.dt.second

# Day of week
df['weekday'] = df.timestamp.dt.dayofweek  # 0 = Monday
df['day_name'] = df.timestamp.dt.day_name  # 'Monday', 'Tuesday', ...

# Date arithmetic
df['tomorrow'] = df.date + pd.Timedelta(days=1)
df['next_week'] = df.date + pd.Timedelta(weeks=1)
```

## Aggregations

Vaex performs aggregations efficiently across billions of rows:

### Basic Aggregations

```python
# Single column
mean_age = df.age.mean()
std_age = df.age.std()
min_age = df.age.min()
max_age = df.age.max()
sum_sales = df.sales.sum()
count_rows = df.count()

# With selections
mean_adult_age = df.age.mean(selection='adults')

# Multiple at once with delay: each call returns a promise,
# df.execute() computes all pending tasks in a single pass
mean = df.age.mean(delay=True)
std = df.age.std(delay=True)
df.execute()
print(mean.get(), std.get())
```

### Available Aggregation Functions

```python
# Central tendency
df.x.mean()
df.x.median_approx()  # Approximate median (fast)

# Dispersion
df.x.std()     # Standard deviation
df.x.var()     # Variance
df.x.min()
df.x.max()
df.x.minmax()  # Both min and max

# Count
df.count()    # Total rows
df.x.count()  # Non-missing values

# Sum and product
df.x.sum()
df.x.prod()

# Percentiles (approximate)
df.percentile_approx(df.x, percentage=50)  # Median
df.percentile_approx(df.x, percentage=75)  # 75th percentile

# Correlation
df.correlation(df.x, df.y)
df.covar(df.x, df.y)

# Higher moments
df.x.kurtosis()
df.x.skew()

# Unique values
df.x.nunique()  # Count unique
df.x.unique()   # Get unique values (returns array)
```

## GroupBy Operations

Group data and compute aggregations per group:

### Basic GroupBy

```python
# Single column groupby
grouped = df.groupby('category')

# Aggregation
result = grouped.agg({'sales': 'sum'})
result = grouped.agg({'sales': 'sum', 'quantity': 'mean'})

# Multiple aggregations on the same column
result = grouped.agg({
    'sales': ['sum', 'mean', 'std'],
    'quantity': 'sum'
})
```

### Advanced GroupBy

```python
# Multiple grouping columns
result = df.groupby(['category', 'region']).agg({
    'sales': 'sum',
    'quantity': 'mean'
})

# Custom aggregations: agg does not accept arbitrary lambdas;
# compute the pieces, then derive the result
result = df.groupby('category').agg({'sales': ['min', 'max']})
result['sales_range'] = result.sales_max - result.sales_min

# Available aggregation shorthands:
# 'sum', 'mean', 'std', 'min', 'max', 'count', 'first', 'last'
```
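Besides string shorthands, aggregations can be spelled as objects from the `vaex.agg` module, which also lets you name the output columns. A minimal sketch, assuming a DataFrame with `category` and `sales` columns:

```python
import vaex.agg

# Named aggregations via vaex.agg aggregator objects
result = df.groupby('category', agg={
    'total_sales': vaex.agg.sum('sales'),
    'mean_sales': vaex.agg.mean('sales'),
    'n_rows': vaex.agg.count(),
})
```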
### GroupBy with Binning

```python
# Bin a continuous variable into a virtual column, then aggregate
df['sales_bin'] = (df.sales // 10) * 10  # width-10 bins
result = df.groupby('sales_bin').agg({'sales': 'sum'})

# Datetime binning
result = df.groupby(df.timestamp.dt.year).agg({'sales': 'sum'})
```

## Binning and Discretization

Create bins from continuous variables:

### Simple Binning

```python
# Create bins
df['age_bin'] = df.age.digitize([18, 30, 50, 65, 100])

# Labeled bins
bins = [0, 18, 30, 50, 65, 100]
labels = ['child', 'young_adult', 'adult', 'middle_age', 'senior']
df['age_group'] = df.age.digitize(bins)
# Note: apply labels afterwards, e.g. with map() or where()
```

### Statistical Binning

```python
import numpy as np

# Equal-width bins: compute the bin edges with numpy
edges = np.linspace(df.value.min(), df.value.max(), 11)  # 10 equal-width bins
df['value_bin'] = df.value.digitize(edges)

# Quantile-based bins
quantiles = [df.percentile_approx(df.value, percentage=p) for p in (25, 50, 75)]
df['value_quartile'] = df.value.digitize(quantiles)
```

## Multi-dimensional Aggregations

Compute statistics on grids:

```python
# 2D histogram/heatmap data
counts = df.count(binby=[df.x, df.y],
                  limits=[[0, 10], [0, 10]],
                  shape=(100, 100))

# Mean on a grid
mean_z = df.mean(df.z, binby=[df.x, df.y],
                 limits=[[0, 10], [0, 10]],
                 shape=(50, 50))

# Multiple statistics on a grid, computed in one pass
stats = df.mean(df.z, binby=[df.x, df.y], shape=(50, 50), delay=True)
counts = df.count(binby=[df.x, df.y], shape=(50, 50), delay=True)
df.execute()
grid_means, grid_counts = stats.get(), counts.get()
```

## Handling Missing Data

Work with missing, null, and NaN values:

### Detecting Missing Data

```python
# Check for missing
df['age_missing'] = df.age.isna()
df['age_present'] = df.age.notna()

# Count missing
missing_count = df.age.isna().sum()
missing_pct = df.age.isna().mean() * 100
```

### Filtering and Filling Missing Data

```python
# Filter out missing
df_clean = df[df.age.notna()]

# Fill missing with a value
df['age_filled'] = df.age.fillna(0)
df['age_filled'] = df.age.fillna(df.age.mean())
# Note: fillna fills with a scalar value; pandas-style forward/backward
# filling (method='ffill'/'bfill') is not available in Vaex
```

### Missing Data Types in Vaex

Vaex distinguishes between:

- **NaN** - IEEE floating point Not-a-Number
- **NA** - A true missing value (Arrow null or a numpy masked entry)
- **Missing** - General term for absent data (NA or NaN)

```python
# Check whether a column carries a mask for missing values
df.is_masked('column_name')

# Convert a column to a masked (NA) representation
df['col_masked'] = df.col.as_masked()
```

## Sorting

```python
# Sort by single column
df_sorted = df.sort('age')
df_sorted = df.sort('age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort(['category', 'age'])

# Note: sorting materializes an index column
# For very large datasets, consider whether sorting is necessary
```

## Joining DataFrames

Combine DataFrames based on keys:

```python
# Left join (Vaex's default)
df_joined = df1.join(df2, on='key_column')

# Inner join
df_joined = df1.join(df2, on='key_column', how='inner')

# Join on different column names
df_joined = df1.join(
    df2,
    left_on='id',
    right_on='user_id',
    how='left'
)

# Joining on multiple keys: build a single combined key first
df1['key'] = df1.key1.astype('str') + '_' + df1.key2.astype('str')
df2['key'] = df2.key1.astype('str') + '_' + df2.key2.astype('str')
df_joined = df1.join(df2, on='key', rsuffix='_right')
```

## Adding and Removing Columns

### Adding Columns

```python
import numpy as np
import vaex

# Virtual column (no memory)
df['new_col'] = df.x + df.y

# From an external array (must match length)
new_data = np.random.rand(len(df))
df['random'] = new_data

# Constant value, as a virtual column
df['constant'] = vaex.vconstant(42, length=len(df))
```

### Removing Columns

```python
# Drop single column
df = df.drop('column_name')

# Drop multiple columns
df = df.drop(['col1', 'col2', 'col3'])

# Keep only specific columns (drop the rest)
df = df[['col1', 'col2', 'col3']]
```
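Combining the missing-data checks with `drop()`, here is a small sketch that removes columns containing no data at all; it assumes `df` is any Vaex DataFrame and relies on `expression.count()` counting non-missing values, as noted in the aggregation section:

```python
# Drop columns whose values are all missing
empty = [name for name in df.get_column_names()
         if df[name].count() == 0]  # count() ignores missing values
if empty:
    df = df.drop(empty)
```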
### Renaming Columns

```python
# Rename single column (in place; rename returns the new name,
# not a DataFrame, so don't reassign df)
df.rename('old_name', 'new_name')

# Rename multiple columns
for old, new in {'old_name1': 'new_name1',
                 'old_name2': 'new_name2'}.items():
    df.rename(old, new)
```

## Common Patterns

### Pattern: Complex Feature Engineering

```python
# Multiple derived features
df['log_price'] = df.price.log()
df['price_per_unit'] = df.price / df.quantity
df['is_discount'] = df.discount > 0
df['price_category'] = (df.price > 100).where('expensive', 'affordable')
df['revenue'] = df.price * df.quantity * (1 - df.discount)
```

### Pattern: Text Cleaning

```python
# Clean and standardize text
df['email_clean'] = df.email.str.lower().str.strip()
df['has_valid_email'] = df.email_clean.str.contains('@')
df['domain'] = df.email_clean.str.split('@')[1]
```

### Pattern: Time-based Analysis

```python
# Extract temporal features
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day_of_week'] = df.timestamp.dt.dayofweek
df['is_weekend'] = df.day_of_week >= 5
df['quarter'] = ((df.month - 1) // 3) + 1
```

### Pattern: Grouped Statistics

```python
# Compute statistics by group
monthly_sales = df.groupby(df.timestamp.dt.month).agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum'
})

# Multiple grouping levels
category_region_sales = df.groupby(['category', 'region']).agg({
    'sales': 'sum',
    'profit': 'mean'
})
```

## Performance Tips

1. **Use virtual columns** - They are computed on-the-fly with no memory cost
2. **Batch operations with `delay=True`** - Compute multiple aggregations in one pass over the data (see the sketch at the end of this page)
3. **Avoid `.values` and `.to_pandas_df()`** - Keep operations lazy when possible
4. **Use selections** - Multiple named selections are cheaper than creating new DataFrames
5. **Leverage expressions** - They enable query optimization
6. **Minimize sorting** - Sorting is expensive on large datasets

## Related Resources

- For DataFrame creation: see `core_dataframes.md`
- For performance optimization: see `performance.md`
- For visualization: see `visualization.md`
- For ML pipelines: see `machine_learning.md`
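To make tip 2 concrete, a minimal sketch of batched aggregation using Vaex's built-in example dataset: each delayed call returns a promise, `df.execute()` computes all pending tasks in a single pass over the data, and results are then read from the promises with `.get()`:

```python
import vaex

df = vaex.example()           # built-in demo dataset with a numeric column 'x'
mean = df.x.mean(delay=True)  # returns a promise, nothing computed yet
std = df.x.std(delay=True)
df.execute()                  # one pass over the data computes both
print(mean.get(), std.get())
```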