# Data Processing and Manipulation
This reference covers filtering, selections, virtual columns, expressions, aggregations, groupby operations, and data transformations in Vaex.
## Filtering and Selections
Vaex uses boolean expressions to filter data efficiently without copying:
### Basic Filtering
```python
# Simple filter
df_filtered = df[df.age > 25]
# Multiple conditions
df_filtered = df[(df.age > 25) & (df.salary > 50000)]
df_filtered = df[(df.category == 'A') | (df.category == 'B')]
# Negation
df_filtered = df[~(df.age < 18)]
```
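Filtering is lazy: the result keeps a boolean mask over the original rows rather than copying data. To drop the mask and work with a standalone subset, a minimal sketch:
```python
df_filtered = df[df.age > 25]
# extract() applies the filter, so downstream operations
# no longer carry the mask around
df_small = df_filtered.extract()
```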
### Selection Objects
Vaex can maintain multiple named selections simultaneously:
```python
# Create named selection
df.select(df.age > 30, name='adults')
df.select(df.salary > 100000, name='high_earners')
# Use selection in operations
mean_age_adults = df.mean(df.age, selection='adults')
count_high_earners = df.count(selection='high_earners')
# Combine selections
df.select((df.age > 30) & (df.salary > 100000), name='adult_high_earners')
# Clear a named selection
df.select_nothing(name='adults')
```
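Aggregations can also take several selections at once and return one result per selection; a sketch reusing the selections above (`None` means "all rows"):
```python
# One pass over the data, one result per selection
means = df.mean(df.age, selection=[None, 'adults', 'high_earners'])
print(means)  # one mean per selection
```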
### Advanced Filtering
```python
# String matching
df_filtered = df[df.name.str.contains('John')]
df_filtered = df[df.name.str.startswith('A')]
df_filtered = df[df.email.str.endswith('@gmail.com')]
# Null/missing value filtering
df_filtered = df[df.age.isna()] # Keep missing
df_filtered = df[df.age.notna()] # Remove missing
# Value membership
df_filtered = df[df.category.isin(['A', 'B', 'C'])]
# Range filtering
df_filtered = df[(df.age >= 25) & (df.age <= 65)]
```
## Virtual Columns and Expressions
Virtual columns are computed on-the-fly with zero memory overhead:
### Creating Virtual Columns
```python
# Arithmetic operations
df['total'] = df.price * df.quantity
df['price_squared'] = df.price ** 2
# Mathematical functions
df['log_price'] = df.price.log()
df['sqrt_value'] = df.value.sqrt()
df['abs_diff'] = (df.x - df.y).abs()
# Conditional logic
df['is_adult'] = df.age >= 18
df['category'] = (df.score > 80).where('A', 'B') # If-then-else
```
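To confirm a column is virtual, or to trade memory for speed once an expression is final, a short sketch:
```python
# Virtual columns are stored as expressions, not arrays
print(df.virtual_columns)  # e.g. {'total': '(price * quantity)'}
# materialize() evaluates the expression and stores the result in memory
df = df.materialize('total')
```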
### Expression Methods
```python
# Mathematical
df.x.abs() # Absolute value
df.x.sqrt() # Square root
df.x.log() # Natural log
df.x.log10() # Base-10 log
df.x.exp() # Exponential
# Trigonometric
df.angle.sin()
df.angle.cos()
df.angle.tan()
df.angle.arcsin()
# Rounding
df.x.round(2) # Round to 2 decimals
df.x.floor() # Round down
df.x.ceil() # Round up
# Type conversion
df.x.astype('int64')
df.x.astype('float32')
df.x.astype('str')
```
### Conditional Expressions
```python
# where() method: condition.where(true_value, false_value)
df['status'] = (df.age >= 18).where('adult', 'minor')
# Multiple conditions with nested where
df['grade'] = (df.score >= 90).where('A',
               (df.score >= 80).where('B',
               (df.score >= 70).where('C', 'F')))
# Binning with digitize + map (bin indices -> labels)
bins = [0, 18, 65, 100]
labels = {1: 'minor', 2: 'adult', 3: 'senior'}
df['age_group'] = df.age.digitize(bins).map(labels)
```
## String Operations
Access string methods via the `.str` accessor:
### Basic String Methods
```python
# Case conversion
df['upper_name'] = df.name.str.upper()
df['lower_name'] = df.name.str.lower()
df['title_name'] = df.name.str.title()
# Trimming
df['trimmed'] = df.text.str.strip()
df['ltrimmed'] = df.text.str.lstrip()
df['rtrimmed'] = df.text.str.rstrip()
# Searching
df['has_john'] = df.name.str.contains('John')
df['starts_with_a'] = df.name.str.startswith('A')
df['ends_with_com'] = df.email.str.endswith('.com')
# Slicing
df['first_char'] = df.name.str.slice(0, 1)
df['last_three'] = df.name.str.slice(-3, None)
# Length
df['name_length'] = df.name.str.len()
```
### Advanced String Operations
```python
# Replacing
df['clean_text'] = df.text.str.replace('bad', 'good')
# Splitting (returns first part)
df['first_name'] = df.full_name.str.split(' ')[0]
# Concatenation
df['full_name'] = df.first_name + ' ' + df.last_name
# Padding (pandas-style signature: width, side, fillchar)
df['padded'] = df.code.str.pad(10, side='left', fillchar='0') # Zero-padding
```
## DateTime Operations
Access datetime methods via the `.dt` accessor:
### DateTime Properties
```python
# Parsing strings to datetime
df['date_parsed'] = df.date_string.astype('datetime64')
# Extracting components
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
df['minute'] = df.timestamp.dt.minute
df['second'] = df.timestamp.dt.second
# Day of week
df['weekday'] = df.timestamp.dt.dayofweek # 0=Monday
df['day_name'] = df.timestamp.dt.day_name # 'Monday', 'Tuesday', ...
# Date arithmetic with numpy timedeltas
import numpy as np
df['tomorrow'] = df.date + np.timedelta64(1, 'D')
df['next_week'] = df.date + np.timedelta64(1, 'W')
```
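These components compose with ordinary filtering; for example, restricting to a date range (the dates are illustrative):
```python
import numpy as np
# Keep only rows from 2020
df_2020 = df[(df.timestamp >= np.datetime64('2020-01-01')) &
             (df.timestamp < np.datetime64('2021-01-01'))]
```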
## Aggregations
Vaex performs aggregations efficiently across billions of rows:
### Basic Aggregations
```python
# Single column
mean_age = df.age.mean()
std_age = df.age.std()
min_age = df.age.min()
max_age = df.age.max()
sum_sales = df.sales.sum()
count_rows = df.count()
# With selections
mean_adult_age = df.age.mean(selection='adults')
# Multiple at once with delay=True (single pass over the data)
mean = df.age.mean(delay=True)
std = df.age.std(delay=True)
df.execute()  # runs all delayed tasks together
print(mean.get(), std.get())
```
### Available Aggregation Functions
```python
# Central tendency
df.x.mean()
df.x.median_approx() # Approximate median (fast)
# Dispersion
df.x.std() # Standard deviation
df.x.var() # Variance
df.x.min()
df.x.max()
df.x.minmax() # Both min and max
# Count
df.count() # Total rows
df.x.count() # Non-missing values
# Sum
df.x.sum()
# Percentiles (vaex computes approximate percentiles)
df.percentile_approx(df.x, 50) # Median
df.percentile_approx(df.x, 25) # First quartile
df.percentile_approx(df.x, 75) # Third quartile
# Correlation
df.correlation(df.x, df.y)
df.covar(df.x, df.y)
# Higher moments
df.x.kurtosis()
df.x.skew()
# Unique values
df.x.nunique() # Count unique
df.x.unique() # Get unique values (returns array)
```
## GroupBy Operations
Group data and compute aggregations per group:
### Basic GroupBy
```python
# Single column groupby
grouped = df.groupby('category')
# Aggregation
result = grouped.agg({'sales': 'sum'})
result = grouped.agg({'sales': 'sum', 'quantity': 'mean'})
# Multiple aggregations on same column
result = grouped.agg({
'sales': ['sum', 'mean', 'std'],
'quantity': 'sum'
})
```
### Advanced GroupBy
```python
# Multiple grouping columns
result = df.groupby(['category', 'region']).agg({
'sales': 'sum',
'quantity': 'mean'
})
# Named aggregations with explicit vaex.agg objects
import vaex.agg
result = df.groupby('category').agg({
    'sales_min': vaex.agg.min('sales'),
    'sales_max': vaex.agg.max('sales')
})
# Available aggregation functions
# 'sum', 'mean', 'std', 'min', 'max', 'count', 'first', 'last'
```
### GroupBy with Binning
```python
# Bin a continuous variable (here: fixed-width bins of 10) and aggregate
df['value_bin'] = (df.value // 10) * 10
result = df.groupby('value_bin').agg({
    'sales': 'sum'
})
# Datetime binning
result = df.groupby(df.timestamp.dt.year).agg({
'sales': 'sum'
})
```
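For time series, vaex also provides dedicated time binners; a sketch assuming a datetime column `df.t` (column name and resolution are illustrative):
```python
# Group into weekly bins
result = df.groupby(vaex.BinnerTime.per_week(df.t)).agg({'sales': 'sum'})
```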
## Binning and Discretization
Create bins from continuous variables:
### Simple Binning
```python
# Create bins
df['age_bin'] = df.age.digitize([18, 30, 50, 65, 100])
# Labeled bins: digitize to bin indices, then map indices to labels
bins = [0, 18, 30, 50, 65, 100]
labels = {1: 'child', 2: 'young_adult', 3: 'adult', 4: 'middle_age', 5: 'senior'}
df['age_group'] = df.age.digitize(bins).map(labels)
```
### Statistical Binning
```python
# Equal-width bins
import numpy as np
edges = np.linspace(df.value.min(), df.value.max(), 11)  # 10 bins
df['value_bin'] = df.value.digitize(edges)
# Quantile-based bins (approximate percentiles as edges)
edges = [df.percentile_approx(df.value, p) for p in (25, 50, 75)]
df['value_quartile'] = df.value.digitize(edges)
```
## Multi-dimensional Aggregations
Compute statistics on grids:
```python
# 2D histogram/heatmap data
counts = df.count(binby=[df.x, df.y], limits=[[0, 10], [0, 10]], shape=(100, 100))
# Mean on a grid
mean_z = df.mean(df.z, binby=[df.x, df.y], limits=[[0, 10], [0, 10]], shape=(50, 50))
# Multiple statistics on a grid in one pass
mean_z = df.mean(df.z, binby=[df.x, df.y], shape=(50, 50), delay=True)
counts = df.count(binby=[df.x, df.y], shape=(50, 50), delay=True)
df.execute()
mean_grid, count_grid = mean_z.get(), counts.get()
```
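Instead of hard-coding limits, vaex can derive them from the data:
```python
# Compute bin limits from the data's min/max
counts = df.count(binby=[df.x, df.y], limits='minmax', shape=(100, 100))
# Or clip outliers with a percentile-based range
counts = df.count(binby=[df.x, df.y], limits='99.7%', shape=(100, 100))
```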
## Handling Missing Data
Work with missing, null, and NaN values:
### Detecting Missing Data
```python
# Check for missing
df['age_missing'] = df.age.isna()
df['age_present'] = df.age.notna()
# Count missing
missing_count = df.age.isna().sum()
missing_pct = df.age.isna().mean() * 100
```
### Filling and Removing Missing Data
```python
# Filter out missing
df_clean = df[df.age.notna()]
# Fill missing with value
df['age_filled'] = df.age.fillna(0)
df['age_filled'] = df.age.fillna(df.age.mean())
# Target a specific kind of missingness
# (vaex's fillna takes a value; pandas-style ffill/bfill is not available)
df['age_fillnan'] = df.age.fillnan(0)          # fill only IEEE NaN
df['age_fillmissing'] = df.age.fillmissing(0)  # fill only masked/null values
```
### Missing Data Types in Vaex
Vaex distinguishes between:
- **NaN** - IEEE floating point Not-a-Number
- **NA** - Arrow null type
- **Missing** - General term for absent data
```python
# Check whether a column is backed by a masked array (can hold missing values)
df.is_masked('column_name')
# Convert between types
df['col_masked'] = df.col.as_masked() # Convert to NA representation
```
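The distinction is visible through separate predicates; a minimal sketch:
```python
df['x_nan'] = df.x.isnan()          # only IEEE NaN
df['x_missing'] = df.x.ismissing()  # only masked/null values
df['x_any'] = df.x.isna()           # either of the two
```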
## Sorting
```python
# Sort by single column
df_sorted = df.sort('age')
df_sorted = df.sort('age', ascending=False)
# Sort by multiple columns
df_sorted = df.sort(['category', 'age'])
# Note: Sorting materializes a new column with indices
# For very large datasets, consider if sorting is necessary
```
## Joining DataFrames
Combine DataFrames based on keys:
```python
# Left join (vaex's default)
df_joined = df1.join(df2, on='key_column')
# Inner join
df_joined = df1.join(df2, on='key_column', how='inner')
# Join on different column names
df_joined = df1.join(
df2,
left_on='id',
right_on='user_id',
how='left'
)
# Multiple key columns: vaex joins on a single key,
# so combine the keys into one column first
df1['key'] = df1.key1.astype('str') + '_' + df1.key2.astype('str')
df2['key'] = df2.key1.astype('str') + '_' + df2.key2.astype('str')
df_joined = df1.join(df2, on='key')
```
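When the two frames share non-key column names, vaex's `lsuffix`/`rsuffix` parameters disambiguate them:
```python
# Overlapping column names receive a suffix
df_joined = df1.join(df2, on='key_column', lsuffix='_left', rsuffix='_right')
```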
## Adding and Removing Columns
### Adding Columns
```python
# Virtual column (no memory)
df['new_col'] = df.x + df.y
# From external array (must match length)
import numpy as np
new_data = np.random.rand(len(df))
df['random'] = new_data
# Constant value (as a virtual column)
df['constant'] = vaex.vconstant(42, len(df))
```
### Removing Columns
```python
# Drop single column
df = df.drop('column_name')
# Drop multiple columns
df = df.drop(['col1', 'col2', 'col3'])
# Select specific columns (drop others)
df = df[['col1', 'col2', 'col3']]
```
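To check what a DataFrame currently holds, a short sketch using vaex's introspection helpers:
```python
# All column names, including virtual columns
print(df.column_names)
# Only concrete (materialized) columns
print(df.get_column_names(virtual=False))
```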
### Renaming Columns
```python
# Rename single column (renames in place)
df.rename('old_name', 'new_name')
# Rename multiple columns
for old, new in {'old_name1': 'new_name1',
                 'old_name2': 'new_name2'}.items():
    df.rename(old, new)
```
## Common Patterns
### Pattern: Complex Feature Engineering
```python
# Multiple derived features
df['log_price'] = df.price.log()
df['price_per_unit'] = df.price / df.quantity
df['is_discount'] = df.discount > 0
df['price_category'] = (df.price > 100).where('expensive', 'affordable')
df['revenue'] = df.price * df.quantity * (1 - df.discount)
```
### Pattern: Text Cleaning
```python
# Clean and standardize text
df['email_clean'] = df.email.str.lower().str.strip()
df['has_valid_email'] = df.email_clean.str.contains('@')
df['domain'] = df.email_clean.str.split('@')[1]
```
### Pattern: Time-based Analysis
```python
# Extract temporal features
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day_of_week'] = df.timestamp.dt.dayofweek
df['is_weekend'] = df.day_of_week >= 5
df['quarter'] = ((df.month - 1) // 3) + 1
```
### Pattern: Grouped Statistics
```python
# Compute statistics by group
monthly_sales = df.groupby(df.timestamp.dt.month).agg({
'revenue': ['sum', 'mean', 'count'],
'quantity': 'sum'
})
# Multiple grouping levels
category_region_sales = df.groupby(['category', 'region']).agg({
'sales': 'sum',
'profit': 'mean'
})
```
## Performance Tips
1. **Use virtual columns** - They're computed on-the-fly with no memory cost
2. **Batch operations with delay=True** - Compute multiple aggregations in a single pass (see the sketch after this list)
3. **Avoid `.values` or `.to_pandas_df()`** - Keep operations lazy when possible
4. **Use selections** - Multiple named selections are more efficient than creating new DataFrames
5. **Leverage expressions** - They enable query optimization
6. **Minimize sorting** - Sorting is expensive on large datasets
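A minimal sketch combining tips 1 and 2, assuming `price` and `quantity` columns:
```python
# Virtual column: computed lazily, no memory allocated
df['revenue'] = df.price * df.quantity

# Schedule several aggregations, then execute them in one pass
total = df.revenue.sum(delay=True)
mean = df.revenue.mean(delay=True)
spread = df.revenue.std(delay=True)
df.execute()

print(total.get(), mean.get(), spread.get())
```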
## Related Resources
- For DataFrame creation: See `core_dataframes.md`
- For performance optimization: See `performance.md`
- For visualization: See `visualization.md`
- For ML pipelines: See `machine_learning.md`