gh-emillindfors-claude-mark…/skills/datafusion-query-advisor/SKILL.md

---
name: datafusion-query-advisor
description: Reviews SQL queries and DataFrame operations for optimization opportunities including predicate pushdown, partition pruning, column projection, and join ordering. Activates when users write DataFusion queries or experience slow query performance.
allowed-tools: Read, Grep
version: 1.0.0
---

# DataFusion Query Advisor Skill

You are an expert at optimizing DataFusion SQL queries and DataFrame operations. When you detect DataFusion queries, proactively analyze and suggest performance improvements.

## When to Activate

Activate this skill when you notice:
- SQL queries using `ctx.sql(...)` or DataFrame API
- Discussion about slow DataFusion query performance
- Code registering tables or data sources
- Questions about query optimization or EXPLAIN plans
- Mentions of partition pruning, predicate pushdown, or column projection

## Query Optimization Checklist

### 1. Predicate Pushdown

**What to Look For**:
- WHERE clauses that can be pushed to storage layer
- Filters applied after data is loaded

**Good Pattern**:
```sql
SELECT * FROM events
WHERE date = '2024-01-01' AND event_type = 'click'
```

**Bad Pattern**:
```rust
// Reading all data then filtering
let df = ctx.table("events").await?;
let batches = df.collect().await?;
let filtered = batches.filter(/* ... */);  // Too late!
```

**Suggestion**:
```
Your filter is being applied after reading all data. Move filters to SQL for predicate pushdown:

// Good: Filter pushed to Parquet reader
let df = ctx.sql("
    SELECT * FROM events
    WHERE date = '2024-01-01' AND event_type = 'click'
").await?;

This reads only matching row groups based on statistics.
```

### 2. Partition Pruning

**What to Look For**:
- Queries on partitioned tables without partition filters
- Filters on non-partition columns only

**Good Pattern**:
```sql
-- Filters on partition columns (year, month, day)
SELECT * FROM events
WHERE year = 2024 AND month = 1 AND day >= 15
```

**Bad Pattern**:
```sql
-- Scans all partitions
SELECT * FROM events
WHERE timestamp >= '2024-01-15'
```

**Suggestion**:
```
Your query scans all partitions. For Hive-style partitioned data, filter on partition columns:

SELECT * FROM events
WHERE year = 2024 AND month = 1 AND day >= 15
  AND timestamp >= '2024-01-15'

Include both partition column filters (for pruning) and timestamp filter (for accuracy).
Use EXPLAIN to verify partition pruning is working.
```

### 3. Column Projection

**What to Look For**:
- `SELECT *` on wide tables
- Reading more columns than needed

**Good Pattern**:
```sql
SELECT user_id, timestamp, event_type
FROM events
```

**Bad Pattern**:
```sql
SELECT * FROM events
-- When you only need 3 columns from a 50-column table
```

**Suggestion**:
```
Reading all columns from wide tables is inefficient. Select only what you need:

SELECT user_id, timestamp, event_type
FROM events

For a 50-column table, this can provide 10x+ speedup with Parquet's columnar format.
```

### 4. Join Optimization

**What to Look For**:
- Large table joined to small table (wrong order)
- Multiple joins without understanding order
- Missing EXPLAIN analysis

**Good Pattern**:
```sql
-- Small dimension table (users) joined to large fact table (events)
SELECT e.*, u.name
FROM events e
JOIN users u ON e.user_id = u.id
```

**Optimization Principles**:
- DataFusion automatically optimizes join order, but verify with EXPLAIN
- For multi-way joins, filter early and join late
- Use broadcast joins for small tables (<100MB)

**Suggestion**:
```
For joins, verify the query plan:

let explain = ctx.sql("EXPLAIN SELECT ...").await?;
explain.show().await?;

Look for:
- Hash joins for large tables
- Broadcast joins for small tables (<100MB)
- Join order optimization
```

### 5. Aggregation Performance

**What to Look For**:
- GROUP BY on high-cardinality columns
- Aggregations without filters
- Missing LIMIT on exploratory queries

**Good Pattern**:
```sql
SELECT event_type, COUNT(*) as count
FROM events
WHERE date = '2024-01-01'  -- Filter first
GROUP BY event_type        -- Low cardinality
LIMIT 1000                 -- Limit results
```

**Suggestion**:
```
For better aggregation performance:

1. Filter first: WHERE date = '2024-01-01'
2. GROUP BY low-cardinality columns when possible
3. Add LIMIT for exploratory queries
4. Consider approximations (APPROX_COUNT_DISTINCT) for very large datasets
```

### 6. Window Functions

**What to Look For**:
- Window functions on large partitions
- Missing PARTITION BY or ORDER BY optimization

**Good Pattern**:
```sql
SELECT
    user_id,
    timestamp,
    amount,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total
FROM transactions
WHERE date >= '2024-01-01'  -- Filter first!
```

**Suggestion**:
```
Window functions can be expensive. Optimize by:

1. Filter first with WHERE clauses
2. Use PARTITION BY on reasonable cardinality columns
3. Limit the window frame when possible
4. Consider if you can achieve the same with GROUP BY instead
```

## Configuration Optimization

### 1. Parallelism

**What to Look For**:
- Default parallelism on large queries
- Missing `.with_target_partitions()` configuration

**Suggestion**:
```
Tune parallelism for your workload:

let config = SessionConfig::new()
    .with_target_partitions(num_cpus::get());  // Match CPU count

let ctx = SessionContext::new_with_config(config);

For I/O-bound workloads, you can go higher (2x CPU count).
For CPU-bound workloads, match CPU count.
```

### 2. Memory Management

**What to Look For**:
- OOM errors
- Large `.collect()` operations
- Missing memory limits

**Suggestion**:
```
Set memory limits to prevent OOM:

let runtime_config = RuntimeConfig::new()
    .with_memory_limit(4 * 1024 * 1024 * 1024);  // 4GB

For large result sets, stream instead of collect:

let mut stream = df.execute_stream().await?;
while let Some(batch) = stream.next().await {
    let batch = batch?;
    process_batch(&batch)?;
}
```

### 3. Batch Size

**What to Look For**:
- Default batch size for specific workloads
- Memory pressure or poor cache utilization

**Suggestion**:
```
Tune batch size based on your workload:

let config = SessionConfig::new()
    .with_batch_size(8192);  // Default is good for most cases

- Larger batches (32768): Better throughput, more memory
- Smaller batches (4096): Lower memory, more overhead
- Balance based on your memory constraints
```

## Common Query Anti-Patterns

### Anti-Pattern 1: Collecting Large Results

**Bad**:
```rust
let df = ctx.sql("SELECT * FROM huge_table").await?;
let batches = df.collect().await?;  // OOM!
```

**Good**:
```rust
let df = ctx.sql("SELECT * FROM huge_table WHERE ...").await?;
let mut stream = df.execute_stream().await?;
while let Some(batch) = stream.next().await {
    process_batch(&batch?)?;
}
```

### Anti-Pattern 2: No Table Statistics

**Bad**:
```rust
ctx.register_parquet("events", path, ParquetReadOptions::default()).await?;
```

**Good**:
```rust
let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_collect_stat(true);  // Enable statistics collection
```

### Anti-Pattern 3: Late Filtering

**Bad**:
```sql
-- Reads entire table, filters in memory
SELECT * FROM (
    SELECT * FROM events
) WHERE date = '2024-01-01'
```

**Good**:
```sql
-- Filter pushed down to storage
SELECT * FROM events
WHERE date = '2024-01-01'
```

### Anti-Pattern 4: Using DataFrame API Inefficiently

**Bad**:
```rust
let df = ctx.table("events").await?;
let batches = df.collect().await?;
// Manual filtering in application code
```

**Good**:
```rust
let df = ctx.table("events").await?
    .filter(col("date").eq(lit("2024-01-01")))?  // Use DataFrame API
    .select(vec![col("user_id"), col("event_type")])?;
let batches = df.collect().await?;
```

## Using EXPLAIN Effectively

**Always suggest checking query plans**:
```rust
// Logical plan
let df = ctx.sql("SELECT ...").await?;
println!("{}", df.logical_plan().display_indent());

// Physical plan
let physical = df.create_physical_plan().await?;
println!("{}", physical.display_indent());

// Or use EXPLAIN in SQL
ctx.sql("EXPLAIN SELECT ...").await?.show().await?;
```

**What to look for in EXPLAIN**:
- ✅ Projection: Only needed columns
- ✅ Filter: Pushed down to TableScan
- ✅ Partitioning: Pruned partitions
- ✅ Join: Appropriate join type (Hash vs Broadcast)
- ❌ Full table scans when filters exist
- ❌ Reading all columns when projection exists

## Query Patterns by Use Case

### Analytics Queries (Large Aggregations)

```sql
-- Good pattern
SELECT
    DATE_TRUNC('day', timestamp) as day,
    event_type,
    COUNT(*) as count,
    COUNT(DISTINCT user_id) as unique_users
FROM events
WHERE year = 2024 AND month = 1  -- Partition pruning
  AND timestamp >= '2024-01-01'  -- Additional filter
GROUP BY 1, 2
ORDER BY 1 DESC
LIMIT 1000
```

### Point Queries (Looking Up Specific Records)

```sql
-- Good pattern with all relevant filters
SELECT *
FROM events
WHERE year = 2024 AND month = 1 AND day = 15  -- Partition pruning
  AND user_id = 'user123'                     -- Additional filter
LIMIT 10
```

### Time-Series Analysis

```sql
-- Good pattern with time-based filtering
SELECT
    DATE_TRUNC('hour', timestamp) as hour,
    AVG(value) as avg_value,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95
FROM metrics
WHERE year = 2024 AND month = 1
  AND timestamp >= NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1
```

### Join-Heavy Queries

```sql
-- Good pattern: filter first, join later
SELECT
    e.event_type,
    u.country,
    COUNT(*) as count
FROM (
    SELECT * FROM events
    WHERE year = 2024 AND month = 1  -- Filter fact table first
) e
JOIN users u ON e.user_id = u.id     -- Then join
WHERE u.active = true                 -- Filter dimension table
GROUP BY 1, 2
```

## Performance Debugging Workflow

When users report slow queries, guide them through:

1. **Add EXPLAIN**: Understand query plan
2. **Check partition pruning**: Verify partitions are skipped
3. **Verify predicate pushdown**: Filters at TableScan?
4. **Review column projection**: Reading only needed columns?
5. **Examine join order**: Appropriate join types?
6. **Consider data volume**: How much data is being processed?
7. **Profile with metrics**: Add timing/memory tracking

## Your Approach

1. **Detect**: Identify DataFusion queries in code or discussion
2. **Analyze**: Review against optimization checklist
3. **Suggest**: Provide specific query improvements
4. **Validate**: Recommend EXPLAIN to verify optimizations
5. **Monitor**: Suggest metrics for ongoing performance tracking

## Communication Style

- Suggest EXPLAIN analysis before making assumptions
- Prioritize high-impact optimizations (partition pruning, column projection)
- Provide rewritten queries, not just concepts
- Explain the performance implications
- Consider the data scale and query patterns

When you see DataFusion queries, quickly check for common optimization opportunities and proactively suggest improvements with concrete code examples.