# Polars Best Practices and Performance Guide
Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.
## Performance Optimization
### 1. Use Lazy Evaluation
**Always prefer lazy mode for large datasets:**
```python
# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
**Benefits of lazy evaluation** (illustrated in the sketch after this list):
- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning
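These optimizations are visible in the query plan. A minimal sketch, assuming the same hypothetical `large_file.csv` with `age` and `name` columns as above:
```python
import polars as pl

# Hypothetical file and columns, as in the example above.
lf = pl.scan_csv("large_file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# In the optimized plan, the filter (predicate pushdown) and the
# two-column projection (projection pushdown) appear inside the CSV scan
# rather than as separate steps applied afterwards.
print(query.explain())
```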
### 2. Filter and Select Early
Push filters and column selection as early as possible in the pipeline:
```python
# Bad: Process all data, then filter and select
result = (
    lf.group_by("category")
    .agg(pl.col("value").mean())
    .join(other, on="category")
    .filter(pl.col("value") > 100)
    .select("category", "value")
)
# Good: Filter and select early
result = (
    lf.select("category", "value")  # Only needed columns
    .filter(pl.col("value") > 100)  # Filter early
    .group_by("category")
    .agg(pl.col("value").mean())
    .join(other.select("category", "other_col"), on="category")
)
```
### 3. Avoid Python Functions
Stay within the expression API to maintain parallelization:
```python
# Bad: Python function disables parallelization
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```
**When you must use custom functions:**
```python
# If truly needed, be explicit
df = df.with_columns(
    result=pl.col("value").map_elements(
        custom_function,
        return_dtype=pl.Float64,
        skip_nulls=True  # Optimize null handling
    )
)
```
### 4. Use Streaming for Very Large Data
Enable streaming for datasets larger than RAM:
```python
# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```
### 5. Optimize Data Types
Choose appropriate data types to reduce memory and improve performance:
```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")
# Good: Specify optimal types
df = pl.read_csv(
    "data.csv",
    dtypes={
        "id": pl.UInt32,  # Instead of Int64 if values fit
        "category": pl.Categorical,  # For low-cardinality strings
        "date": pl.Date,  # Instead of String
        "small_int": pl.Int16,  # Instead of Int64
    }
)
```
**Type optimization guidelines** (see the sketch after this list):
- Use smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags
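To gauge the effect, compare `estimated_size()` before and after casting. A minimal sketch with made-up data:
```python
import polars as pl

# Made-up frame for illustration; substitute your own columns.
df = pl.DataFrame({
    "small_int": [1, 2, 3, 4] * 25_000,
    "category": ["north", "south", "east", "west"] * 25_000,
})
print(f"Before: {df.estimated_size('kb'):.1f} KB")

df = df.with_columns(
    pl.col("small_int").cast(pl.Int16),       # Downcast from Int64
    pl.col("category").cast(pl.Categorical),  # Dictionary-encode low-cardinality strings
)
print(f"After: {df.estimated_size('kb'):.1f} KB")
```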
### 6. Parallel Operations
Structure code to maximize parallelization:
```python
# Bad: Sequential pipe operations disable parallelization
df = (
    df.pipe(operation1)
    .pipe(operation2)
    .pipe(operation3)
)
# Good: Combined operations enable parallelization
df = df.with_columns(
    result1=operation1_expr(),
    result2=operation2_expr(),
    result3=operation3_expr()
)
```
### 7. Rechunk After Concatenation
```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])
# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```
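To check how fragmented a frame actually is, `DataFrame.n_chunks()` reports the current chunk count; a small sketch with toy frames:
```python
import polars as pl

df1 = pl.DataFrame({"x": [1, 2]})
df2 = pl.DataFrame({"x": [3, 4]})
df3 = pl.DataFrame({"x": [5, 6]})

combined = pl.concat([df1, df2, df3], rechunk=False)
print(combined.n_chunks())  # Typically 3: one chunk per input frame

combined = combined.rechunk()  # Equivalent to passing rechunk=True above
print(combined.n_chunks())  # 1: data is now contiguous
```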
## Expression Patterns
### Conditional Logic
**Simple conditions:**
```python
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("minor"))
)
```
**Multiple conditions:**
```python
df.with_columns(
    grade=pl.when(pl.col("score") >= 90)
    .then(pl.lit("A"))
    .when(pl.col("score") >= 80)
    .then(pl.lit("B"))
    .when(pl.col("score") >= 70)
    .then(pl.lit("C"))
    .when(pl.col("score") >= 60)
    .then(pl.lit("D"))
    .otherwise(pl.lit("F"))
)
```
**Complex conditions:**
```python
df.with_columns(
    category=pl.when(
        (pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
    )
    .then(pl.lit("enterprise"))
    .when(
        (pl.col("revenue") > 100000) | (pl.col("customers") > 50)
    )
    .then(pl.lit("business"))
    .otherwise(pl.lit("starter"))
)
```
### Null Handling
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
# Constant value
df.with_columns(pl.col("value").fill_null(0))
# Forward fill
df.with_columns(pl.col("value").fill_null(strategy="forward"))
# Backward fill
df.with_columns(pl.col("value").fill_null(strategy="backward"))
# Mean
df.with_columns(pl.col("value").fill_null(strategy="mean"))
# Per-group fill
df.with_columns(
    pl.col("value").fill_null(pl.col("value").mean()).over("group")
)
```
**Coalesce (first non-null):**
```python
df.with_columns(
    combined=pl.coalesce(["col1", "col2", "col3"])
)
```
### Column Selection Patterns
**By name:**
```python
df.select("col1", "col2", "col3")
```
**By pattern:**
```python
# Regex (a pattern is only recognized if it starts with ^ and ends with $)
df.select(pl.col("^sales_.*$"))
# Starts with
df.select(pl.col("^sales.*$"))
# Ends with
df.select(pl.col("^.*_total$"))
# Contains
df.select(pl.col("^.*revenue.*$"))
```
**By type:**
```python
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All string columns
df.select(pl.col(pl.Utf8))
# Multiple types
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
```
**Exclude columns:**
```python
df.select(pl.all().exclude("id", "timestamp"))
```
**Transform multiple columns:**
```python
# Apply same operation to multiple columns
df.select(
    pl.col("^sales_.*$") * 1.1  # 10% increase to all sales columns
)
```
### Aggregation Patterns
**Multiple aggregations:**
```python
df.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.col("value").std().alias("std_dev"),
    pl.col("id").count().alias("count"),
    pl.col("id").n_unique().alias("unique_count"),
    pl.col("value").min().alias("minimum"),
    pl.col("value").max().alias("maximum"),
    pl.col("value").quantile(0.5).alias("median"),
    pl.col("value").quantile(0.95).alias("p95")
)
```
**Conditional aggregations:**
```python
df.group_by("category").agg(
    # Count high values
    (pl.col("value") > 100).sum().alias("high_count"),
    # Average of filtered values
    pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
    # Conditional sum
    pl.when(pl.col("status") == "completed")
    .then(pl.col("amount"))
    .otherwise(0)
    .sum()
    .alias("completed_total")
)
```
**Grouped transformations:**
```python
df.with_columns(
    # Group statistics
    group_mean=pl.col("value").mean().over("category"),
    group_std=pl.col("value").std().over("category"),
    # Rank within groups
    rank=pl.col("value").rank().over("category"),
    # Percentage of group total
    pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
)
```
## Common Pitfalls and Anti-Patterns
### Pitfall 1: Row Iteration
```python
# Bad: Never iterate rows
for row in df.iter_rows():
    # Process row
    result = row[0] * 2
# Good: Use vectorized operations
df = df.with_columns(result=pl.col("value") * 2)
```
### Pitfall 2: Modifying in Place
```python
# Bad: Polars DataFrames are immutable; item assignment of a new column fails
df["new_col"] = df["old_col"] * 2  # Raises TypeError: use with_columns instead
# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)
```
### Pitfall 3: Not Using Expressions
```python
# Bad: String-based operations
df.select("value * 2") # Won't work
# Good: Expression-based
df.select(pl.col("value") * 2)
```
### Pitfall 4: Inefficient Joins
```python
# Bad: Join large tables without filtering
result = large_df1.join(large_df2, on="id")
# Good: Filter before joining
result = (
    large_df1.filter(pl.col("active"))
    .join(
        large_df2.filter(pl.col("status") == "valid"),
        on="id"
    )
)
```
### Pitfall 5: Not Specifying Types
```python
# Bad: Let Polars infer everything
df = pl.read_csv("data.csv")
# Good: Specify types for correctness and performance
df = pl.read_csv(
    "data.csv",
    dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
)
```
### Pitfall 6: Creating Many Small DataFrames
```python
# Bad: Many operations creating intermediate DataFrames
df1 = df.filter(pl.col("age") > 25)
df2 = df1.select("name", "age")
df3 = df2.sort("age")
result = df3.head(10)
# Good: Chain operations
result = (
    df.filter(pl.col("age") > 25)
    .select("name", "age")
    .sort("age")
    .head(10)
)
# Better: Use lazy mode
result = (
    df.lazy()
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .sort("age")
    .head(10)
    .collect()
)
```
## Memory Management
### Monitor Memory Usage
```python
# Check DataFrame size
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
# Profile memory during operations
lf = pl.scan_csv("large.csv")
print(lf.explain()) # See query plan
```
### Reduce Memory Footprint
```python
# 1. Use lazy mode
lf = pl.scan_parquet("data.parquet")
# 2. Stream results
result = lf.collect(streaming=True)
# 3. Select only needed columns
lf = lf.select("col1", "col2")
# 4. Optimize data types
df = df.with_columns(
    pl.col("int_col").cast(pl.Int32),  # Downcast if possible
    pl.col("category").cast(pl.Categorical)  # For low cardinality
)
# 5. Drop columns not needed
df = df.drop("large_text_col", "unused_col")
```
## Testing and Debugging
### Inspect Query Plans
```python
lf = pl.scan_csv("data.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
# View the optimized query plan (the default)
print(query.explain())
# View the unoptimized plan for comparison
print(query.explain(optimized=False))
```
### Sample Data for Development
```python
# Use n_rows for testing
df = pl.read_csv("large.csv", n_rows=1000)
# Or sample after reading
df_sample = df.sample(n=1000, seed=42)
```
### Validate Schemas
```python
# Check schema
print(df.schema)
# Ensure schema matches expectation
expected_schema = {
    "id": pl.Int64,
    "name": pl.Utf8,
    "date": pl.Date
}
assert df.schema == expected_schema
```
### Profile Performance
```python
import time
# Time operations
start = time.time()
result = lf.collect()
print(f"Execution time: {time.time() - start:.2f}s")
# Compare eager vs lazy
start = time.time()
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
eager_time = time.time() - start
start = time.time()
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
lazy_time = time.time() - start
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
```
## File Format Best Practices
### Choose the Right Format
**Parquet:**
- Best for: Large datasets, archival, data lakes
- Pros: Excellent compression, columnar, fast reads
- Cons: Not human-readable
**CSV:**
- Best for: Small datasets, human inspection, legacy systems
- Pros: Universal, human-readable
- Cons: Slow, large file size, no type preservation
**Arrow IPC:**
- Best for: Inter-process communication, temporary storage
- Pros: Fastest, zero-copy, preserves all types
- Cons: Less compression than Parquet
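For a rough comparison of the three formats, the same frame can be written once per format and the resulting file sizes inspected; a minimal sketch with made-up data and file names:
```python
import os
import polars as pl

# Made-up frame; substitute your own data.
df = pl.DataFrame({"id": range(100_000), "value": [1.5] * 100_000})

df.write_parquet("sample.parquet")  # Columnar, compressed
df.write_csv("sample.csv")          # Plain text, no type preservation
df.write_ipc("sample.arrow")        # Arrow IPC: fast, preserves types

for path in ("sample.parquet", "sample.csv", "sample.arrow"):
    print(f"{path}: {os.path.getsize(path):,} bytes")
```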
### File Reading Best Practices
```python
# 1. Use lazy reading
lf = pl.scan_parquet("data.parquet") # Not read_parquet
# 2. Read multiple files efficiently
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
# 3. Specify schema when known
lf = pl.scan_csv(
    "data.csv",
    dtypes={"id": pl.Int64, "date": pl.Date}
)
# 4. Use predicate pushdown
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
```
### File Writing Best Practices
```python
# 1. Use Parquet for large data
df.write_parquet("output.parquet", compression="zstd")
# 2. Partition large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Use streaming for very large writes
lf.sink_parquet("output.parquet") # Streaming write
# 4. Optimize compression
df.write_parquet(
    "output.parquet",
    compression="snappy",  # Fast compression
    statistics=True  # Enable predicate pushdown on read
)
```
## Code Organization
### Reusable Expressions
```python
# Define reusable expressions
age_group = (
    pl.when(pl.col("age") < 18)
    .then(pl.lit("minor"))
    .when(pl.col("age") < 65)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("senior"))
)
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
# Use in multiple contexts
df = df.with_columns(
    age_group=age_group,
    rpc=revenue_per_customer
)
# Reuse in filtering
df = df.filter(revenue_per_customer > 100)
```
### Pipeline Functions
```python
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Clean and standardize data."""
    return lf.with_columns(
        pl.col("name").str.to_uppercase(),
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("amount").fill_null(0)
    )

def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Add computed features."""
    return lf.with_columns(
        month=pl.col("date").dt.month(),
        year=pl.col("date").dt.year(),
        amount_log=pl.col("amount").log()
    )

# Compose pipeline
result = (
    pl.scan_csv("data.csv")
    .pipe(clean_data)
    .pipe(add_features)
    .filter(pl.col("year") == 2023)
    .collect()
)
```
## Documentation
Always document complex expressions and transformations:
```python
# Good: Document intent
df = df.with_columns(
    # Calculate customer lifetime value as sum of purchases
    # divided by months since first purchase
    clv=(
        pl.col("total_purchases") /
        ((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
         .dt.total_days() / 30)
    )
)
```
## Version Compatibility
```python
# Check Polars version
import polars as pl
print(pl.__version__)
# Feature availability varies by version
# Document version requirements for production code
```
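If production code depends on a feature from a particular release, a simple version gate makes the requirement explicit. A minimal sketch (the minimum version below is only an example):
```python
import polars as pl

# Example threshold only; set this to the release your code actually needs.
MIN_VERSION = (1, 0, 0)

# Naive parse: assumes a plain "X.Y.Z" version string.
installed = tuple(int(part) for part in pl.__version__.split(".")[:3])
if installed < MIN_VERSION:
    raise RuntimeError(
        f"polars >= {'.'.join(map(str, MIN_VERSION))} is required, "
        f"found {pl.__version__}"
    )
```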