# Polars Best Practices and Performance Guide
Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.
## Performance Optimization
### 1. Use Lazy Evaluation
**Always prefer lazy mode for large datasets:**
```python
# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
**Benefits of lazy evaluation** (illustrated in the sketch after this list):
- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning
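These optimizations are visible in the query plan. A minimal sketch, assuming the same hypothetical `large_file.csv` with `age` and `name` columns as above:
```python
import polars as pl

# Hypothetical file and columns, as in the example above.
lf = pl.scan_csv("large_file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# In the optimized plan, the filter (predicate pushdown) and the
# two-column projection (projection pushdown) appear inside the CSV scan
# rather than as separate steps applied afterwards.
print(query.explain())
```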
### 2. Filter and Select Early
Push filters and column selection as early as possible in the pipeline:
```python
# Bad: Process all data, then filter and select
result = (
    lf.group_by("category")
    .agg(pl.col("value").mean())
    .join(other, on="category")
    .filter(pl.col("value") > 100)
    .select("category", "value")
)
# Good: Filter and select early
result = (
    lf.select("category", "value")  # Only needed columns
    .filter(pl.col("value") > 100)  # Filter early
    .group_by("category")
    .agg(pl.col("value").mean())
    .join(other.select("category", "other_col"), on="category")
)
```
### 3. Avoid Python Functions
Stay within the expression API to maintain parallelization:
```python
# Bad: Python function disables parallelization
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```
**When you must use custom functions:**
```python
# If truly needed, be explicit
df = df.with_columns(
    result=pl.col("value").map_elements(
        custom_function,
        return_dtype=pl.Float64,
        skip_nulls=True  # Optimize null handling
    )
)
```
### 4. Use Streaming for Very Large Data
Enable streaming for datasets larger than RAM:
```python
# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```
### 5. Optimize Data Types
Choose appropriate data types to reduce memory and improve performance:
```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")
# Good: Specify optimal types
df = pl.read_csv(
    "data.csv",
    dtypes={
        "id": pl.UInt32,  # Instead of Int64 if values fit
        "category": pl.Categorical,  # For low-cardinality strings
        "date": pl.Date,  # Instead of String
        "small_int": pl.Int16,  # Instead of Int64
    }
)
```
**Type optimization guidelines** (see the sketch after this list):
- Use smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags
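To gauge the effect, compare `estimated_size()` before and after casting. A minimal sketch with made-up data:
```python
import polars as pl

# Made-up frame for illustration; substitute your own columns.
df = pl.DataFrame({
    "small_int": [1, 2, 3, 4] * 25_000,
    "category": ["north", "south", "east", "west"] * 25_000,
})
print(f"Before: {df.estimated_size('kb'):.1f} KB")

df = df.with_columns(
    pl.col("small_int").cast(pl.Int16),       # Downcast from Int64
    pl.col("category").cast(pl.Categorical),  # Dictionary-encode low-cardinality strings
)
print(f"After: {df.estimated_size('kb'):.1f} KB")
```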
### 6. Parallel Operations
Structure code to maximize parallelization:
```python
# Bad: Sequential pipe operations disable parallelization
df = (
    df.pipe(operation1)
    .pipe(operation2)
    .pipe(operation3)
)
# Good: Combined operations enable parallelization
df = df.with_columns(
    result1=operation1_expr(),
    result2=operation2_expr(),
    result3=operation3_expr()
)
```
### 7. Rechunk After Concatenation
```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])
# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```
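To check how fragmented a frame actually is, `DataFrame.n_chunks()` reports the current chunk count; a small sketch with toy frames:
```python
import polars as pl

df1 = pl.DataFrame({"x": [1, 2]})
df2 = pl.DataFrame({"x": [3, 4]})
df3 = pl.DataFrame({"x": [5, 6]})

combined = pl.concat([df1, df2, df3], rechunk=False)
print(combined.n_chunks())  # Typically 3: one chunk per input frame

combined = combined.rechunk()  # Equivalent to passing rechunk=True above
print(combined.n_chunks())  # 1: data is now contiguous
```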
## Expression Patterns
### Conditional Logic
**Simple conditions:**
```python
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("minor"))
)
```
**Multiple conditions:**
```python
df.with_columns(
    grade=pl.when(pl.col("score") >= 90)
    .then(pl.lit("A"))
    .when(pl.col("score") >= 80)
    .then(pl.lit("B"))
    .when(pl.col("score") >= 70)
    .then(pl.lit("C"))
    .when(pl.col("score") >= 60)
    .then(pl.lit("D"))
    .otherwise(pl.lit("F"))
)
```
**Complex conditions:**
```python
df.with_columns(
    category=pl.when(
        (pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
    )
    .then(pl.lit("enterprise"))
    .when(
        (pl.col("revenue") > 100000) | (pl.col("customers") > 50)
    )
    .then(pl.lit("business"))
    .otherwise(pl.lit("starter"))
)
```
### Null Handling
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
# Constant value
df.with_columns(pl.col("value").fill_null(0))
# Forward fill
df.with_columns(pl.col("value").fill_null(strategy="forward"))
# Backward fill
df.with_columns(pl.col("value").fill_null(strategy="backward"))
# Mean
df.with_columns(pl.col("value").fill_null(strategy="mean"))
# Per-group fill
df.with_columns(
    pl.col("value").fill_null(pl.col("value").mean()).over("group")
)
```
**Coalesce (first non-null):**
```python
df.with_columns(
    combined=pl.coalesce(["col1", "col2", "col3"])
)
```
### Column Selection Patterns
**By name:**
```python
df.select("col1", "col2", "col3")
```
**By pattern:**
```python
# Regex (a pattern is only recognized if it starts with ^ and ends with $)
df.select(pl.col("^sales_.*$"))
# Starts with
df.select(pl.col("^sales.*$"))
# Ends with
df.select(pl.col("^.*_total$"))
# Contains
df.select(pl.col("^.*revenue.*$"))
```
**By type:**
```python
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All string columns
df.select(pl.col(pl.Utf8))
# Multiple types
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
```
**Exclude columns:**
```python
df.select(pl.all().exclude("id", "timestamp"))
```
**Transform multiple columns:**
```python
# Apply same operation to multiple columns
df.select(
    pl.col("^sales_.*$") * 1.1  # 10% increase to all sales columns
)
```
### Aggregation Patterns
**Multiple aggregations:**
```python
df.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.col("value").std().alias("std_dev"),
    pl.col("id").count().alias("count"),
    pl.col("id").n_unique().alias("unique_count"),
    pl.col("value").min().alias("minimum"),
    pl.col("value").max().alias("maximum"),
    pl.col("value").quantile(0.5).alias("median"),
    pl.col("value").quantile(0.95).alias("p95")
)
```
**Conditional aggregations:**
```python
df.group_by("category").agg(
    # Count high values
    (pl.col("value") > 100).sum().alias("high_count"),
    # Average of filtered values
    pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
    # Conditional sum
    pl.when(pl.col("status") == "completed")
    .then(pl.col("amount"))
    .otherwise(0)
    .sum()
    .alias("completed_total")
)
```
**Grouped transformations:**
```python
df.with_columns(
    # Group statistics
    group_mean=pl.col("value").mean().over("category"),
    group_std=pl.col("value").std().over("category"),
    # Rank within groups
    rank=pl.col("value").rank().over("category"),
    # Percentage of group total
    pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
)
```
## Common Pitfalls and Anti-Patterns
### Pitfall 1: Row Iteration
```python
# Bad: Never iterate rows
for row in df.iter_rows():
    # Process row
    result = row[0] * 2
# Good: Use vectorized operations
df = df.with_columns(result=pl.col("value") * 2)
```
### Pitfall 2: Modifying in Place
```python
# Bad: Polars DataFrames are immutable; item assignment of a new column fails
df["new_col"] = df["old_col"] * 2  # Raises TypeError: use with_columns instead
# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)
```
### Pitfall 3: Not Using Expressions
```python
# Bad: String-based operations
df.select("value * 2") # Won't work
# Good: Expression-based
df.select(pl.col("value") * 2)
```
### Pitfall 4: Inefficient Joins
```python
# Bad: Join large tables without filtering
result = large_df1.join(large_df2, on="id")
# Good: Filter before joining
result = (
    large_df1.filter(pl.col("active"))
    .join(
        large_df2.filter(pl.col("status") == "valid"),
        on="id"
    )
)
```
### Pitfall 5: Not Specifying Types
```python
# Bad: Let Polars infer everything
df = pl.read_csv("data.csv")
# Good: Specify types for correctness and performance
df = pl.read_csv(
    "data.csv",
    dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
)
```
### Pitfall 6: Creating Many Small DataFrames
```python
# Bad: Many operations creating intermediate DataFrames
df1 = df.filter(pl.col("age") > 25)
df2 = df1.select("name", "age")
df3 = df2.sort("age")
result = df3.head(10)
# Good: Chain operations
result = (
    df.filter(pl.col("age") > 25)
    .select("name", "age")
    .sort("age")
    .head(10)
)
# Better: Use lazy mode
result = (
    df.lazy()
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .sort("age")
    .head(10)
    .collect()
)
```
## Memory Management
### Monitor Memory Usage
```python
# Check DataFrame size
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
# Profile memory during operations
lf = pl.scan_csv("large.csv")
print(lf.explain()) # See query plan
```
### Reduce Memory Footprint
```python
# 1. Use lazy mode
lf = pl.scan_parquet("data.parquet")
# 2. Stream results
result = lf.collect(streaming=True)
# 3. Select only needed columns
lf = lf.select("col1", "col2")
# 4. Optimize data types
df = df.with_columns(
    pl.col("int_col").cast(pl.Int32),  # Downcast if possible
    pl.col("category").cast(pl.Categorical)  # For low cardinality
)
# 5. Drop columns not needed
df = df.drop("large_text_col", "unused_col")
```
## Testing and Debugging
### Inspect Query Plans
```python
lf = pl.scan_csv("data.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
# View the optimized query plan (the default)
print(query.explain())
# View the unoptimized plan for comparison
print(query.explain(optimized=False))
```
### Sample Data for Development
```python
# Use n_rows for testing
df = pl.read_csv("large.csv", n_rows=1000)
# Or sample after reading
df_sample = df.sample(n=1000, seed=42)
```
### Validate Schemas
```python
# Check schema
print(df.schema)
# Ensure schema matches expectation
expected_schema = {
    "id": pl.Int64,
    "name": pl.Utf8,
    "date": pl.Date
}
assert df.schema == expected_schema
```
### Profile Performance
```python
import time
# Time operations
start = time.time()
result = lf.collect()
print(f"Execution time: {time.time() - start:.2f}s")
# Compare eager vs lazy
start = time.time()
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
eager_time = time.time() - start
start = time.time()
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
lazy_time = time.time() - start
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
```
## File Format Best Practices
### Choose the Right Format
**Parquet:**
- Best for: Large datasets, archival, data lakes
- Pros: Excellent compression, columnar, fast reads
- Cons: Not human-readable
**CSV:**
- Best for: Small datasets, human inspection, legacy systems
- Pros: Universal, human-readable
- Cons: Slow, large file size, no type preservation
**Arrow IPC:**
- Best for: Inter-process communication, temporary storage
- Pros: Fastest, zero-copy, preserves all types
- Cons: Less compression than Parquet
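For a rough comparison of the three formats, the same frame can be written once per format and the resulting file sizes inspected; a minimal sketch with made-up data and file names:
```python
import os
import polars as pl

# Made-up frame; substitute your own data.
df = pl.DataFrame({"id": range(100_000), "value": [1.5] * 100_000})

df.write_parquet("sample.parquet")  # Columnar, compressed
df.write_csv("sample.csv")          # Plain text, no type preservation
df.write_ipc("sample.arrow")        # Arrow IPC: fast, preserves types

for path in ("sample.parquet", "sample.csv", "sample.arrow"):
    print(f"{path}: {os.path.getsize(path):,} bytes")
```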
### File Reading Best Practices
```python
# 1. Use lazy reading
lf = pl.scan_parquet("data.parquet") # Not read_parquet
# 2. Read multiple files efficiently
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
# 3. Specify schema when known
lf = pl.scan_csv(
    "data.csv",
    dtypes={"id": pl.Int64, "date": pl.Date}
)
# 4. Use predicate pushdown
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
```
### File Writing Best Practices
```python
# 1. Use Parquet for large data
df.write_parquet("output.parquet", compression="zstd")
# 2. Partition large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Use streaming for very large writes
lf.sink_parquet("output.parquet") # Streaming write
# 4. Optimize compression
df.write_parquet(
    "output.parquet",
    compression="snappy",  # Fast compression
    statistics=True  # Enable predicate pushdown on read
)
```
## Code Organization
### Reusable Expressions
```python
# Define reusable expressions
age_group = (
    pl.when(pl.col("age") < 18)
    .then(pl.lit("minor"))
    .when(pl.col("age") < 65)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("senior"))
)
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
# Use in multiple contexts
df = df.with_columns(
    age_group=age_group,
    rpc=revenue_per_customer
)
# Reuse in filtering
df = df.filter(revenue_per_customer > 100)
```
### Pipeline Functions
```python
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Clean and standardize data."""
    return lf.with_columns(
        pl.col("name").str.to_uppercase(),
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("amount").fill_null(0)
    )

def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Add computed features."""
    return lf.with_columns(
        month=pl.col("date").dt.month(),
        year=pl.col("date").dt.year(),
        amount_log=pl.col("amount").log()
    )

# Compose pipeline
result = (
    pl.scan_csv("data.csv")
    .pipe(clean_data)
    .pipe(add_features)
    .filter(pl.col("year") == 2023)
    .collect()
)
```
## Documentation
Always document complex expressions and transformations:
```python
# Good: Document intent
df = df.with_columns(
    # Calculate customer lifetime value as sum of purchases
    # divided by months since first purchase
    clv=(
        pl.col("total_purchases") /
        ((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
         .dt.total_days() / 30)
    )
)
```
## Version Compatibility
```python
# Check Polars version
import polars as pl
print(pl.__version__)
# Feature availability varies by version
# Document version requirements for production code
```
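If production code depends on a feature from a particular release, a simple version gate makes the requirement explicit. A minimal sketch (the minimum version below is only an example):
```python
import polars as pl

# Example threshold only; set this to the release your code actually needs.
MIN_VERSION = (1, 0, 0)

# Naive parse: assumes a plain "X.Y.Z" version string.
installed = tuple(int(part) for part in pl.__version__.split(".")[:3])
if installed < MIN_VERSION:
    raise RuntimeError(
        f"polars >= {'.'.join(map(str, MIN_VERSION))} is required, "
        f"found {pl.__version__}"
    )
```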