Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/polars/SKILL.md Normal file

@@ -0,0 +1,381 @@
---
name: polars
description: "Fast DataFrame library (Apache Arrow). Select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, expression API, for high-performance data analysis workflows."
---
# Polars
## Overview
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Use its expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
## Quick Start
### Installation and Basic Usage
Install Polars:
```bash
uv pip install polars
```
Basic DataFrame creation and operations:
```python
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
```
## Core Concepts
### Expressions
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
**Key principles:**
- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)
**Example:**
```python
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
```
### Lazy vs Eager Evaluation
**Eager (DataFrame):** Operations execute immediately
```python
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
```
**Lazy (LazyFrame):** Operations build a query plan, optimized before execution
```python
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
```
**When to use lazy:**
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- Performance is critical
**Benefits of lazy evaluation** (visible in the optimized plan, sketched after this list):
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
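A minimal sketch of inspecting that plan, assuming the same hypothetical `file.csv` with `name` and `age` columns used above; `explain()` prints the optimized query so the pushdowns are visible:
```python
import polars as pl

# Build a lazy query; nothing is read yet
lf = pl.scan_csv("file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# The printed plan shows the filter (predicate pushdown) and the
# column selection (projection pushdown) folded into the CSV scan
print(query.explain())
```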
For detailed concepts, load `references/core_concepts.md`.
## Common Operations
### Select
Select and manipulate columns:
```python
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```
### Filter
Filter rows by conditions:
```python
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
```
### With Columns
Add or modify columns while preserving existing ones:
```python
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
```
### Group By and Aggregations
Group data and compute aggregations:
```python
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
```
For detailed operation patterns, load `references/operations.md`.
## Aggregations and Window Functions
### Aggregation Functions
Common aggregations within the `group_by` context (combined in the example below):
- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values
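As a rough sketch, these can all be combined in a single `agg` call; the `sales` DataFrame and its `region`/`amount`/`date` columns here are illustrative:
```python
import polars as pl

sales = pl.DataFrame({
    "region": ["east", "east", "west"],
    "amount": [100, 250, 300],
    "date": ["2023-01-01", "2023-02-01", "2023-01-15"],
})

sales.group_by("region").agg(
    pl.len().alias("n_orders"),
    pl.col("amount").sum().alias("total"),
    pl.col("amount").mean().alias("avg"),
    pl.col("amount").min().alias("smallest"),
    pl.col("amount").max().alias("largest"),
    pl.first("date").alias("first_date"),
    pl.last("date").alias("last_date"),
)
```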
### Window Functions with `over()`
Apply aggregations while preserving row count:
```python
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
```
**Mapping strategies** (syntax shown in the sketch below):
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster but groups rows together
- `join`: Creates list columns
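A small sketch of the `mapping_strategy` parameter on hypothetical `category`/`value` columns; the default broadcasts group results back to the original rows, while `join` gathers each group's values into a list column:
```python
import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})

df.with_columns(
    # group_to_rows (default): one aggregated value broadcast per row
    group_mean=pl.col("value").mean().over("category"),
    # join: the group's values collected into a list on every row
    group_values=pl.col("value").over("category", mapping_strategy="join"),
)
```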
## Data I/O
### Supported Formats
Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
### Common I/O Operations
**CSV:**
```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```
**Parquet (recommended for performance):**
```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```
**JSON:**
```python
df = pl.read_json("file.json")
df.write_json("output.json")
```
For comprehensive I/O documentation, load `references/io_guide.md`.
## Transformations
### Joins
Combine DataFrames:
```python
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```
### Concatenation
Stack DataFrames:
```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```
### Pivot and Unpivot
Reshape data:
```python
# Pivot (wide format)
df.pivot(values="sales", index="date", columns="product")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
```
For detailed transformation examples, load `references/transformations.md`.
## Pandas Migration
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
### Conceptual Differences
- **No index**: Polars uses integer positions only
- **Strict typing**: No silent type conversions
- **Lazy evaluation**: Available via LazyFrame
- **Parallel by default**: Operations parallelized automatically
### Common Operation Mappings
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col").transform(...)` | `df.with_columns(pl.col("x").mean().over("col"))` |
### Key Syntax Patterns
**Pandas sequential (slow):**
```python
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
```
**Polars parallel (fast):**
```python
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
```
For comprehensive migration guide, load `references/pandas_migration.md`.
## Best Practices
### Performance Optimization
1. **Use lazy evaluation for large datasets:**
```python
lf = pl.scan_csv("large.csv") # Don't use read_csv
result = lf.filter(...).select(...).collect()
```
2. **Avoid Python functions in hot paths:**
- Stay within expression API for parallelization
- Use `.map_elements()` only when necessary
- Prefer native Polars operations
3. **Use streaming for very large data:**
```python
lf.collect(streaming=True)
```
4. **Select only needed columns early:**
```python
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
```
5. **Use appropriate data types** (see the sketch after this list):
- Categorical for low-cardinality strings
- Appropriate integer sizes (i32 vs i64)
- Date types for temporal data
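A minimal sketch of narrowing types; the DataFrame and column names here are illustrative:
```python
import polars as pl

df = pl.DataFrame({
    "status": ["open", "closed", "open"],
    "count": [1, 2, 3],
    "day": ["2023-01-01", "2023-01-02", "2023-01-03"],
})

df = df.with_columns(
    pl.col("status").cast(pl.Categorical),            # low-cardinality strings
    pl.col("count").cast(pl.Int32),                   # values fit in 32 bits
    pl.col("day").str.strptime(pl.Date, "%Y-%m-%d"),  # proper temporal type
)
```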
### Expression Patterns
**Conditional operations:**
```python
pl.when(condition).then(value).otherwise(other_value)
```
**Column operations across multiple columns:**
```python
df.select(pl.col("^.*_value$") * 2) # Regex pattern
```
**Null handling:**
```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```
For additional best practices and patterns, load `references/best_practices.md`.
## Resources
This skill includes comprehensive reference documentation:
### references/
- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and type system
- `operations.md` - Comprehensive guide to all common operations with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns
Load these references as needed when users require detailed information about specific topics.

skills/polars/references/best_practices.md Normal file

@@ -0,0 +1,649 @@
# Polars Best Practices and Performance Guide
Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.
## Performance Optimization
### 1. Use Lazy Evaluation
**Always prefer lazy mode for large datasets:**
```python
# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
**Benefits of lazy evaluation:**
- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning
### 2. Filter and Select Early
Push filters and column selection as early as possible in the pipeline:
```python
# Bad: Process all data, then filter and select
result = (
lf.group_by("category")
.agg(pl.col("value").mean())
.join(other, on="category")
.filter(pl.col("value") > 100)
.select("category", "value")
)
# Good: Filter and select early
result = (
lf.select("category", "value") # Only needed columns
.filter(pl.col("value") > 100) # Filter early
.group_by("category")
.agg(pl.col("value").mean())
.join(other.select("category", "other_col"), on="category")
)
```
### 3. Avoid Python Functions
Stay within the expression API to maintain parallelization:
```python
# Bad: Python function disables parallelization
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```
**When you must use custom functions:**
```python
# If truly needed, be explicit
df = df.with_columns(
result=pl.col("value").map_elements(
custom_function,
return_dtype=pl.Float64,
skip_nulls=True # Optimize null handling
)
)
```
### 4. Use Streaming for Very Large Data
Enable streaming for datasets larger than RAM:
```python
# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```
### 5. Optimize Data Types
Choose appropriate data types to reduce memory and improve performance:
```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")
# Good: Specify optimal types
df = pl.read_csv(
"data.csv",
dtypes={
"id": pl.UInt32, # Instead of Int64 if values fit
"category": pl.Categorical, # For low-cardinality strings
"date": pl.Date, # Instead of String
"small_int": pl.Int16, # Instead of Int64
}
)
```
**Type optimization guidelines** (a quick size check follows the list):
- Use smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags
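One way to sanity-check the effect, sketched with illustrative column names, is to compare `estimated_size()` before and after casting:
```python
import polars as pl

df = pl.DataFrame({
    "category": ["a", "b", "a"] * 1000,
    "flag": [0, 1, 1] * 1000,
})

slim = df.with_columns(
    pl.col("category").cast(pl.Categorical),
    pl.col("flag").cast(pl.Boolean),
)

print(f"before: {df.estimated_size('kb'):.1f} KB")
print(f"after:  {slim.estimated_size('kb'):.1f} KB")
```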
### 6. Parallel Operations
Structure code to maximize parallelization:
```python
# Bad: Sequential pipe operations disable parallelization
df = (
df.pipe(operation1)
.pipe(operation2)
.pipe(operation3)
)
# Good: Combined operations enable parallelization
df = df.with_columns(
result1=operation1_expr(),
result2=operation2_expr(),
result3=operation3_expr()
)
```
### 7. Rechunk After Concatenation
```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])
# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```
## Expression Patterns
### Conditional Logic
**Simple conditions:**
```python
df.with_columns(
status=pl.when(pl.col("age") >= 18)
.then("adult")
.otherwise("minor")
)
```
**Multiple conditions:**
```python
df.with_columns(
grade=pl.when(pl.col("score") >= 90)
.then("A")
.when(pl.col("score") >= 80)
.then("B")
.when(pl.col("score") >= 70)
.then("C")
.when(pl.col("score") >= 60)
.then("D")
.otherwise("F")
)
```
**Complex conditions:**
```python
df.with_columns(
category=pl.when(
(pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
)
.then("enterprise")
.when(
(pl.col("revenue") > 100000) | (pl.col("customers") > 50)
)
.then("business")
.otherwise("starter")
)
```
### Null Handling
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
# Constant value
df.with_columns(pl.col("value").fill_null(0))
# Forward fill
df.with_columns(pl.col("value").fill_null(strategy="forward"))
# Backward fill
df.with_columns(pl.col("value").fill_null(strategy="backward"))
# Mean
df.with_columns(pl.col("value").fill_null(strategy="mean"))
# Per-group fill
df.with_columns(
pl.col("value").fill_null(pl.col("value").mean()).over("group")
)
```
**Coalesce (first non-null):**
```python
df.with_columns(
combined=pl.coalesce(["col1", "col2", "col3"])
)
```
### Column Selection Patterns
**By name:**
```python
df.select("col1", "col2", "col3")
```
**By pattern:**
```python
# Regex
df.select(pl.col("^sales_.*$"))
# Starts with (pl.col regex patterns must be anchored with ^ and $)
df.select(pl.col("^sales.*$"))
# Ends with
df.select(pl.col("^.*_total$"))
# Contains
df.select(pl.col("^.*revenue.*$"))
```
**By type:**
```python
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All string columns
df.select(pl.col(pl.Utf8))
# Multiple types
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
```
**Exclude columns:**
```python
df.select(pl.all().exclude("id", "timestamp"))
```
**Transform multiple columns:**
```python
# Apply same operation to multiple columns
df.select(
pl.col("^sales_.*$") * 1.1 # 10% increase to all sales columns
)
```
### Aggregation Patterns
**Multiple aggregations:**
```python
df.group_by("category").agg(
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").std().alias("std_dev"),
pl.col("id").count().alias("count"),
pl.col("id").n_unique().alias("unique_count"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").quantile(0.5).alias("median"),
pl.col("value").quantile(0.95).alias("p95")
)
```
**Conditional aggregations:**
```python
df.group_by("category").agg(
# Count high values
(pl.col("value") > 100).sum().alias("high_count"),
# Average of filtered values
pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
# Conditional sum
pl.when(pl.col("status") == "completed")
.then(pl.col("amount"))
.otherwise(0)
.sum()
.alias("completed_total")
)
```
**Grouped transformations:**
```python
df.with_columns(
# Group statistics
group_mean=pl.col("value").mean().over("category"),
group_std=pl.col("value").std().over("category"),
# Rank within groups
rank=pl.col("value").rank().over("category"),
# Percentage of group total
pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
)
```
## Common Pitfalls and Anti-Patterns
### Pitfall 1: Row Iteration
```python
# Bad: Never iterate rows
for row in df.iter_rows():
    # Process row
    result = row[0] * 2
# Good: Use vectorized operations
df = df.with_columns(result=pl.col("value") * 2)
```
### Pitfall 2: Modifying in Place
```python
# Bad: Polars DataFrames don't support item assignment
df["new_col"] = df["old_col"] * 2  # Raises an error; use with_columns instead
# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)
```
### Pitfall 3: Not Using Expressions
```python
# Bad: String-based operations
df.select("value * 2") # Won't work
# Good: Expression-based
df.select(pl.col("value") * 2)
```
### Pitfall 4: Inefficient Joins
```python
# Bad: Join large tables without filtering
result = large_df1.join(large_df2, on="id")
# Good: Filter before joining
result = (
large_df1.filter(pl.col("active"))
.join(
large_df2.filter(pl.col("status") == "valid"),
on="id"
)
)
```
### Pitfall 5: Not Specifying Types
```python
# Bad: Let Polars infer everything
df = pl.read_csv("data.csv")
# Good: Specify types for correctness and performance
df = pl.read_csv(
"data.csv",
dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
)
```
### Pitfall 6: Creating Many Small DataFrames
```python
# Bad: Many operations creating intermediate DataFrames
df1 = df.filter(pl.col("age") > 25)
df2 = df1.select("name", "age")
df3 = df2.sort("age")
result = df3.head(10)
# Good: Chain operations
result = (
df.filter(pl.col("age") > 25)
.select("name", "age")
.sort("age")
.head(10)
)
# Better: Use lazy mode
result = (
df.lazy()
.filter(pl.col("age") > 25)
.select("name", "age")
.sort("age")
.head(10)
.collect()
)
```
## Memory Management
### Monitor Memory Usage
```python
# Check DataFrame size
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
# Profile memory during operations
lf = pl.scan_csv("large.csv")
print(lf.explain()) # See query plan
```
### Reduce Memory Footprint
```python
# 1. Use lazy mode
lf = pl.scan_parquet("data.parquet")
# 2. Stream results
result = lf.collect(streaming=True)
# 3. Select only needed columns
lf = lf.select("col1", "col2")
# 4. Optimize data types
df = df.with_columns(
pl.col("int_col").cast(pl.Int32), # Downcast if possible
pl.col("category").cast(pl.Categorical) # For low cardinality
)
# 5. Drop columns not needed
df = df.drop("large_text_col", "unused_col")
```
## Testing and Debugging
### Inspect Query Plans
```python
lf = pl.scan_csv("data.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
# View the optimized query plan
print(query.explain())
# View the unoptimized plan for comparison
print(query.explain(optimized=False))
```
### Sample Data for Development
```python
# Use n_rows for testing
df = pl.read_csv("large.csv", n_rows=1000)
# Or sample after reading
df_sample = df.sample(n=1000, seed=42)
```
### Validate Schemas
```python
# Check schema
print(df.schema)
# Ensure schema matches expectation
expected_schema = {
"id": pl.Int64,
"name": pl.Utf8,
"date": pl.Date
}
assert df.schema == expected_schema
```
### Profile Performance
```python
import time
# Time operations
start = time.time()
result = lf.collect()
print(f"Execution time: {time.time() - start:.2f}s")
# Compare eager vs lazy
start = time.time()
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
eager_time = time.time() - start
start = time.time()
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
lazy_time = time.time() - start
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
```
## File Format Best Practices
### Choose the Right Format
**Parquet:**
- Best for: Large datasets, archival, data lakes
- Pros: Excellent compression, columnar, fast reads
- Cons: Not human-readable
**CSV:**
- Best for: Small datasets, human inspection, legacy systems
- Pros: Universal, human-readable
- Cons: Slow, large file size, no type preservation
**Arrow IPC:**
- Best for: Inter-process communication, temporary storage
- Pros: Fastest, zero-copy, preserves all types
- Cons: Less compression than Parquet
### File Reading Best Practices
```python
# 1. Use lazy reading
lf = pl.scan_parquet("data.parquet") # Not read_parquet
# 2. Read multiple files efficiently
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
# 3. Specify schema when known
lf = pl.scan_csv(
"data.csv",
dtypes={"id": pl.Int64, "date": pl.Date}
)
# 4. Use predicate pushdown
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
```
### File Writing Best Practices
```python
# 1. Use Parquet for large data
df.write_parquet("output.parquet", compression="zstd")
# 2. Partition large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Use streaming for very large writes
lf.sink_parquet("output.parquet") # Streaming write
# 4. Optimize compression
df.write_parquet(
"output.parquet",
compression="snappy", # Fast compression
statistics=True # Enable predicate pushdown on read
)
```
## Code Organization
### Reusable Expressions
```python
# Define reusable expressions
age_group = (
pl.when(pl.col("age") < 18)
.then("minor")
.when(pl.col("age") < 65)
.then("adult")
.otherwise("senior")
)
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
# Use in multiple contexts
df = df.with_columns(
age_group=age_group,
rpc=revenue_per_customer
)
# Reuse in filtering
df = df.filter(revenue_per_customer > 100)
```
### Pipeline Functions
```python
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Clean and standardize data."""
    return lf.with_columns(
        pl.col("name").str.to_uppercase(),
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("amount").fill_null(0)
    )

def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Add computed features."""
    return lf.with_columns(
        month=pl.col("date").dt.month(),
        year=pl.col("date").dt.year(),
        amount_log=pl.col("amount").log()
    )
# Compose pipeline
result = (
pl.scan_csv("data.csv")
.pipe(clean_data)
.pipe(add_features)
.filter(pl.col("year") == 2023)
.collect()
)
```
## Documentation
Always document complex expressions and transformations:
```python
# Good: Document intent
df = df.with_columns(
# Calculate customer lifetime value as sum of purchases
# divided by months since first purchase
clv=(
pl.col("total_purchases") /
((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
.dt.total_days() / 30)
)
)
```
## Version Compatibility
```python
# Check Polars version
import polars as pl
print(pl.__version__)
# Feature availability varies by version
# Document version requirements for production code
```

skills/polars/references/core_concepts.md Normal file

@@ -0,0 +1,378 @@
# Polars Core Concepts
## Expressions
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
### What are Expressions?
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
- `select()` - Select and transform columns
- `with_columns()` - Add or modify columns
- `filter()` - Filter rows
- `group_by().agg()` - Aggregate data
### Expression Syntax
**Basic column reference:**
```python
pl.col("column_name")
```
**Computed expressions:**
```python
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")
# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)
```
### Expression Contexts
**Select context:**
```python
df.select(
"name", # Simple column name
pl.col("age"), # Expression
(pl.col("age") * 12).alias("age_in_months") # Computed expression
)
```
**With_columns context:**
```python
df.with_columns(
age_doubled=pl.col("age") * 2,
name_upper=pl.col("name").str.to_uppercase()
)
```
**Filter context:**
```python
df.filter(
pl.col("age") > 25,
pl.col("city").is_in(["NY", "LA", "SF"])
)
```
**Group_by context:**
```python
df.group_by("department").agg(
pl.col("salary").mean(),
pl.col("employee_id").count()
)
```
### Expression Expansion
Apply operations to multiple columns at once:
**All columns:**
```python
df.select(pl.all() * 2)
```
**Pattern matching:**
```python
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES) + 1)
```
**Exclude patterns:**
```python
df.select(pl.all().exclude("id", "name"))
```
### Expression Composition
Expressions can be stored and reused:
```python
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()
# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)
```
## Data Types
Polars has a strict type system based on Apache Arrow. The main types are listed below, followed by a short construction sketch.
### Core Data Types
**Numeric:**
- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
- `Float32`, `Float64` - Floating point numbers
**Text:**
- `Utf8` / `String` - UTF-8 encoded strings
- `Categorical` - Categorized strings (low cardinality)
- `Enum` - Fixed set of string values
**Temporal:**
- `Date` - Calendar date (no time)
- `Datetime` - Date and time with optional timezone
- `Time` - Time of day
- `Duration` - Time duration/difference
**Boolean:**
- `Boolean` - True/False values
**Nested:**
- `List` - Variable-length lists
- `Array` - Fixed-length arrays
- `Struct` - Nested record structures
**Other:**
- `Binary` - Binary data
- `Object` - Python objects (avoid in production)
- `Null` - Null type
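A small construction sketch with an explicit schema covering a few of these types; the data and column names are hypothetical:
```python
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["a", "b", "c"],
        "signup": [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)],
        "active": [True, False, True],
        "tags": [["x"], ["x", "y"], []],
    },
    schema={
        "id": pl.UInt32,
        "name": pl.Utf8,
        "signup": pl.Date,
        "active": pl.Boolean,
        "tags": pl.List(pl.Utf8),
    },
)
print(df.schema)
```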
### Type Casting
Convert between types explicitly:
```python
# Cast to different type
df.select(
pl.col("age").cast(pl.Float64),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("id").cast(pl.Utf8)
)
```
### Null Handling
Polars uses consistent null handling across all types:
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")
```
**Drop nulls:**
```python
df.drop_nulls() # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
```
### Categorical Data
Use categorical types for string columns with low cardinality (repeated values):
```python
# Cast to categorical
df.with_columns(
pl.col("category").cast(pl.Categorical)
)
# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Maintains order information
```
## Lazy vs Eager Evaluation
Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).
### Eager Evaluation (DataFrame)
Operations execute immediately:
```python
import polars as pl
# DataFrame operations execute right away
df = pl.read_csv("data.csv") # Reads file immediately
result = df.filter(pl.col("age") > 25) # Filters immediately
final = result.select("name", "age") # Selects immediately
```
**When to use eager:**
- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback needed
### Lazy Evaluation (LazyFrame)
Operations build a query plan, optimized before execution:
```python
import polars as pl
# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv") # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25) # Adds to plan
lf3 = lf2.select("name", "age") # Adds to plan
df = lf3.collect() # NOW executes optimized plan
```
**When to use lazy:**
- Large datasets
- Complex query pipelines
- Only need subset of data
- Performance is critical
- Streaming required
### Query Optimization
Polars automatically optimizes lazy queries:
**Predicate Pushdown:**
Filter operations pushed to data source when possible:
```python
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
```
**Projection Pushdown:**
Only read needed columns from data source:
```python
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
```
**Query Plan Inspection:**
```python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain()) # Shows optimized plan
```
### Streaming Mode
Process data larger than memory:
```python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
```
**Streaming benefits:**
- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management
**Streaming limitations:**
- Not all operations support streaming
- May be slower for small data
- Some operations require materializing entire dataset
### Converting Between Eager and Lazy
**Eager to Lazy:**
```python
df = pl.read_csv("data.csv")
lf = df.lazy() # Convert to LazyFrame
```
**Lazy to Eager:**
```python
lf = pl.scan_csv("data.csv")
df = lf.collect() # Execute and return DataFrame
```
## Memory Format
Polars uses the Apache Arrow columnar memory format (a zero-copy round trip is sketched after the lists):
**Benefits:**
- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization
**Implications:**
- Data stored column-wise, not row-wise
- Column operations very fast
- Random row access slower than pandas
- Best for analytical workloads
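A minimal sketch of the zero-copy hand-off to and from Arrow:
```python
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"x": [1, 2, 3]})

# Hand the columns to any Arrow-compatible library without copying
arrow_table = df.to_arrow()
assert isinstance(arrow_table, pa.Table)

# And rebuild a DataFrame from an Arrow table
df_again = pl.from_arrow(arrow_table)
print(df_again.shape)
```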
## Parallelization
Polars parallelizes operations automatically using Rust's concurrency:
**What gets parallelized:**
- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations
**What to avoid for parallelization:**
- Python user-defined functions (UDFs)
- Lambda functions in `.map_elements()`
- Sequential `.pipe()` chains
**Best practice:**
```python
# Good: Stays in expression API (parallelized)
df.with_columns(
pl.col("value") * 10,
pl.col("value").log(),
pl.col("value").sqrt()
)
# Bad: Uses Python function (sequential)
df.with_columns(
pl.col("value").map_elements(lambda x: x * 10)
)
```
## Strict Type System
Polars enforces strict typing:
**No silent conversions:**
```python
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")
# Must cast explicitly
df.with_columns(
pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
```
**Benefits:**
- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent
**Integer nulls:**
Unlike pandas, integer columns can have nulls without converting to float:
```python
# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)
```

skills/polars/references/io_guide.md Normal file

@@ -0,0 +1,557 @@
# Polars Data I/O Guide
Comprehensive guide to reading and writing data in various formats with Polars.
## CSV Files
### Reading CSV
**Eager mode (loads into memory):**
```python
import polars as pl
# Basic read
df = pl.read_csv("data.csv")
# With options
df = pl.read_csv(
"data.csv",
separator=",",
has_header=True,
columns=["col1", "col2"], # Select specific columns
n_rows=1000, # Read only first 1000 rows
skip_rows=10, # Skip first 10 rows
dtypes={"col1": pl.Int64, "col2": pl.Utf8}, # Specify types
null_values=["NA", "null", ""], # Define null values
encoding="utf-8",
ignore_errors=False
)
```
**Lazy mode (scans without loading - recommended for large files):**
```python
# Scan CSV (builds query plan)
lf = pl.scan_csv("data.csv")
# Apply operations
result = lf.filter(pl.col("age") > 25).select("name", "age")
# Execute and load
df = result.collect()
```
### Writing CSV
```python
# Basic write
df.write_csv("output.csv")
# With options
df.write_csv(
"output.csv",
separator=",",
include_header=True,
null_value="", # How to represent nulls
quote_char='"',
line_terminator="\n"
)
```
### Multiple CSV Files
**Read multiple files:**
```python
# Read all CSVs in directory
lf = pl.scan_csv("data/*.csv")
# Read specific files
lf = pl.scan_csv(["file1.csv", "file2.csv", "file3.csv"])
```
## Parquet Files
Parquet is the recommended format for performance and compression.
### Reading Parquet
**Eager:**
```python
df = pl.read_parquet("data.parquet")
# With options
df = pl.read_parquet(
"data.parquet",
columns=["col1", "col2"], # Select specific columns
n_rows=1000, # Read first N rows
parallel="auto" # Control parallelization
)
```
**Lazy (recommended):**
```python
lf = pl.scan_parquet("data.parquet")
# Automatic predicate and projection pushdown
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Writing Parquet
```python
# Basic write
df.write_parquet("output.parquet")
# With compression
df.write_parquet(
"output.parquet",
compression="snappy", # Options: "snappy", "gzip", "brotli", "lz4", "zstd"
statistics=True, # Write statistics (enables predicate pushdown)
use_pyarrow=False # Use Rust writer (faster)
)
```
### Partitioned Parquet (Hive-style)
**Write partitioned:**
```python
# Write with partitioning
df.write_parquet(
"output_dir",
partition_by=["year", "month"] # Creates directory structure
)
# Creates: output_dir/year=2023/month=01/data.parquet
```
**Read partitioned:**
```python
lf = pl.scan_parquet("output_dir/**/*.parquet")
# Hive partitioning columns are automatically added
result = lf.filter(pl.col("year") == 2023).collect()
```
## JSON Files
### Reading JSON
**NDJSON (newline-delimited JSON) - recommended:**
```python
df = pl.read_ndjson("data.ndjson")
# Lazy
lf = pl.scan_ndjson("data.ndjson")
```
**Standard JSON:**
```python
df = pl.read_json("data.json")
# From a JSON string (read_json expects a path, file-like object, or bytes)
df = pl.read_json(b'{"col1": [1, 2], "col2": ["a", "b"]}')
```
### Writing JSON
```python
# Write NDJSON
df.write_ndjson("output.ndjson")
# Write standard JSON
df.write_json("output.json")
# Pretty printed
df.write_json("output.json", pretty=True, row_oriented=False)
```
## Excel Files
### Reading Excel
```python
# Read first sheet
df = pl.read_excel("data.xlsx")
# Specific sheet
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
# Or by 1-based index (sheet_id=0 loads all sheets)
df = pl.read_excel("data.xlsx", sheet_id=1)
# With options
df = pl.read_excel(
"data.xlsx",
sheet_name="Sheet1",
columns=["A", "B", "C"], # Excel columns
n_rows=100,
skip_rows=5,
has_header=True
)
```
### Writing Excel
```python
# Write to Excel
df.write_excel("output.xlsx")
# Multiple sheets via an xlsxwriter Workbook
import xlsxwriter
with xlsxwriter.Workbook("output.xlsx") as workbook:
    df1.write_excel(workbook, worksheet="Sheet1")
    df2.write_excel(workbook, worksheet="Sheet2")
```
## Database Connectivity
### Read from Database
```python
import polars as pl
# Read entire table
df = pl.read_database("SELECT * FROM users", connection_uri="postgresql://...")
# Using connectorx for better performance
df = pl.read_database_uri(
"SELECT * FROM users WHERE age > 25",
uri="postgresql://user:pass@localhost/db"
)
```
### Write to Database
```python
# Using SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/db")
df.write_database("table_name", connection=engine)
# With options
df.write_database(
"table_name",
connection=engine,
if_exists="replace", # or "append", "fail"
)
```
### Common Database Connectors
**PostgreSQL:**
```python
uri = "postgresql://username:password@localhost:5432/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
**MySQL:**
```python
uri = "mysql://username:password@localhost:3306/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
**SQLite:**
```python
uri = "sqlite:///path/to/database.db"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
## Cloud Storage
### AWS S3
```python
# Read from S3
df = pl.read_parquet("s3://bucket/path/to/file.parquet")
lf = pl.scan_parquet("s3://bucket/path/*.parquet")
# Write to S3
df.write_parquet("s3://bucket/path/output.parquet")
# With credentials
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
os.environ["AWS_REGION"] = "us-west-2"
df = pl.read_parquet("s3://bucket/file.parquet")
```
### Azure Blob Storage
```python
# Read from Azure
df = pl.read_parquet("az://container/path/file.parquet")
# Write to Azure
df.write_parquet("az://container/path/output.parquet")
# With credentials
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "account"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "key"
```
### Google Cloud Storage (GCS)
```python
# Read from GCS
df = pl.read_parquet("gs://bucket/path/file.parquet")
# Write to GCS
df.write_parquet("gs://bucket/path/output.parquet")
# With credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"
```
## Google BigQuery
```python
# Read from BigQuery
df = pl.read_database(
"SELECT * FROM project.dataset.table",
connection_uri="bigquery://project"
)
# Or using Google Cloud SDK
from google.cloud import bigquery
client = bigquery.Client()
query = "SELECT * FROM project.dataset.table WHERE date > '2023-01-01'"
df = pl.from_pandas(client.query(query).to_dataframe())
```
## Apache Arrow
### IPC/Feather Format
**Read:**
```python
df = pl.read_ipc("data.arrow")
lf = pl.scan_ipc("data.arrow")
```
**Write:**
```python
df.write_ipc("output.arrow")
# Compressed
df.write_ipc("output.arrow", compression="zstd")
```
### Arrow Streaming
```python
# Write the Arrow streaming format
df.write_ipc_stream("output.arrows", compression="zstd")
# Read streaming
df = pl.read_ipc_stream("output.arrows")
```
### From/To Arrow
```python
import pyarrow as pa
# From Arrow Table
arrow_table = pa.table({"col": [1, 2, 3]})
df = pl.from_arrow(arrow_table)
# To Arrow Table
arrow_table = df.to_arrow()
```
## In-Memory Formats
### Python Dictionaries
```python
# From dict
df = pl.DataFrame({
"col1": [1, 2, 3],
"col2": ["a", "b", "c"]
})
# To dict
data_dict = df.to_dict() # Column-oriented
data_dict = df.to_dict(as_series=False) # Lists instead of Series
```
### NumPy Arrays
```python
import numpy as np
# From NumPy
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pl.DataFrame(arr, schema=["col1", "col2"])
# To NumPy
arr = df.to_numpy()
```
### Pandas DataFrames
```python
import pandas as pd
# From Pandas
pd_df = pd.DataFrame({"col": [1, 2, 3]})
pl_df = pl.from_pandas(pd_df)
# To Pandas
pd_df = pl_df.to_pandas()
# Conversion goes through Arrow; zero-copy when possible
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
```
### Lists of Rows
```python
# From list of dicts
data = [
{"name": "Alice", "age": 25},
{"name": "Bob", "age": 30}
]
df = pl.DataFrame(data)
# To list of dicts
rows = df.to_dicts()
# From list of tuples
data = [("Alice", 25), ("Bob", 30)]
df = pl.DataFrame(data, schema=["name", "age"])
```
## Streaming Large Files
For datasets larger than memory, use lazy mode with streaming:
```python
# Streaming mode
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Streaming with multiple files
lf = pl.scan_parquet("data/*.parquet")
result = lf.group_by("category").agg(pl.col("value").sum()).collect(streaming=True)
```
## Best Practices
### Format Selection
**Use Parquet when:**
- Need compression (up to 10x smaller than CSV)
- Want fast reads/writes
- Need to preserve data types
- Working with large datasets
- Need predicate pushdown
**Use CSV when:**
- Need human-readable format
- Interfacing with legacy systems
- Data is small
- Need universal compatibility
**Use JSON when:**
- Working with nested/hierarchical data
- Need web API compatibility
- Data has flexible schema
**Use Arrow IPC when:**
- Need zero-copy data sharing
- Fastest serialization required
- Working between Arrow-compatible systems (the matching write calls are sketched below)
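As a rough sketch, the same frame written in each format; the file names are illustrative:
```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

df.write_parquet("data.parquet", compression="zstd")  # compact, typed, fast scans
df.write_csv("data.csv")                              # human-readable interchange
df.write_ndjson("data.ndjson")                        # line-delimited JSON records
df.write_ipc("data.arrow", compression="zstd")        # Arrow IPC, fastest round-trip
```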
### Reading Large Files
```python
# 1. Always use lazy mode
lf = pl.scan_csv("large.csv") # NOT read_csv
# 2. Filter and select early (pushdown optimization)
result = (
lf
.select("col1", "col2", "col3") # Only needed columns
.filter(pl.col("date") > "2023-01-01") # Filter early
.collect()
)
# 3. Use streaming for very large data
result = lf.filter(...).select(...).collect(streaming=True)
# 4. Read only needed rows during development
df = pl.read_csv("large.csv", n_rows=10000) # Sample for testing
```
### Writing Large Files
```python
# 1. Use Parquet with compression
df.write_parquet("output.parquet", compression="zstd")
# 2. Use partitioning for very large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Write streaming
lf = pl.scan_csv("input.csv")
lf.sink_parquet("output.parquet") # Streaming write
```
### Performance Tips
```python
# 1. Specify dtypes when reading CSV
df = pl.read_csv(
"data.csv",
dtypes={"id": pl.Int64, "name": pl.Utf8} # Avoids inference
)
# 2. Use appropriate compression
df.write_parquet("output.parquet", compression="snappy") # Fast
df.write_parquet("output.parquet", compression="zstd") # Better compression
# 3. CSV parsing is multi-threaded by default; cap it with n_threads if needed
df = pl.read_csv("data.csv", n_threads=4)
# 4. Read multiple files in parallel
lf = pl.scan_parquet("data/*.parquet") # Automatic parallel read
```
## Error Handling
```python
try:
    df = pl.read_csv("data.csv")
except pl.exceptions.ComputeError as e:
    print(f"Error reading CSV: {e}")
# Ignore errors during parsing
df = pl.read_csv("messy.csv", ignore_errors=True)
# Handle missing files
from pathlib import Path
if Path("data.csv").exists():
    df = pl.read_csv("data.csv")
else:
    print("File not found")
```
## Schema Management
```python
# Infer schema from sample
schema = pl.read_csv("data.csv", n_rows=1000).schema
# Use inferred schema for full read
df = pl.read_csv("data.csv", dtypes=schema)
# Define schema explicitly
schema = {
"id": pl.Int64,
"name": pl.Utf8,
"date": pl.Date,
"value": pl.Float64
}
df = pl.read_csv("data.csv", dtypes=schema)
```

skills/polars/references/operations.md Normal file

@@ -0,0 +1,602 @@
# Polars Operations Reference
This reference covers all common Polars operations with comprehensive examples.
## Selection Operations
### Select Columns
**Basic selection:**
```python
# Select specific columns
df.select("name", "age", "city")
# Using expressions
df.select(pl.col("name"), pl.col("age"))
```
**Pattern-based selection:**
```python
# All columns starting with "sales_"
df.select(pl.col("^sales_.*$"))
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All columns except specific ones
df.select(pl.all().exclude("id", "timestamp"))
```
**Computed columns:**
```python
df.select(
"name",
(pl.col("age") * 12).alias("age_in_months"),
(pl.col("salary") * 1.1).alias("salary_after_raise")
)
```
### With Columns (Add/Modify)
Add new columns or modify existing ones while preserving all other columns:
```python
# Add new columns
df.with_columns(
age_doubled=pl.col("age") * 2,
full_name=pl.col("first_name") + " " + pl.col("last_name")
)
# Modify existing columns
df.with_columns(
pl.col("name").str.to_uppercase().alias("name"),
pl.col("salary").cast(pl.Float64).alias("salary")
)
# Multiple operations in parallel
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
pl.col("value") * 1000,
)
```
## Filtering Operations
### Basic Filtering
```python
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (AND)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# OR conditions
df.filter(
(pl.col("age") > 30) | (pl.col("salary") > 100000)
)
# NOT condition
df.filter(~pl.col("active"))
df.filter(pl.col("city") != "NY")
```
### Advanced Filtering
**String operations:**
```python
# Contains substring
df.filter(pl.col("name").str.contains("John"))
# Starts with
df.filter(pl.col("email").str.starts_with("admin"))
# Regex match
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{3}-\d{4}$"))
```
**Membership checks:**
```python
# In list
df.filter(pl.col("city").is_in(["NY", "LA", "SF"]))
# Not in list
df.filter(~pl.col("status").is_in(["inactive", "deleted"]))
```
**Range filters:**
```python
# Between values
df.filter(pl.col("age").is_between(25, 35))
# Date range
df.filter(
pl.col("date") >= pl.date(2023, 1, 1),
pl.col("date") <= pl.date(2023, 12, 31)
)
```
**Null filtering:**
```python
# Filter out nulls
df.filter(pl.col("value").is_not_null())
# Keep only nulls
df.filter(pl.col("value").is_null())
```
## Grouping and Aggregation
### Basic Group By
```python
# Group by single column
df.group_by("department").agg(
pl.col("salary").mean().alias("avg_salary"),
pl.len().alias("employee_count")
)
# Group by multiple columns
df.group_by("department", "location").agg(
pl.col("salary").sum()
)
# Maintain order
df.group_by("category", maintain_order=True).agg(
pl.col("value").sum()
)
```
### Aggregation Functions
**Count and length:**
```python
df.group_by("category").agg(
pl.len().alias("count"),
pl.col("id").count().alias("non_null_count"),
pl.col("id").n_unique().alias("unique_count")
)
```
**Statistical aggregations:**
```python
df.group_by("group").agg(
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").median().alias("median"),
pl.col("value").std().alias("std_dev"),
pl.col("value").var().alias("variance"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").quantile(0.95).alias("p95")
)
```
**First and last:**
```python
df.group_by("user_id").agg(
pl.col("timestamp").first().alias("first_seen"),
pl.col("timestamp").last().alias("last_seen"),
pl.col("event").first().alias("first_event")
)
```
**List aggregation:**
```python
# Collect values into lists
df.group_by("category").agg(
pl.col("item").alias("all_items") # Creates list column
)
```
### Conditional Aggregations
Filter within aggregations:
```python
df.group_by("department").agg(
# Count high earners
(pl.col("salary") > 100000).sum().alias("high_earners"),
# Average of filtered values
pl.col("salary").filter(pl.col("bonus") > 0).mean().alias("avg_with_bonus"),
# Conditional sum
pl.when(pl.col("active"))
.then(pl.col("sales"))
.otherwise(0)
.sum()
.alias("active_sales")
)
```
### Multiple Aggregations
Combine multiple aggregations efficiently:
```python
df.group_by("store_id").agg(
pl.col("transaction_id").count().alias("num_transactions"),
pl.col("amount").sum().alias("total_sales"),
pl.col("amount").mean().alias("avg_transaction"),
pl.col("customer_id").n_unique().alias("unique_customers"),
pl.col("amount").max().alias("largest_transaction"),
pl.col("timestamp").min().alias("first_transaction_date"),
pl.col("timestamp").max().alias("last_transaction_date")
)
```
## Window Functions
Window functions apply aggregations while preserving the original row count.
### Basic Window Operations
**Group statistics:**
```python
# Add group mean to each row
df.with_columns(
avg_age_by_dept=pl.col("age").mean().over("department")
)
# Multiple group columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
```
**Ranking:**
```python
df.with_columns(
# Rank within groups
rank=pl.col("score").rank().over("team"),
# Dense rank (no gaps)
dense_rank=pl.col("score").rank(method="dense").over("team"),
# Row number ordered by timestamp
row_num=pl.col("timestamp").rank(method="ordinal").over("user_id")
)
```
### Window Mapping Strategies
**group_to_rows (default):**
Preserves original row order:
```python
df.with_columns(
group_mean=pl.col("value").mean().over("category", mapping_strategy="group_to_rows")
)
```
**explode:**
Faster, groups rows together:
```python
df.with_columns(
group_mean=pl.col("value").mean().over("category", mapping_strategy="explode")
)
```
**join:**
Creates list columns:
```python
df.with_columns(
group_values=pl.col("value").over("category", mapping_strategy="join")
)
```
### Rolling Windows
**Time-based rolling:**
```python
df.with_columns(
rolling_avg=pl.col("value").rolling_mean(
window_size="7d",
by="date"
)
)
```
**Row-based rolling:**
```python
df.with_columns(
rolling_sum=pl.col("value").rolling_sum(window_size=3),
rolling_max=pl.col("value").rolling_max(window_size=5)
)
```
### Cumulative Operations
```python
df.with_columns(
cumsum=pl.col("value").cum_sum().over("group"),
cummax=pl.col("value").cum_max().over("group"),
cummin=pl.col("value").cum_min().over("group"),
cumprod=pl.col("value").cum_prod().over("group")
)
```
### Shift and Lag/Lead
```python
df.with_columns(
# Previous value (lag)
prev_value=pl.col("value").shift(1).over("user_id"),
# Next value (lead)
next_value=pl.col("value").shift(-1).over("user_id"),
# Calculate difference from previous
diff=pl.col("value") - pl.col("value").shift(1).over("user_id")
)
```
## Sorting
### Basic Sorting
```python
# Sort by single column
df.sort("age")
# Sort descending
df.sort("age", descending=True)
# Sort by multiple columns
df.sort("department", "age")
# Mixed sorting order
df.sort(["department", "salary"], descending=[False, True])
```
### Advanced Sorting
**Null handling:**
```python
# Nulls first
df.sort("value", nulls_last=False)
# Nulls last
df.sort("value", nulls_last=True)
```
**Sort by expression:**
```python
# Sort by computed value
df.sort(pl.col("first_name").str.len())
# Sort by multiple expressions
df.sort(
pl.col("last_name").str.to_lowercase(),
pl.col("age").abs()
)
```
## Conditional Operations
### When/Then/Otherwise
```python
# Basic conditional
df.with_columns(
status=pl.when(pl.col("age") >= 18)
.then("adult")
.otherwise("minor")
)
# Multiple conditions
df.with_columns(
category=pl.when(pl.col("score") >= 90)
.then("A")
.when(pl.col("score") >= 80)
.then("B")
.when(pl.col("score") >= 70)
.then("C")
.otherwise("F")
)
# Conditional computation
df.with_columns(
adjusted_price=pl.when(pl.col("is_member"))
.then(pl.col("price") * 0.9)
.otherwise(pl.col("price"))
)
```
## String Operations
### Common String Methods
```python
df.with_columns(
# Case conversion
upper=pl.col("name").str.to_uppercase(),
lower=pl.col("name").str.to_lowercase(),
title=pl.col("name").str.to_titlecase(),
# Trimming
trimmed=pl.col("text").str.strip_chars(),
# Substring
first_3=pl.col("name").str.slice(0, 3),
# Replace
cleaned=pl.col("text").str.replace("old", "new"),
cleaned_all=pl.col("text").str.replace_all("old", "new"),
# Split
parts=pl.col("full_name").str.split(" "),
# Length
name_length=pl.col("name").str.len_chars()
)
```
### String Filtering
```python
# Contains
df.filter(pl.col("email").str.contains("@gmail.com"))
# Starts/ends with
df.filter(pl.col("name").str.starts_with("A"))
df.filter(pl.col("file").str.ends_with(".csv"))
# Regex matching
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{4}$"))
```
## Date and Time Operations
### Date Parsing
```python
# Parse strings to dates
df.with_columns(
date=pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d"),
datetime=pl.col("dt_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
)
```
### Date Components
```python
df.with_columns(
year=pl.col("date").dt.year(),
month=pl.col("date").dt.month(),
day=pl.col("date").dt.day(),
weekday=pl.col("date").dt.weekday(),
hour=pl.col("datetime").dt.hour(),
minute=pl.col("datetime").dt.minute()
)
```
### Date Arithmetic
```python
# Add duration
df.with_columns(
next_week=pl.col("date") + pl.duration(weeks=1),
next_month=pl.col("date") + pl.duration(months=1)
)
# Difference between dates
df.with_columns(
days_diff=(pl.col("end_date") - pl.col("start_date")).dt.total_days()
)
```
### Date Filtering
```python
# Filter by date range
df.filter(
pl.col("date").is_between(pl.date(2023, 1, 1), pl.date(2023, 12, 31))
)
# Filter by year
df.filter(pl.col("date").dt.year() == 2023)
# Filter by month
df.filter(pl.col("date").dt.month().is_in([6, 7, 8])) # Summer months
```
## List Operations
### Working with List Columns
```python
# Create list column
df.with_columns(
items_list=pl.col("item1", "item2", "item3").to_list()
)
# List operations
df.with_columns(
list_len=pl.col("items").list.len(),
first_item=pl.col("items").list.first(),
last_item=pl.col("items").list.last(),
unique_items=pl.col("items").list.unique(),
sorted_items=pl.col("items").list.sort()
)
# Explode lists to rows
df.explode("items")
# Filter list elements
df.with_columns(
filtered=pl.col("items").list.eval(pl.element() > 10)
)
```
## Struct Operations
### Working with Nested Structures
```python
# Create struct column
df.with_columns(
address=pl.struct(["street", "city", "zip"])
)
# Access struct fields
df.with_columns(
city=pl.col("address").struct.field("city")
)
# Unnest struct to columns
df.unnest("address")
```
## Unique and Duplicate Operations
```python
# Get unique rows
df.unique()
# Unique on specific columns
df.unique(subset=["name", "email"])
# Keep first/last duplicate
df.unique(subset=["id"], keep="first")
df.unique(subset=["id"], keep="last")
# Identify duplicates
df.with_columns(
is_duplicate=pl.col("id").is_duplicated()
)
# Count duplicates
df.group_by("email").agg(
pl.len().alias("count")
).filter(pl.col("count") > 1)
```
## Sampling
```python
# Random sample
df.sample(n=100)
# Sample fraction
df.sample(fraction=0.1)
# Sample with seed for reproducibility
df.sample(n=100, seed=42)
```
## Column Renaming
```python
# Rename specific columns
df.rename({"old_name": "new_name", "age": "years"})
# Rename with expression
df.select(pl.col("*").name.suffix("_renamed"))
df.select(pl.col("*").name.prefix("data_"))
df.select(pl.col("*").name.to_uppercase())
```

skills/polars/references/pandas_migration.md Normal file

@@ -0,0 +1,417 @@
# Pandas to Polars Migration Guide
This guide helps you migrate from pandas to Polars with comprehensive operation mappings and key differences.
## Core Conceptual Differences
### 1. No Index System
**Pandas:** Uses row-based indexing system
```python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")
```
**Polars:** Uses integer positions only
```python
df[0, "column"] # Row position, column name
df[0:5] # Row slice
# No set_index equivalent - use group_by instead
```
### 2. Memory Format
**Pandas:** Row-oriented NumPy arrays
**Polars:** Columnar Apache Arrow format
**Implications:**
- Polars is faster for column operations
- Polars uses less memory
- Polars has better data sharing capabilities
### 3. Parallelization
**Pandas:** Primarily single-threaded (requires Dask for parallelism)
**Polars:** Parallel by default using Rust's concurrency
### 4. Lazy Evaluation
**Pandas:** Only eager evaluation
**Polars:** Both eager (DataFrame) and lazy (LazyFrame) with query optimization; a minimal lazy pipeline is sketched below
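A minimal side-by-side sketch, assuming a hypothetical `data.csv` with `name` and `age` columns:
```python
import pandas as pd
import polars as pl

# Pandas: each step executes eagerly, materializing intermediates
pdf = pd.read_csv("data.csv")
p_result = pdf[pdf["age"] > 25][["name", "age"]]

# Polars: the lazy query is optimized as a whole before running
result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .collect()
)
```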
### 5. Type Strictness
**Pandas:** Allows silent type conversions
**Polars:** Strict typing, explicit casts required
**Example:**
```python
# Pandas: Silently converts to float
pd_df["int_col"] = [1, 2, None, 4] # dtype: float64
# Polars: Keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]}) # dtype: Int64
```
## Operation Mappings
### Data Selection
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
### Data Filtering
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |
### Adding/Modifying Columns
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |
**Important difference - Parallel execution:**
```python
# Pandas: Sequential (lambda sees previous results)
df.assign(
a=lambda df_: df_.value * 10,
b=lambda df_: df_.value * 100
)
# Polars: Parallel (all computed together)
df.with_columns(
a=pl.col("value") * 10,
b=pl.col("value") * 100
)
```
### Grouping and Aggregation
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |
### Window Functions
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |
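A short worked example of the Polars side of these window mappings (column names are illustrative):
```python
df = pl.DataFrame({
    "cat": ["a", "a", "b", "b"],
    "val": [1, 2, 3, 4],
})

df.with_columns(
    group_mean=pl.col("val").mean().over("cat"),
    rank_in_group=pl.col("val").rank().over("cat"),
    running_total=pl.col("val").cum_sum().over("cat"),
)
```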
### Joins
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |
### Concatenation
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |
### Sorting
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |
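Polars also accepts one direction per sort key; a small sketch (column names are illustrative):
```python
# Mixed directions: city ascending, age descending
df.sort(["city", "age"], descending=[False, True])

# Push nulls to the end of the ordering
df.sort("age", nulls_last=True)
```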
### Reshaping
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(values="c", index="a", columns="b")` |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |
### I/O Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |
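For large files, the lazy `scan_*` readers pair with `sink_*` writers to convert formats without materializing the data in memory; a sketch with illustrative paths:
```python
lf = pl.scan_csv("large_file.csv")
lf.sink_parquet("large_file.parquet")
```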
### String Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |
### Datetime Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |
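A short worked example chaining the parse and the component accessors (values are illustrative):
```python
df = pl.DataFrame({"col": ["2023-01-15", "2023-02-20"]})

df.with_columns(
    date=pl.col("col").str.strptime(pl.Date, "%Y-%m-%d")
).with_columns(
    year=pl.col("date").dt.year(),
    month=pl.col("date").dt.month(),
    day=pl.col("date").dt.day(),
)
```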
### Missing Data
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` | `df.select(pl.col("col").fill_null(strategy="forward"))` |
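`fill_null` accepts a literal, a strategy, or an expression; a small sketch:
```python
df = pl.DataFrame({"value": [1.0, None, 3.0, None]})

df.with_columns(
    filled_zero=pl.col("value").fill_null(0),
    filled_forward=pl.col("value").fill_null(strategy="forward"),
    filled_median=pl.col("value").fill_null(pl.col("value").median()),
)
```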
### Other Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |
## Common Migration Patterns
### Pattern 1: Chained Operations
**Pandas:**
```python
result = (df
.assign(new_col=lambda x: x["old_col"] * 2)
.query("new_col > 10")
.groupby("category")
.agg({"value": "sum"})
.reset_index()
)
```
**Polars:**
```python
result = (df
.with_columns(new_col=pl.col("old_col") * 2)
.filter(pl.col("new_col") > 10)
.group_by("category")
.agg(pl.col("value").sum())
)
# No reset_index needed - Polars doesn't have index
```
### Pattern 2: Apply Functions
**Pandas:**
```python
# Row-wise apply: avoid translating this literally into Polars (Python UDFs break parallelization)
df["result"] = df["value"].apply(lambda x: x * 2)
```
**Polars:**
```python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)
# If custom function needed
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
```
### Pattern 3: Conditional Column Creation
**Pandas:**
```python
df["category"] = np.where(
df["value"] > 100,
"high",
np.where(df["value"] > 50, "medium", "low")
)
```
**Polars:**
```python
df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
    .then(pl.lit("high"))
    .when(pl.col("value") > 50)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
)
# Use pl.lit() for string literals: bare strings in then()/otherwise() are parsed as column names
```
### Pattern 4: Group Transform
**Pandas:**
```python
df["group_mean"] = df.groupby("category")["value"].transform("mean")
```
**Polars:**
```python
df = df.with_columns(
group_mean=pl.col("value").mean().over("category")
)
```
### Pattern 5: Multiple Aggregations
**Pandas:**
```python
result = df.groupby("category").agg({
"value": ["mean", "sum", "count"],
"price": ["min", "max"]
})
```
**Polars:**
```python
result = df.group_by("category").agg(
pl.col("value").mean().alias("value_mean"),
pl.col("value").sum().alias("value_sum"),
pl.col("value").count().alias("value_count"),
pl.col("price").min().alias("price_min"),
pl.col("price").max().alias("price_max")
)
```
## Performance Anti-Patterns to Avoid
### Anti-Pattern 1: Sequential Pipe Operations
**Bad (disables parallelization):**
```python
# Each pipe is a separate pass over the data; its expressions cannot be fused
df = (
    df.pipe(lambda d: d.with_columns(a=pl.col("value") * 10))
      .pipe(lambda d: d.with_columns(b=pl.col("value") * 100))
)
```
**Good (enables parallelization):**
```python
# One context: both expressions are computed in parallel
df = df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100,
)
```
### Anti-Pattern 2: Python Functions in Hot Paths
**Bad:**
```python
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2)
)
```
**Good:**
```python
df = df.with_columns(result=pl.col("value") * 2)
```
### Anti-Pattern 3: Using Eager Reading for Large Files
**Bad:**
```python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
```
**Good:**
```python
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Anti-Pattern 4: Row Iteration
**Bad:**
```python
for row in df.iter_rows():
# Process row
pass
```
**Good:**
```python
# Use vectorized expressions instead of a Python-level loop
df = df.with_columns(
    doubled=pl.col("value") * 2
)
```
## Migration Checklist
When migrating from pandas to Polars:
1. **Remove index operations** - Use integer positions or group_by
2. **Replace apply/map with expressions** - Use Polars native operations
3. **Update column assignment** - Use `with_columns()` instead of direct assignment
4. **Change groupby.transform to .over()** - Window functions work differently
5. **Update string operations** - Use `.str.to_uppercase()` instead of `.str.upper()`
6. **Add explicit type casts** - Polars won't silently convert types
7. **Consider lazy evaluation** - Use `scan_*` instead of `read_*` for large data
8. **Update aggregation syntax** - More explicit in Polars
9. **Remove reset_index calls** - Not needed in Polars
10. **Update conditional logic** - Use `when().then().otherwise()` pattern
## Compatibility Layer
For gradual migration, you can use both libraries:
```python
import pandas as pd
import polars as pl
# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)
# Convert Polars to pandas
pd_df = pl_df.to_pandas()
# Use Arrow for (near) zero-copy interchange; from_arrow expects Arrow data, not a pandas DataFrame
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()
```
## When to Stick with Pandas
Consider staying with pandas when:
- Working with time series requiring complex index operations
- Need extensive ecosystem support (some libraries only support pandas)
- Team lacks Rust/Polars expertise
- Data is small and performance isn't critical
- Using advanced pandas features without Polars equivalents
## When to Switch to Polars
Switch to Polars when:
- Performance is critical
- Working with large datasets (>1GB)
- Need lazy evaluation and query optimization
- Want better type safety
- Need parallel execution by default
- Starting a new project

# Polars Data Transformations
Comprehensive guide to joins, concatenation, and reshaping operations in Polars.
## Joins
Joins combine data from multiple DataFrames based on common columns.
### Basic Join Types
**Inner Join (intersection):**
```python
# Keep only matching rows from both DataFrames
result = df1.join(df2, on="id", how="inner")
```
**Left Join (all left + matches from right):**
```python
# Keep all rows from left, add matching rows from right
result = df1.join(df2, on="id", how="left")
```
**Full (Outer) Join (union):**
```python
# Keep all rows from both DataFrames ("outer" in older Polars versions)
result = df1.join(df2, on="id", how="full")
```
**Cross Join (Cartesian product):**
```python
# Every row from left with every row from right
result = df1.join(df2, how="cross")
```
**Semi Join (filtered left):**
```python
# Keep only left rows that have a match in right
result = df1.join(df2, on="id", how="semi")
```
**Anti Join (non-matching left):**
```python
# Keep only left rows that DON'T have a match in right
result = df1.join(df2, on="id", how="anti")
```
### Join Syntax Variations
**Single column join:**
```python
df1.join(df2, on="id")
```
**Multiple columns join:**
```python
df1.join(df2, on=["id", "date"])
```
**Different column names:**
```python
df1.join(df2, left_on="user_id", right_on="id")
```
**Multiple different columns:**
```python
df1.join(
df2,
left_on=["user_id", "date"],
right_on=["id", "timestamp"]
)
```
### Suffix Handling
When both DataFrames have columns with the same name (other than join keys):
```python
# Add suffixes to distinguish columns
result = df1.join(df2, on="id", suffix="_right")
# Results in: value, value_right (if both had "value" column)
```
### Join Examples
**Example 1: Customer Orders**
```python
customers = pl.DataFrame({
"customer_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Charlie", "David"]
})
orders = pl.DataFrame({
"order_id": [101, 102, 103],
"customer_id": [1, 2, 1],
"amount": [100, 200, 150]
})
# Inner join - only customers with orders
result = customers.join(orders, on="customer_id", how="inner")
# Left join - all customers, even without orders
result = customers.join(orders, on="customer_id", how="left")
```
**Example 2: Time-series data**
```python
prices = pl.DataFrame({
"date": ["2023-01-01", "2023-01-02", "2023-01-03"],
"stock": ["AAPL", "AAPL", "AAPL"],
"price": [150, 152, 151]
})
volumes = pl.DataFrame({
"date": ["2023-01-01", "2023-01-02"],
"stock": ["AAPL", "AAPL"],
"volume": [1000000, 1100000]
})
result = prices.join(
volumes,
on=["date", "stock"],
how="left"
)
```
### Asof Joins (Nearest Match)
For time-series data, join to nearest timestamp:
```python
# Join to nearest earlier timestamp
quotes = pl.DataFrame({
"timestamp": [1, 2, 3, 4, 5],
"stock": ["A", "A", "A", "A", "A"],
"quote": [100, 101, 102, 103, 104]
})
trades = pl.DataFrame({
"timestamp": [1.5, 3.5, 4.2],
"stock": ["A", "A", "A"],
"trade": [50, 75, 100]
})
result = trades.join_asof(
quotes,
on="timestamp",
by="stock",
strategy="backward" # or "forward", "nearest"
)
```
## Concatenation
Concatenation stacks DataFrames together.
### Vertical Concatenation (Stack Rows)
```python
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})
# Stack rows
result = pl.concat([df1, df2], how="vertical")
# Result: 4 rows, same columns
```
**Handling mismatched schemas:**
```python
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "c": [7, 8]})
# Diagonal concat - fills missing columns with nulls
result = pl.concat([df1, df2], how="diagonal")
# Result: columns a, b, c (with nulls where not present)
```
### Horizontal Concatenation (Stack Columns)
```python
df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"b": [4, 5, 6]})
# Stack columns
result = pl.concat([df1, df2], how="horizontal")
# Result: 3 rows, columns a and b
```
**Note:** Horizontal concat requires same number of rows.
### Concatenation Options
```python
# Rechunk after concatenation (better performance for subsequent operations)
result = pl.concat([df1, df2], rechunk=True)
# Parallel execution (the parallel flag only takes effect when the inputs are LazyFrames)
result = pl.concat([df1, df2], parallel=True)
```
### Use Cases
**Combining data from multiple sources:**
```python
# Read multiple files and concatenate
files = ["data_2023.csv", "data_2024.csv", "data_2025.csv"]
dfs = [pl.read_csv(f) for f in files]
combined = pl.concat(dfs, how="vertical")
```
**Adding computed columns:**
```python
base = pl.DataFrame({"value": [1, 2, 3]})
computed = pl.DataFrame({"doubled": [2, 4, 6]})
result = pl.concat([base, computed], how="horizontal")
```
## Pivoting (Wide Format)
Convert unique values from one column into multiple columns.
### Basic Pivot
```python
df = pl.DataFrame({
"date": ["2023-01", "2023-01", "2023-02", "2023-02"],
"product": ["A", "B", "A", "B"],
"sales": [100, 150, 120, 160]
})
# Pivot: products become columns
pivoted = df.pivot(
values="sales",
index="date",
columns="product"
)
# Result:
# date | A | B
# 2023-01 | 100 | 150
# 2023-02 | 120 | 160
```
### Pivot with Aggregation
When there are duplicate combinations, aggregate:
```python
df = pl.DataFrame({
"date": ["2023-01", "2023-01", "2023-01"],
"product": ["A", "A", "B"],
"sales": [100, 110, 150]
})
# Aggregate duplicates
pivoted = df.pivot(
values="sales",
index="date",
columns="product",
aggregate_function="sum" # or "mean", "max", "min", etc.
)
```
### Multiple Index Columns
```python
df = pl.DataFrame({
"region": ["North", "North", "South", "South"],
"date": ["2023-01", "2023-01", "2023-01", "2023-01"],
"product": ["A", "B", "A", "B"],
"sales": [100, 150, 120, 160]
})
pivoted = df.pivot(
values="sales",
index=["region", "date"],
columns="product"
)
```
## Unpivoting/Melting (Long Format)
Convert multiple columns into rows (opposite of pivot).
### Basic Unpivot
```python
df = pl.DataFrame({
"date": ["2023-01", "2023-02"],
"product_A": [100, 120],
"product_B": [150, 160]
})
# Unpivot: convert columns to rows
unpivoted = df.unpivot(
index="date",
on=["product_A", "product_B"]
)
# Result:
# date | variable | value
# 2023-01 | product_A | 100
# 2023-01 | product_B | 150
# 2023-02 | product_A | 120
# 2023-02 | product_B | 160
```
### Custom Column Names
```python
unpivoted = df.unpivot(
index="date",
on=["product_A", "product_B"],
variable_name="product",
value_name="sales"
)
```
### Unpivot by Pattern
```python
# Unpivot all columns matching pattern
df = pl.DataFrame({
"id": [1, 2],
"sales_Q1": [100, 200],
"sales_Q2": [150, 250],
"sales_Q3": [120, 220],
"revenue_Q1": [1000, 2000]
})
# Unpivot all sales columns
unpivoted = df.unpivot(
index="id",
on=pl.col("^sales_.*$")
)
```
## Exploding (Unnesting Lists)
Convert list columns into multiple rows.
### Basic Explode
```python
df = pl.DataFrame({
"id": [1, 2],
"values": [[1, 2, 3], [4, 5]]
})
# Explode list into rows
exploded = df.explode("values")
# Result:
# id | values
# 1 | 1
# 1 | 2
# 1 | 3
# 2 | 4
# 2 | 5
```
### Multiple Column Explode
```python
df = pl.DataFrame({
"id": [1, 2],
"letters": [["a", "b"], ["c", "d"]],
"numbers": [[1, 2], [3, 4]]
})
# Explode multiple columns (must be same length)
exploded = df.explode("letters", "numbers")
```
## Transposing
Swap rows and columns:
```python
df = pl.DataFrame({
"metric": ["sales", "costs", "profit"],
"Q1": [100, 60, 40],
"Q2": [150, 80, 70]
})
# Transpose
transposed = df.transpose(
include_header=True,
header_name="quarter",
column_names="metric"
)
# Result: quarters as rows, metrics as columns
```
## Reshaping Patterns
### Pattern 1: Wide to Long to Wide
```python
# Start wide
wide = pl.DataFrame({
"id": [1, 2],
"A": [10, 20],
"B": [30, 40]
})
# To long
long = wide.unpivot(index="id", on=["A", "B"])
# Back to wide (maybe with transformations)
wide_again = long.pivot(values="value", index="id", columns="variable")
```
### Pattern 2: Nested to Flat
```python
# Nested data
df = pl.DataFrame({
"user": [1, 2],
"purchases": [
[{"item": "A", "qty": 2}, {"item": "B", "qty": 1}],
[{"item": "C", "qty": 3}]
]
})
# Explode and unnest
flat = (
df.explode("purchases")
.unnest("purchases")
)
```
### Pattern 3: Aggregation to Pivot
```python
# Raw data
sales = pl.DataFrame({
"date": ["2023-01", "2023-01", "2023-02"],
"product": ["A", "B", "A"],
"sales": [100, 150, 120]
})
# Aggregate then pivot
result = (
sales
.group_by("date", "product")
.agg(pl.col("sales").sum())
.pivot(values="sales", index="date", columns="product")
)
```
## Advanced Transformations
### Conditional Reshaping
```python
# Pivot only certain values
df.filter(pl.col("year") >= 2020).pivot(...)
# Unpivot with filtering
df.unpivot(index="id", on=pl.col("^sales.*$"))
```
### Multi-level Transformations
```python
# Complex reshaping pipeline
result = (
df
.unpivot(index="id", on=pl.col("^Q[0-9]_.*$"))
.with_columns(
quarter=pl.col("variable").str.extract(r"Q([0-9])", 1),
metric=pl.col("variable").str.extract(r"Q[0-9]_(.*)", 1)
)
.drop("variable")
.pivot(values="value", index=["id", "quarter"], columns="metric")
)
```
## Performance Considerations
### Join Performance
```python
# 1. Join on sorted columns when possible (Polars has no index)
df1_sorted = df1.sort("id")
df2_sorted = df2.sort("id")
result = df1_sorted.join(df2_sorted, on="id")
# 2. Use appropriate join type
# semi/anti are faster than inner+filter
matches = df1.join(df2, on="id", how="semi") # Better than filtering after inner join
# 3. Filter before joining
df1_filtered = df1.filter(pl.col("active"))
result = df1_filtered.join(df2, on="id") # Smaller join
```
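In lazy mode the third point happens largely automatically: the optimizer can push a filter that touches only one input below the join. A sketch with hypothetical file paths, assuming the `active` column comes from the orders side:
```python
result = (
    pl.scan_parquet("orders.parquet")
    .join(pl.scan_parquet("users.parquet"), on="id")
    .filter(pl.col("active"))
    .collect()
)
```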
### Concatenation Performance
```python
# 1. Rechunk after concatenation
result = pl.concat(dfs, rechunk=True)
# 2. Use lazy mode for large concatenations
lf1 = pl.scan_parquet("file1.parquet")
lf2 = pl.scan_parquet("file2.parquet")
result = pl.concat([lf1, lf2]).collect()
```
### Pivot Performance
```python
# 1. Filter before pivoting
pivoted = df.filter(pl.col("year") == 2023).pivot(...)
# 2. Specify aggregate function explicitly
pivoted = df.pivot(..., aggregate_function="first") # Faster than "sum" if only one value
```
## Common Use Cases
### Time Series Alignment
```python
# Align two time series with different timestamps
ts1.join_asof(ts2, on="timestamp", strategy="backward")
```
### Feature Engineering
```python
# Create lag features
df.with_columns(
pl.col("value").shift(1).over("user_id").alias("prev_value"),
pl.col("value").shift(2).over("user_id").alias("prev_prev_value")
)
```
### Data Denormalization
```python
# Combine normalized tables
orders.join(customers, on="customer_id").join(products, on="product_id")
```
### Report Generation
```python
# Pivot for reporting
sales.pivot(values="amount", index="month", columns="product")
```