Initial commit
skills/polars/references/best_practices.md (new file, 649 lines)
# Polars Best Practices and Performance Guide

Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.

## Performance Optimization

### 1. Use Lazy Evaluation

**Always prefer lazy mode for large datasets:**

```python
import polars as pl

# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")

# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```

**Benefits of lazy evaluation:**
- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning
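
A quick way to confirm that these optimizations are applied is to print the query plan before collecting. A minimal sketch, assuming the same `large_file.csv` with `age` and `name` columns:

```python
import polars as pl

lf = pl.scan_csv("large_file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# The optimized plan should show the filter applied at the CSV scan
# and a projection limited to the two selected columns.
print(query.explain())
```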
|
||||
|
||||
### 2. Filter and Select Early
|
||||
|
||||
Push filters and column selection as early as possible in the pipeline:
|
||||
|
||||
```python
|
||||
# Bad: Process all data, then filter and select
|
||||
result = (
|
||||
lf.group_by("category")
|
||||
.agg(pl.col("value").mean())
|
||||
.join(other, on="category")
|
||||
.filter(pl.col("value") > 100)
|
||||
.select("category", "value")
|
||||
)
|
||||
|
||||
# Good: Filter and select early
|
||||
result = (
|
||||
lf.select("category", "value") # Only needed columns
|
||||
.filter(pl.col("value") > 100) # Filter early
|
||||
.group_by("category")
|
||||
.agg(pl.col("value").mean())
|
||||
.join(other.select("category", "other_col"), on="category")
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Avoid Python Functions
|
||||
|
||||
Stay within the expression API to maintain parallelization:
|
||||
|
||||
```python
|
||||
# Bad: Python function disables parallelization
|
||||
df = df.with_columns(
|
||||
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
|
||||
)
|
||||
|
||||
# Good: Use native expressions (parallelized)
|
||||
df = df.with_columns(result=pl.col("value") * 2)
|
||||
```
|
||||
|
||||
**When you must use custom functions:**
|
||||
```python
|
||||
# If truly needed, be explicit
|
||||
df = df.with_columns(
|
||||
result=pl.col("value").map_elements(
|
||||
custom_function,
|
||||
return_dtype=pl.Float64,
|
||||
skip_nulls=True # Optimize null handling
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Use Streaming for Very Large Data
|
||||
|
||||
Enable streaming for datasets larger than RAM:
|
||||
|
||||
```python
|
||||
# Streaming mode processes data in chunks
|
||||
lf = pl.scan_parquet("very_large.parquet")
|
||||
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
|
||||
|
||||
# Or use sink for direct streaming writes
|
||||
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
|
||||
```
|
||||
|
||||
### 5. Optimize Data Types
|
||||
|
||||
Choose appropriate data types to reduce memory and improve performance:
|
||||
|
||||
```python
|
||||
# Bad: Default types may be wasteful
|
||||
df = pl.read_csv("data.csv")
|
||||
|
||||
# Good: Specify optimal types
|
||||
df = pl.read_csv(
|
||||
"data.csv",
|
||||
dtypes={
|
||||
"id": pl.UInt32, # Instead of Int64 if values fit
|
||||
"category": pl.Categorical, # For low-cardinality strings
|
||||
"date": pl.Date, # Instead of String
|
||||
"small_int": pl.Int16, # Instead of Int64
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
**Type optimization guidelines:**
- Use smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags
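
The payoff of narrower types can be checked with `estimated_size()`. A minimal sketch with made-up data (the column names are only illustrative):

```python
import polars as pl

df = pl.DataFrame({
    "id": list(range(1_000_000)),                    # inferred as Int64
    "flag": [i % 2 == 0 for i in range(1_000_000)],
})

before = df.estimated_size("mb")
slim = df.with_columns(pl.col("id").cast(pl.UInt32))  # downcast the id column
after = slim.estimated_size("mb")
print(f"{before:.1f} MB -> {after:.1f} MB")
```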
|
||||
|
||||
### 6. Parallel Operations
|
||||
|
||||
Structure code to maximize parallelization:
|
||||
|
||||
```python
|
||||
# Bad: Sequential pipe operations disable parallelization
|
||||
df = (
|
||||
df.pipe(operation1)
|
||||
.pipe(operation2)
|
||||
.pipe(operation3)
|
||||
)
|
||||
|
||||
# Good: Combined operations enable parallelization
|
||||
df = df.with_columns(
|
||||
result1=operation1_expr(),
|
||||
result2=operation2_expr(),
|
||||
result3=operation3_expr()
|
||||
)
|
||||
```
|
||||
|
||||
### 7. Rechunk After Concatenation
|
||||
|
||||
```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])

# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```
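
Continuing the snippet above, `n_chunks()` shows how fragmented the result is; a sketch, assuming each input frame is a single chunk:

```python
fragmented = pl.concat([df1, df2, df3], rechunk=False)
print(fragmented.n_chunks())  # roughly one chunk per input frame

contiguous = pl.concat([df1, df2, df3], rechunk=True)
print(contiguous.n_chunks())  # 1
```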
|
||||
|
||||
## Expression Patterns
|
||||
|
||||
### Conditional Logic
|
||||
|
||||
**Simple conditions:**
```python
# Wrap string results in pl.lit(); bare strings in then/otherwise are read as column names
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("minor"))
)
```

**Multiple conditions:**
```python
df.with_columns(
    grade=pl.when(pl.col("score") >= 90)
    .then(pl.lit("A"))
    .when(pl.col("score") >= 80)
    .then(pl.lit("B"))
    .when(pl.col("score") >= 70)
    .then(pl.lit("C"))
    .when(pl.col("score") >= 60)
    .then(pl.lit("D"))
    .otherwise(pl.lit("F"))
)
```

**Complex conditions:**
```python
df.with_columns(
    category=pl.when(
        (pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
    )
    .then(pl.lit("enterprise"))
    .when(
        (pl.col("revenue") > 100000) | (pl.col("customers") > 50)
    )
    .then(pl.lit("business"))
    .otherwise(pl.lit("starter"))
)
```
|
||||
|
||||
### Null Handling
|
||||
|
||||
**Check for nulls:**
|
||||
```python
|
||||
df.filter(pl.col("value").is_null())
|
||||
df.filter(pl.col("value").is_not_null())
|
||||
```
|
||||
|
||||
**Fill nulls:**
|
||||
```python
|
||||
# Constant value
|
||||
df.with_columns(pl.col("value").fill_null(0))
|
||||
|
||||
# Forward fill
|
||||
df.with_columns(pl.col("value").fill_null(strategy="forward"))
|
||||
|
||||
# Backward fill
|
||||
df.with_columns(pl.col("value").fill_null(strategy="backward"))
|
||||
|
||||
# Mean
|
||||
df.with_columns(pl.col("value").fill_null(strategy="mean"))
|
||||
|
||||
# Per-group fill
|
||||
df.with_columns(
|
||||
pl.col("value").fill_null(pl.col("value").mean()).over("group")
|
||||
)
|
||||
```
|
||||
|
||||
**Coalesce (first non-null):**
|
||||
```python
|
||||
df.with_columns(
|
||||
combined=pl.coalesce(["col1", "col2", "col3"])
|
||||
)
|
||||
```
|
||||
|
||||
### Column Selection Patterns
|
||||
|
||||
**By name:**
|
||||
```python
|
||||
df.select("col1", "col2", "col3")
|
||||
```
|
||||
|
||||
**By pattern:**
```python
# Regex: patterns must start with ^ and end with $ to be treated as regex
df.select(pl.col("^sales_.*$"))

# Starts with
df.select(pl.col("^sales.*$"))

# Ends with
df.select(pl.col("^.*_total$"))

# Contains
df.select(pl.col("^.*revenue.*$"))
```
|
||||
|
||||
**By type:**
|
||||
```python
|
||||
# All numeric columns
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES))
|
||||
|
||||
# All string columns
|
||||
df.select(pl.col(pl.Utf8))
|
||||
|
||||
# Multiple types
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
|
||||
```
|
||||
|
||||
**Exclude columns:**
|
||||
```python
|
||||
df.select(pl.all().exclude("id", "timestamp"))
|
||||
```
|
||||
|
||||
**Transform multiple columns:**
|
||||
```python
|
||||
# Apply same operation to multiple columns
|
||||
df.select(
|
||||
pl.col("^sales_.*$") * 1.1 # 10% increase to all sales columns
|
||||
)
|
||||
```
|
||||
|
||||
### Aggregation Patterns
|
||||
|
||||
**Multiple aggregations:**
|
||||
```python
|
||||
df.group_by("category").agg(
|
||||
pl.col("value").sum().alias("total"),
|
||||
pl.col("value").mean().alias("average"),
|
||||
pl.col("value").std().alias("std_dev"),
|
||||
pl.col("id").count().alias("count"),
|
||||
pl.col("id").n_unique().alias("unique_count"),
|
||||
pl.col("value").min().alias("minimum"),
|
||||
pl.col("value").max().alias("maximum"),
|
||||
pl.col("value").quantile(0.5).alias("median"),
|
||||
pl.col("value").quantile(0.95).alias("p95")
|
||||
)
|
||||
```
|
||||
|
||||
**Conditional aggregations:**
|
||||
```python
|
||||
df.group_by("category").agg(
|
||||
# Count high values
|
||||
(pl.col("value") > 100).sum().alias("high_count"),
|
||||
|
||||
# Average of filtered values
|
||||
pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
|
||||
|
||||
# Conditional sum
|
||||
pl.when(pl.col("status") == "completed")
|
||||
.then(pl.col("amount"))
|
||||
.otherwise(0)
|
||||
.sum()
|
||||
.alias("completed_total")
|
||||
)
|
||||
```
|
||||
|
||||
**Grouped transformations:**
|
||||
```python
|
||||
df.with_columns(
|
||||
# Group statistics
|
||||
group_mean=pl.col("value").mean().over("category"),
|
||||
group_std=pl.col("value").std().over("category"),
|
||||
|
||||
# Rank within groups
|
||||
rank=pl.col("value").rank().over("category"),
|
||||
|
||||
# Percentage of group total
|
||||
pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
|
||||
)
|
||||
```
|
||||
|
||||
## Common Pitfalls and Anti-Patterns
|
||||
|
||||
### Pitfall 1: Row Iteration
|
||||
|
||||
```python
|
||||
# Bad: Never iterate rows
|
||||
for row in df.iter_rows():
|
||||
# Process row
|
||||
result = row[0] * 2
|
||||
|
||||
# Good: Use vectorized operations
|
||||
df = df.with_columns(result=pl.col("value") * 2)
|
||||
```
|
||||
|
||||
### Pitfall 2: Modifying in Place
|
||||
|
||||
```python
|
||||
# Bad: Polars DataFrames do not support item assignment
df["new_col"] = df["old_col"] * 2  # raises a TypeError

# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)
|
||||
```
|
||||
|
||||
### Pitfall 3: Not Using Expressions
|
||||
|
||||
```python
|
||||
# Bad: String-based operations
|
||||
df.select("value * 2") # Won't work
|
||||
|
||||
# Good: Expression-based
|
||||
df.select(pl.col("value") * 2)
|
||||
```
|
||||
|
||||
### Pitfall 4: Inefficient Joins
|
||||
|
||||
```python
|
||||
# Bad: Join large tables without filtering
|
||||
result = large_df1.join(large_df2, on="id")
|
||||
|
||||
# Good: Filter before joining
|
||||
result = (
|
||||
large_df1.filter(pl.col("active"))
|
||||
.join(
|
||||
large_df2.filter(pl.col("status") == "valid"),
|
||||
on="id"
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Pitfall 5: Not Specifying Types
|
||||
|
||||
```python
|
||||
# Bad: Let Polars infer everything
|
||||
df = pl.read_csv("data.csv")
|
||||
|
||||
# Good: Specify types for correctness and performance
|
||||
df = pl.read_csv(
|
||||
"data.csv",
|
||||
dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
|
||||
)
|
||||
```
|
||||
|
||||
### Pitfall 6: Creating Many Small DataFrames
|
||||
|
||||
```python
|
||||
# Bad: Many operations creating intermediate DataFrames
|
||||
df1 = df.filter(pl.col("age") > 25)
|
||||
df2 = df1.select("name", "age")
|
||||
df3 = df2.sort("age")
|
||||
result = df3.head(10)
|
||||
|
||||
# Good: Chain operations
|
||||
result = (
|
||||
df.filter(pl.col("age") > 25)
|
||||
.select("name", "age")
|
||||
.sort("age")
|
||||
.head(10)
|
||||
)
|
||||
|
||||
# Better: Use lazy mode
|
||||
result = (
|
||||
df.lazy()
|
||||
.filter(pl.col("age") > 25)
|
||||
.select("name", "age")
|
||||
.sort("age")
|
||||
.head(10)
|
||||
.collect()
|
||||
)
|
||||
```
|
||||
|
||||
## Memory Management
|
||||
|
||||
### Monitor Memory Usage
|
||||
|
||||
```python
|
||||
# Check DataFrame size
|
||||
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
|
||||
|
||||
# Profile memory during operations
|
||||
lf = pl.scan_csv("large.csv")
|
||||
print(lf.explain()) # See query plan
|
||||
```
|
||||
|
||||
### Reduce Memory Footprint
|
||||
|
||||
```python
|
||||
# 1. Use lazy mode
|
||||
lf = pl.scan_parquet("data.parquet")
|
||||
|
||||
# 2. Stream results
|
||||
result = lf.collect(streaming=True)
|
||||
|
||||
# 3. Select only needed columns
|
||||
lf = lf.select("col1", "col2")
|
||||
|
||||
# 4. Optimize data types
|
||||
df = df.with_columns(
|
||||
pl.col("int_col").cast(pl.Int32), # Downcast if possible
|
||||
pl.col("category").cast(pl.Categorical) # For low cardinality
|
||||
)
|
||||
|
||||
# 5. Drop columns not needed
|
||||
df = df.drop("large_text_col", "unused_col")
|
||||
```
|
||||
|
||||
## Testing and Debugging
|
||||
|
||||
### Inspect Query Plans
|
||||
|
||||
```python
|
||||
lf = pl.scan_csv("data.csv")
|
||||
query = lf.filter(pl.col("age") > 25).select("name", "age")
|
||||
|
||||
# View the optimized query plan (optimization is applied by default)
print(query.explain())

# View the plan before optimization for comparison
print(query.explain(optimized=False))
|
||||
```
|
||||
|
||||
### Sample Data for Development
|
||||
|
||||
```python
|
||||
# Use n_rows for testing
|
||||
df = pl.read_csv("large.csv", n_rows=1000)
|
||||
|
||||
# Or sample after reading
|
||||
df_sample = df.sample(n=1000, seed=42)
|
||||
```
|
||||
|
||||
### Validate Schemas
|
||||
|
||||
```python
|
||||
# Check schema
|
||||
print(df.schema)
|
||||
|
||||
# Ensure schema matches expectation
|
||||
expected_schema = {
|
||||
"id": pl.Int64,
|
||||
"name": pl.Utf8,
|
||||
"date": pl.Date
|
||||
}
|
||||
|
||||
assert df.schema == expected_schema
|
||||
```
|
||||
|
||||
### Profile Performance
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
# Time operations
|
||||
start = time.time()
|
||||
result = lf.collect()
|
||||
print(f"Execution time: {time.time() - start:.2f}s")
|
||||
|
||||
# Compare eager vs lazy
|
||||
start = time.time()
|
||||
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
|
||||
eager_time = time.time() - start
|
||||
|
||||
start = time.time()
|
||||
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
|
||||
lazy_time = time.time() - start
|
||||
|
||||
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
|
||||
```
|
||||
|
||||
## File Format Best Practices
|
||||
|
||||
### Choose the Right Format
|
||||
|
||||
**Parquet:**
|
||||
- Best for: Large datasets, archival, data lakes
|
||||
- Pros: Excellent compression, columnar, fast reads
|
||||
- Cons: Not human-readable
|
||||
|
||||
**CSV:**
|
||||
- Best for: Small datasets, human inspection, legacy systems
|
||||
- Pros: Universal, human-readable
|
||||
- Cons: Slow, large file size, no type preservation
|
||||
|
||||
**Arrow IPC:**
|
||||
- Best for: Inter-process communication, temporary storage
|
||||
- Pros: Fastest, zero-copy, preserves all types
|
||||
- Cons: Less compression than Parquet
|
||||
|
||||
### File Reading Best Practices
|
||||
|
||||
```python
|
||||
# 1. Use lazy reading
|
||||
lf = pl.scan_parquet("data.parquet") # Not read_parquet
|
||||
|
||||
# 2. Read multiple files efficiently
|
||||
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
|
||||
|
||||
# 3. Specify schema when known
|
||||
lf = pl.scan_csv(
|
||||
"data.csv",
|
||||
dtypes={"id": pl.Int64, "date": pl.Date}
|
||||
)
|
||||
|
||||
# 4. Use predicate pushdown
|
||||
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
|
||||
```
|
||||
|
||||
### File Writing Best Practices
|
||||
|
||||
```python
|
||||
# 1. Use Parquet for large data
|
||||
df.write_parquet("output.parquet", compression="zstd")
|
||||
|
||||
# 2. Partition large datasets
|
||||
df.write_parquet("output", partition_by=["year", "month"])
|
||||
|
||||
# 3. Use streaming for very large writes
|
||||
lf.sink_parquet("output.parquet") # Streaming write
|
||||
|
||||
# 4. Optimize compression
|
||||
df.write_parquet(
|
||||
"output.parquet",
|
||||
compression="snappy", # Fast compression
|
||||
statistics=True # Enable predicate pushdown on read
|
||||
)
|
||||
```
|
||||
|
||||
## Code Organization
|
||||
|
||||
### Reusable Expressions
|
||||
|
||||
```python
|
||||
# Define reusable expressions
# (pl.lit so the strings are treated as values, not column names)
age_group = (
    pl.when(pl.col("age") < 18)
    .then(pl.lit("minor"))
    .when(pl.col("age") < 65)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("senior"))
)
|
||||
|
||||
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
|
||||
|
||||
# Use in multiple contexts
|
||||
df = df.with_columns(
|
||||
age_group=age_group,
|
||||
rpc=revenue_per_customer
|
||||
)
|
||||
|
||||
# Reuse in filtering
|
||||
df = df.filter(revenue_per_customer > 100)
|
||||
```
|
||||
|
||||
### Pipeline Functions
|
||||
|
||||
```python
|
||||
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
|
||||
"""Clean and standardize data."""
|
||||
return lf.with_columns(
|
||||
pl.col("name").str.to_uppercase(),
|
||||
pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
|
||||
pl.col("amount").fill_null(0)
|
||||
)
|
||||
|
||||
def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
|
||||
"""Add computed features."""
|
||||
return lf.with_columns(
|
||||
month=pl.col("date").dt.month(),
|
||||
year=pl.col("date").dt.year(),
|
||||
amount_log=pl.col("amount").log()
|
||||
)
|
||||
|
||||
# Compose pipeline
|
||||
result = (
|
||||
pl.scan_csv("data.csv")
|
||||
.pipe(clean_data)
|
||||
.pipe(add_features)
|
||||
.filter(pl.col("year") == 2023)
|
||||
.collect()
|
||||
)
|
||||
```
|
||||
|
||||
## Documentation
|
||||
|
||||
Always document complex expressions and transformations:
|
||||
|
||||
```python
|
||||
# Good: Document intent
|
||||
df = df.with_columns(
|
||||
# Calculate customer lifetime value as sum of purchases
|
||||
# divided by months since first purchase
|
||||
clv=(
|
||||
pl.col("total_purchases") /
|
||||
((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
|
||||
.dt.total_days() / 30)
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Version Compatibility
|
||||
|
||||
```python
|
||||
# Check Polars version
|
||||
import polars as pl
|
||||
print(pl.__version__)
|
||||
|
||||
# Feature availability varies by version
|
||||
# Document version requirements for production code
|
||||
```
|
||||
skills/polars/references/core_concepts.md (new file, 378 lines)
# Polars Core Concepts
|
||||
|
||||
## Expressions
|
||||
|
||||
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
|
||||
|
||||
### What are Expressions?
|
||||
|
||||
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
|
||||
- `select()` - Select and transform columns
|
||||
- `with_columns()` - Add or modify columns
|
||||
- `filter()` - Filter rows
|
||||
- `group_by().agg()` - Aggregate data
|
||||
|
||||
### Expression Syntax
|
||||
|
||||
**Basic column reference:**
|
||||
```python
|
||||
pl.col("column_name")
|
||||
```
|
||||
|
||||
**Computed expressions:**
|
||||
```python
|
||||
# Arithmetic
|
||||
pl.col("height") * 2
|
||||
pl.col("price") + pl.col("tax")
|
||||
|
||||
# With alias
|
||||
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
|
||||
|
||||
# Method chaining
|
||||
pl.col("name").str.to_uppercase().str.slice(0, 3)
|
||||
```
|
||||
|
||||
### Expression Contexts
|
||||
|
||||
**Select context:**
|
||||
```python
|
||||
df.select(
|
||||
"name", # Simple column name
|
||||
pl.col("age"), # Expression
|
||||
(pl.col("age") * 12).alias("age_in_months") # Computed expression
|
||||
)
|
||||
```
|
||||
|
||||
**With_columns context:**
|
||||
```python
|
||||
df.with_columns(
|
||||
age_doubled=pl.col("age") * 2,
|
||||
name_upper=pl.col("name").str.to_uppercase()
|
||||
)
|
||||
```
|
||||
|
||||
**Filter context:**
|
||||
```python
|
||||
df.filter(
|
||||
pl.col("age") > 25,
|
||||
pl.col("city").is_in(["NY", "LA", "SF"])
|
||||
)
|
||||
```
|
||||
|
||||
**Group_by context:**
|
||||
```python
|
||||
df.group_by("department").agg(
|
||||
pl.col("salary").mean(),
|
||||
pl.col("employee_id").count()
|
||||
)
|
||||
```
|
||||
|
||||
### Expression Expansion
|
||||
|
||||
Apply operations to multiple columns at once:
|
||||
|
||||
**All columns:**
|
||||
```python
|
||||
df.select(pl.all() * 2)
|
||||
```
|
||||
|
||||
**Pattern matching:**
|
||||
```python
|
||||
# All columns ending with "_value"
|
||||
df.select(pl.col("^.*_value$") * 100)
|
||||
|
||||
# All numeric columns
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES) + 1)
|
||||
```
|
||||
|
||||
**Exclude patterns:**
|
||||
```python
|
||||
df.select(pl.all().exclude("id", "name"))
|
||||
```
|
||||
|
||||
### Expression Composition
|
||||
|
||||
Expressions can be stored and reused:
|
||||
|
||||
```python
|
||||
# Define reusable expressions
|
||||
age_expression = pl.col("age") * 12
|
||||
name_expression = pl.col("name").str.to_uppercase()
|
||||
|
||||
# Use in multiple contexts
|
||||
df.select(age_expression, name_expression)
|
||||
df.with_columns(age_months=age_expression)
|
||||
```
|
||||
|
||||
## Data Types
|
||||
|
||||
Polars has a strict type system based on Apache Arrow.
|
||||
|
||||
### Core Data Types
|
||||
|
||||
**Numeric:**
- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
- `Float32`, `Float64` - Floating point numbers

**Text:**
- `Utf8` / `String` - UTF-8 encoded strings
- `Categorical` - Categorized strings (low cardinality)
- `Enum` - Fixed set of string values

**Temporal:**
- `Date` - Calendar date (no time)
- `Datetime` - Date and time with optional timezone
- `Time` - Time of day
- `Duration` - Time duration/difference

**Boolean:**
- `Boolean` - True/False values

**Nested:**
- `List` - Variable-length lists
- `Array` - Fixed-length arrays
- `Struct` - Nested record structures

**Other:**
- `Binary` - Binary data
- `Object` - Python objects (avoid in production)
- `Null` - Null type
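
As a small sketch, several of these types can be set explicitly when constructing a frame (the values here are made up):

```python
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "tag": ["a", "b", "a"],
        "when": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
        "scores": [[1.0, 2.0], [3.0], []],
    },
    schema={
        "id": pl.UInt16,                # small unsigned integer
        "tag": pl.Categorical,          # low-cardinality strings
        "when": pl.Date,                # calendar date
        "scores": pl.List(pl.Float64),  # nested list type
    },
)
print(df.schema)
```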
|
||||
|
||||
### Type Casting
|
||||
|
||||
Convert between types explicitly:
|
||||
|
||||
```python
|
||||
# Cast to different type
|
||||
df.select(
|
||||
pl.col("age").cast(pl.Float64),
|
||||
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
|
||||
pl.col("id").cast(pl.Utf8)
|
||||
)
|
||||
```
|
||||
|
||||
### Null Handling
|
||||
|
||||
Polars uses consistent null handling across all types:
|
||||
|
||||
**Check for nulls:**
|
||||
```python
|
||||
df.filter(pl.col("value").is_null())
|
||||
df.filter(pl.col("value").is_not_null())
|
||||
```
|
||||
|
||||
**Fill nulls:**
|
||||
```python
|
||||
pl.col("value").fill_null(0)
|
||||
pl.col("value").fill_null(strategy="forward")
|
||||
pl.col("value").fill_null(strategy="backward")
|
||||
pl.col("value").fill_null(strategy="mean")
|
||||
```
|
||||
|
||||
**Drop nulls:**
|
||||
```python
|
||||
df.drop_nulls() # Drop any row with nulls
|
||||
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
|
||||
```
|
||||
|
||||
### Categorical Data
|
||||
|
||||
Use categorical types for string columns with low cardinality (repeated values):
|
||||
|
||||
```python
|
||||
# Cast to categorical
|
||||
df.with_columns(
|
||||
pl.col("category").cast(pl.Categorical)
|
||||
)
|
||||
|
||||
# Benefits:
|
||||
# - Reduced memory usage
|
||||
# - Faster grouping and joining
|
||||
# - Maintains order information
|
||||
```
|
||||
|
||||
## Lazy vs Eager Evaluation
|
||||
|
||||
Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).
|
||||
|
||||
### Eager Evaluation (DataFrame)
|
||||
|
||||
Operations execute immediately:
|
||||
|
||||
```python
|
||||
import polars as pl
|
||||
|
||||
# DataFrame operations execute right away
|
||||
df = pl.read_csv("data.csv") # Reads file immediately
|
||||
result = df.filter(pl.col("age") > 25) # Filters immediately
|
||||
final = result.select("name", "age") # Selects immediately
|
||||
```
|
||||
|
||||
**When to use eager:**
|
||||
- Small datasets that fit in memory
|
||||
- Interactive exploration in notebooks
|
||||
- Simple one-off operations
|
||||
- Immediate feedback needed
|
||||
|
||||
### Lazy Evaluation (LazyFrame)
|
||||
|
||||
Operations build a query plan, optimized before execution:
|
||||
|
||||
```python
|
||||
import polars as pl
|
||||
|
||||
# LazyFrame operations build a query plan
|
||||
lf = pl.scan_csv("data.csv") # Doesn't read yet
|
||||
lf2 = lf.filter(pl.col("age") > 25) # Adds to plan
|
||||
lf3 = lf2.select("name", "age") # Adds to plan
|
||||
df = lf3.collect() # NOW executes optimized plan
|
||||
```
|
||||
|
||||
**When to use lazy:**
|
||||
- Large datasets
|
||||
- Complex query pipelines
|
||||
- Only need subset of data
|
||||
- Performance is critical
|
||||
- Streaming required
|
||||
|
||||
### Query Optimization
|
||||
|
||||
Polars automatically optimizes lazy queries:
|
||||
|
||||
**Predicate Pushdown:**
|
||||
Filter operations pushed to data source when possible:
|
||||
```python
|
||||
# Only reads rows where age > 25 from CSV
|
||||
lf = pl.scan_csv("data.csv")
|
||||
result = lf.filter(pl.col("age") > 25).collect()
|
||||
```
|
||||
|
||||
**Projection Pushdown:**
|
||||
Only read needed columns from data source:
|
||||
```python
|
||||
# Only reads "name" and "age" columns from CSV
|
||||
lf = pl.scan_csv("data.csv")
|
||||
result = lf.select("name", "age").collect()
|
||||
```
|
||||
|
||||
**Query Plan Inspection:**
|
||||
```python
|
||||
# View the optimized query plan
|
||||
lf = pl.scan_csv("data.csv")
|
||||
result = lf.filter(pl.col("age") > 25).select("name", "age")
|
||||
print(result.explain()) # Shows optimized plan
|
||||
```
|
||||
|
||||
### Streaming Mode
|
||||
|
||||
Process data larger than memory:
|
||||
|
||||
```python
|
||||
# Enable streaming for very large datasets
|
||||
lf = pl.scan_csv("very_large.csv")
|
||||
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
|
||||
```
|
||||
|
||||
**Streaming benefits:**
|
||||
- Process data larger than RAM
|
||||
- Lower peak memory usage
|
||||
- Chunk-based processing
|
||||
- Automatic memory management
|
||||
|
||||
**Streaming limitations:**
|
||||
- Not all operations support streaming
|
||||
- May be slower for small data
|
||||
- Some operations require materializing entire dataset
|
||||
|
||||
### Converting Between Eager and Lazy
|
||||
|
||||
**Eager to Lazy:**
|
||||
```python
|
||||
df = pl.read_csv("data.csv")
|
||||
lf = df.lazy() # Convert to LazyFrame
|
||||
```
|
||||
|
||||
**Lazy to Eager:**
|
||||
```python
|
||||
lf = pl.scan_csv("data.csv")
|
||||
df = lf.collect() # Execute and return DataFrame
|
||||
```
|
||||
|
||||
## Memory Format
|
||||
|
||||
Polars uses Apache Arrow columnar memory format:
|
||||
|
||||
**Benefits:**
|
||||
- Zero-copy data sharing with other Arrow libraries
|
||||
- Efficient columnar operations
|
||||
- SIMD vectorization
|
||||
- Reduced memory overhead
|
||||
- Fast serialization
|
||||
|
||||
**Implications:**
- Data stored column-wise, not row-wise
- Column operations very fast
- Random row access slower than pandas
- Best for analytical workloads
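
Because the buffers are Arrow arrays, handing data to other Arrow-aware libraries is cheap. A minimal sketch:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

arrow_table = df.to_arrow()        # a pyarrow.Table sharing memory where possible
round_trip = pl.from_arrow(arrow_table)
print(type(arrow_table), round_trip.shape)
```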
|
||||
|
||||
## Parallelization
|
||||
|
||||
Polars parallelizes operations automatically using Rust's concurrency:
|
||||
|
||||
**What gets parallelized:**
|
||||
- Aggregations within groups
|
||||
- Window functions
|
||||
- Most expression evaluations
|
||||
- File reading (multiple files)
|
||||
- Join operations
|
||||
|
||||
**What to avoid for parallelization:**
|
||||
- Python user-defined functions (UDFs)
|
||||
- Lambda functions in `.map_elements()`
|
||||
- Sequential `.pipe()` chains
|
||||
|
||||
**Best practice:**
|
||||
```python
|
||||
# Good: Stays in expression API (parallelized)
|
||||
df.with_columns(
|
||||
pl.col("value") * 10,
|
||||
pl.col("value").log(),
|
||||
pl.col("value").sqrt()
|
||||
)
|
||||
|
||||
# Bad: Uses Python function (sequential)
|
||||
df.with_columns(
|
||||
pl.col("value").map_elements(lambda x: x * 10)
|
||||
)
|
||||
```
|
||||
|
||||
## Strict Type System
|
||||
|
||||
Polars enforces strict typing:
|
||||
|
||||
**No silent conversions:**
|
||||
```python
|
||||
# This will error - can't mix types
|
||||
# df.with_columns(pl.col("int_col") + "string")
|
||||
|
||||
# Must cast explicitly
|
||||
df.with_columns(
|
||||
pl.col("int_col").cast(pl.Utf8) + "_suffix"
|
||||
)
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- Prevents silent bugs
|
||||
- Predictable behavior
|
||||
- Better performance
|
||||
- Clearer code intent
|
||||
|
||||
**Integer nulls:**
|
||||
Unlike pandas, integer columns can have nulls without converting to float:
|
||||
```python
|
||||
# In pandas: Int column with null becomes Float
|
||||
# In polars: Int column with null stays Int (with null values)
|
||||
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
|
||||
# dtype: Int64 (not Float64)
|
||||
```
|
||||
skills/polars/references/io_guide.md (new file, 557 lines)
# Polars Data I/O Guide
|
||||
|
||||
Comprehensive guide to reading and writing data in various formats with Polars.
|
||||
|
||||
## CSV Files
|
||||
|
||||
### Reading CSV
|
||||
|
||||
**Eager mode (loads into memory):**
|
||||
```python
|
||||
import polars as pl
|
||||
|
||||
# Basic read
|
||||
df = pl.read_csv("data.csv")
|
||||
|
||||
# With options
|
||||
df = pl.read_csv(
|
||||
"data.csv",
|
||||
separator=",",
|
||||
has_header=True,
|
||||
columns=["col1", "col2"], # Select specific columns
|
||||
n_rows=1000, # Read only first 1000 rows
|
||||
skip_rows=10, # Skip first 10 rows
|
||||
dtypes={"col1": pl.Int64, "col2": pl.Utf8}, # Specify types
|
||||
null_values=["NA", "null", ""], # Define null values
|
||||
encoding="utf-8",
|
||||
ignore_errors=False
|
||||
)
|
||||
```
|
||||
|
||||
**Lazy mode (scans without loading - recommended for large files):**
|
||||
```python
|
||||
# Scan CSV (builds query plan)
|
||||
lf = pl.scan_csv("data.csv")
|
||||
|
||||
# Apply operations
|
||||
result = lf.filter(pl.col("age") > 25).select("name", "age")
|
||||
|
||||
# Execute and load
|
||||
df = result.collect()
|
||||
```
|
||||
|
||||
### Writing CSV
|
||||
|
||||
```python
|
||||
# Basic write
|
||||
df.write_csv("output.csv")
|
||||
|
||||
# With options
|
||||
df.write_csv(
|
||||
"output.csv",
|
||||
separator=",",
|
||||
include_header=True,
|
||||
null_value="", # How to represent nulls
|
||||
quote_char='"',
|
||||
line_terminator="\n"
|
||||
)
|
||||
```
|
||||
|
||||
### Multiple CSV Files
|
||||
|
||||
**Read multiple files:**
|
||||
```python
|
||||
# Read all CSVs in directory
|
||||
lf = pl.scan_csv("data/*.csv")
|
||||
|
||||
# Read specific files
|
||||
lf = pl.scan_csv(["file1.csv", "file2.csv", "file3.csv"])
|
||||
```
|
||||
|
||||
## Parquet Files
|
||||
|
||||
Parquet is the recommended format for performance and compression.
|
||||
|
||||
### Reading Parquet
|
||||
|
||||
**Eager:**
|
||||
```python
|
||||
df = pl.read_parquet("data.parquet")
|
||||
|
||||
# With options
|
||||
df = pl.read_parquet(
|
||||
"data.parquet",
|
||||
columns=["col1", "col2"], # Select specific columns
|
||||
n_rows=1000, # Read first N rows
|
||||
parallel="auto" # Control parallelization
|
||||
)
|
||||
```
|
||||
|
||||
**Lazy (recommended):**
|
||||
```python
|
||||
lf = pl.scan_parquet("data.parquet")
|
||||
|
||||
# Automatic predicate and projection pushdown
|
||||
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
|
||||
```
|
||||
|
||||
### Writing Parquet
|
||||
|
||||
```python
|
||||
# Basic write
|
||||
df.write_parquet("output.parquet")
|
||||
|
||||
# With compression
|
||||
df.write_parquet(
|
||||
"output.parquet",
|
||||
compression="snappy", # Options: "snappy", "gzip", "brotli", "lz4", "zstd"
|
||||
statistics=True, # Write statistics (enables predicate pushdown)
|
||||
use_pyarrow=False # Use Rust writer (faster)
|
||||
)
|
||||
```
|
||||
|
||||
### Partitioned Parquet (Hive-style)
|
||||
|
||||
**Write partitioned:**
|
||||
```python
|
||||
# Write with partitioning
|
||||
df.write_parquet(
|
||||
"output_dir",
|
||||
partition_by=["year", "month"] # Creates directory structure
|
||||
)
|
||||
# Creates: output_dir/year=2023/month=01/data.parquet
|
||||
```
|
||||
|
||||
**Read partitioned:**
|
||||
```python
|
||||
lf = pl.scan_parquet("output_dir/**/*.parquet")
|
||||
|
||||
# Hive partitioning columns are automatically added
|
||||
result = lf.filter(pl.col("year") == 2023).collect()
|
||||
```
|
||||
|
||||
## JSON Files
|
||||
|
||||
### Reading JSON
|
||||
|
||||
**NDJSON (newline-delimited JSON) - recommended:**
|
||||
```python
|
||||
df = pl.read_ndjson("data.ndjson")
|
||||
|
||||
# Lazy
|
||||
lf = pl.scan_ndjson("data.ndjson")
|
||||
```
|
||||
|
||||
**Standard JSON:**
|
||||
```python
|
||||
df = pl.read_json("data.json")
|
||||
|
||||
# From a JSON string (wrap it in a buffer; a plain str is treated as a file path)
import io
df = pl.read_json(io.StringIO('{"col1": [1, 2], "col2": ["a", "b"]}'))
|
||||
```
|
||||
|
||||
### Writing JSON
|
||||
|
||||
```python
|
||||
# Write NDJSON
|
||||
df.write_ndjson("output.ndjson")
|
||||
|
||||
# Write standard JSON
|
||||
df.write_json("output.json")
|
||||
|
||||
# Pretty printed
|
||||
df.write_json("output.json", pretty=True, row_oriented=False)
|
||||
```
|
||||
|
||||
## Excel Files
|
||||
|
||||
### Reading Excel
|
||||
|
||||
```python
|
||||
# Read first sheet
|
||||
df = pl.read_excel("data.xlsx")
|
||||
|
||||
# Specific sheet
|
||||
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
|
||||
# Or by 1-based index (sheet_id=0 returns all sheets as a dict)
df = pl.read_excel("data.xlsx", sheet_id=1)
|
||||
|
||||
# With options
|
||||
df = pl.read_excel(
|
||||
"data.xlsx",
|
||||
sheet_name="Sheet1",
|
||||
columns=["A", "B", "C"], # Excel columns
|
||||
n_rows=100,
|
||||
skip_rows=5,
|
||||
has_header=True
|
||||
)
|
||||
```
|
||||
|
||||
### Writing Excel
|
||||
|
||||
```python
|
||||
# Write to Excel
|
||||
df.write_excel("output.xlsx")
|
||||
|
||||
# Multiple sheets: pass an xlsxwriter workbook to write_excel
import xlsxwriter

with xlsxwriter.Workbook("output.xlsx") as workbook:
    df1.write_excel(workbook=workbook, worksheet="Sheet1")
    df2.write_excel(workbook=workbook, worksheet="Sheet2")
|
||||
```
|
||||
|
||||
## Database Connectivity
|
||||
|
||||
### Read from Database
|
||||
|
||||
```python
|
||||
import polars as pl
|
||||
|
||||
# Read a query using an existing connection object
# (e.g. a SQLAlchemy engine or DB-API connection created elsewhere)
df = pl.read_database("SELECT * FROM users", connection=engine)
|
||||
|
||||
# Using connectorx for better performance
|
||||
df = pl.read_database_uri(
|
||||
"SELECT * FROM users WHERE age > 25",
|
||||
uri="postgresql://user:pass@localhost/db"
|
||||
)
|
||||
```
|
||||
|
||||
### Write to Database
|
||||
|
||||
```python
|
||||
# Using SQLAlchemy
|
||||
from sqlalchemy import create_engine
|
||||
|
||||
engine = create_engine("postgresql://user:pass@localhost/db")
|
||||
df.write_database("table_name", connection=engine)
|
||||
|
||||
# With options
|
||||
df.write_database(
|
||||
"table_name",
|
||||
connection=engine,
|
||||
if_exists="replace", # or "append", "fail"
|
||||
)
|
||||
```
|
||||
|
||||
### Common Database Connectors
|
||||
|
||||
**PostgreSQL:**
|
||||
```python
|
||||
uri = "postgresql://username:password@localhost:5432/database"
|
||||
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
|
||||
```
|
||||
|
||||
**MySQL:**
|
||||
```python
|
||||
uri = "mysql://username:password@localhost:3306/database"
|
||||
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
|
||||
```
|
||||
|
||||
**SQLite:**
|
||||
```python
|
||||
uri = "sqlite:///path/to/database.db"
|
||||
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
|
||||
```
|
||||
|
||||
## Cloud Storage
|
||||
|
||||
### AWS S3
|
||||
|
||||
```python
|
||||
# Read from S3
|
||||
df = pl.read_parquet("s3://bucket/path/to/file.parquet")
|
||||
lf = pl.scan_parquet("s3://bucket/path/*.parquet")
|
||||
|
||||
# Write to S3
|
||||
df.write_parquet("s3://bucket/path/output.parquet")
|
||||
|
||||
# With credentials
|
||||
import os
|
||||
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
|
||||
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
|
||||
os.environ["AWS_REGION"] = "us-west-2"
|
||||
|
||||
df = pl.read_parquet("s3://bucket/file.parquet")
|
||||
```
|
||||
|
||||
### Azure Blob Storage
|
||||
|
||||
```python
|
||||
# Read from Azure
|
||||
df = pl.read_parquet("az://container/path/file.parquet")
|
||||
|
||||
# Write to Azure
|
||||
df.write_parquet("az://container/path/output.parquet")
|
||||
|
||||
# With credentials
|
||||
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "account"
|
||||
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "key"
|
||||
```
|
||||
|
||||
### Google Cloud Storage (GCS)
|
||||
|
||||
```python
|
||||
# Read from GCS
|
||||
df = pl.read_parquet("gs://bucket/path/file.parquet")
|
||||
|
||||
# Write to GCS
|
||||
df.write_parquet("gs://bucket/path/output.parquet")
|
||||
|
||||
# With credentials
|
||||
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"
|
||||
```
|
||||
|
||||
## Google BigQuery
|
||||
|
||||
```python
|
||||
# Read from BigQuery through a connector URI
df = pl.read_database_uri(
    "SELECT * FROM project.dataset.table",
    uri="bigquery://project"
)
|
||||
|
||||
# Or using Google Cloud SDK
|
||||
from google.cloud import bigquery
|
||||
client = bigquery.Client()
|
||||
|
||||
query = "SELECT * FROM project.dataset.table WHERE date > '2023-01-01'"
|
||||
df = pl.from_pandas(client.query(query).to_dataframe())
|
||||
```
|
||||
|
||||
## Apache Arrow
|
||||
|
||||
### IPC/Feather Format
|
||||
|
||||
**Read:**
|
||||
```python
|
||||
df = pl.read_ipc("data.arrow")
|
||||
lf = pl.scan_ipc("data.arrow")
|
||||
```
|
||||
|
||||
**Write:**
|
||||
```python
|
||||
df.write_ipc("output.arrow")
|
||||
|
||||
# Compressed
|
||||
df.write_ipc("output.arrow", compression="zstd")
|
||||
```
|
||||
|
||||
### Arrow Streaming
|
||||
|
||||
```python
|
||||
# Write streaming format
|
||||
df.write_ipc("output.arrows", compression="zstd")
|
||||
|
||||
# Read streaming
|
||||
df = pl.read_ipc("output.arrows")
|
||||
```
|
||||
|
||||
### From/To Arrow
|
||||
|
||||
```python
|
||||
import pyarrow as pa
|
||||
|
||||
# From Arrow Table
|
||||
arrow_table = pa.table({"col": [1, 2, 3]})
|
||||
df = pl.from_arrow(arrow_table)
|
||||
|
||||
# To Arrow Table
|
||||
arrow_table = df.to_arrow()
|
||||
```
|
||||
|
||||
## In-Memory Formats
|
||||
|
||||
### Python Dictionaries
|
||||
|
||||
```python
|
||||
# From dict
|
||||
df = pl.DataFrame({
|
||||
"col1": [1, 2, 3],
|
||||
"col2": ["a", "b", "c"]
|
||||
})
|
||||
|
||||
# To dict
|
||||
data_dict = df.to_dict() # Column-oriented
|
||||
data_dict = df.to_dict(as_series=False) # Lists instead of Series
|
||||
```
|
||||
|
||||
### NumPy Arrays
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# From NumPy
|
||||
arr = np.array([[1, 2], [3, 4], [5, 6]])
|
||||
df = pl.DataFrame(arr, schema=["col1", "col2"])
|
||||
|
||||
# To NumPy
|
||||
arr = df.to_numpy()
|
||||
```
|
||||
|
||||
### Pandas DataFrames
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# From Pandas
|
||||
pd_df = pd.DataFrame({"col": [1, 2, 3]})
|
||||
pl_df = pl.from_pandas(pd_df)
|
||||
|
||||
# To Pandas
|
||||
pd_df = pl_df.to_pandas()
|
||||
|
||||
# Conversion via Arrow is zero-copy where the memory layout allows
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
|
||||
```
|
||||
|
||||
### Lists of Rows
|
||||
|
||||
```python
|
||||
# From list of dicts
|
||||
data = [
|
||||
{"name": "Alice", "age": 25},
|
||||
{"name": "Bob", "age": 30}
|
||||
]
|
||||
df = pl.DataFrame(data)
|
||||
|
||||
# To list of dicts
|
||||
rows = df.to_dicts()
|
||||
|
||||
# From list of tuples
|
||||
data = [("Alice", 25), ("Bob", 30)]
|
||||
df = pl.DataFrame(data, schema=["name", "age"])
|
||||
```
|
||||
|
||||
## Streaming Large Files
|
||||
|
||||
For datasets larger than memory, use lazy mode with streaming:
|
||||
|
||||
```python
|
||||
# Streaming mode
|
||||
lf = pl.scan_csv("very_large.csv")
|
||||
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
|
||||
|
||||
# Streaming with multiple files
|
||||
lf = pl.scan_parquet("data/*.parquet")
|
||||
result = lf.group_by("category").agg(pl.col("value").sum()).collect(streaming=True)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Format Selection
|
||||
|
||||
**Use Parquet when:**
|
||||
- Need compression (up to 10x smaller than CSV)
|
||||
- Want fast reads/writes
|
||||
- Need to preserve data types
|
||||
- Working with large datasets
|
||||
- Need predicate pushdown
|
||||
|
||||
**Use CSV when:**
|
||||
- Need human-readable format
|
||||
- Interfacing with legacy systems
|
||||
- Data is small
|
||||
- Need universal compatibility
|
||||
|
||||
**Use JSON when:**
|
||||
- Working with nested/hierarchical data
|
||||
- Need web API compatibility
|
||||
- Data has flexible schema
|
||||
|
||||
**Use Arrow IPC when:**
- Need zero-copy data sharing
- Fastest serialization required
- Working between Arrow-compatible systems
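
A short sketch writing the same frame in each of these formats (file names are placeholders):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

df.write_parquet("data.parquet", compression="zstd")  # compact, typed, fast scans
df.write_csv("data.csv")                              # human-readable, no type info
df.write_ndjson("data.ndjson")                        # newline-delimited JSON
df.write_ipc("data.arrow")                            # Arrow IPC for fast interchange
```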
|
||||
|
||||
### Reading Large Files
|
||||
|
||||
```python
|
||||
# 1. Always use lazy mode
|
||||
lf = pl.scan_csv("large.csv") # NOT read_csv
|
||||
|
||||
# 2. Filter and select early (pushdown optimization)
|
||||
result = (
|
||||
lf
|
||||
.select("col1", "col2", "col3") # Only needed columns
|
||||
.filter(pl.col("date") > "2023-01-01") # Filter early
|
||||
.collect()
|
||||
)
|
||||
|
||||
# 3. Use streaming for very large data
|
||||
result = lf.filter(...).select(...).collect(streaming=True)
|
||||
|
||||
# 4. Read only needed rows during development
|
||||
df = pl.read_csv("large.csv", n_rows=10000) # Sample for testing
|
||||
```
|
||||
|
||||
### Writing Large Files
|
||||
|
||||
```python
|
||||
# 1. Use Parquet with compression
|
||||
df.write_parquet("output.parquet", compression="zstd")
|
||||
|
||||
# 2. Use partitioning for very large datasets
|
||||
df.write_parquet("output", partition_by=["year", "month"])
|
||||
|
||||
# 3. Write streaming
|
||||
lf = pl.scan_csv("input.csv")
|
||||
lf.sink_parquet("output.parquet") # Streaming write
|
||||
```
|
||||
|
||||
### Performance Tips
|
||||
|
||||
```python
|
||||
# 1. Specify dtypes when reading CSV
|
||||
df = pl.read_csv(
|
||||
"data.csv",
|
||||
dtypes={"id": pl.Int64, "name": pl.Utf8} # Avoids inference
|
||||
)
|
||||
|
||||
# 2. Use appropriate compression
|
||||
df.write_parquet("output.parquet", compression="snappy") # Fast
|
||||
df.write_parquet("output.parquet", compression="zstd") # Better compression
|
||||
|
||||
# 3. Parallel column reads for Parquet
df = pl.read_parquet("data.parquet", parallel="auto")
|
||||
|
||||
# 4. Read multiple files in parallel
|
||||
lf = pl.scan_parquet("data/*.parquet") # Automatic parallel read
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
try:
|
||||
df = pl.read_csv("data.csv")
|
||||
except pl.exceptions.ComputeError as e:
|
||||
print(f"Error reading CSV: {e}")
|
||||
|
||||
# Ignore errors during parsing
|
||||
df = pl.read_csv("messy.csv", ignore_errors=True)
|
||||
|
||||
# Handle missing files
|
||||
from pathlib import Path
|
||||
if Path("data.csv").exists():
|
||||
df = pl.read_csv("data.csv")
|
||||
else:
|
||||
print("File not found")
|
||||
```
|
||||
|
||||
## Schema Management
|
||||
|
||||
```python
|
||||
# Infer schema from sample
|
||||
schema = pl.read_csv("data.csv", n_rows=1000).schema
|
||||
|
||||
# Use inferred schema for full read
|
||||
df = pl.read_csv("data.csv", dtypes=schema)
|
||||
|
||||
# Define schema explicitly
|
||||
schema = {
|
||||
"id": pl.Int64,
|
||||
"name": pl.Utf8,
|
||||
"date": pl.Date,
|
||||
"value": pl.Float64
|
||||
}
|
||||
df = pl.read_csv("data.csv", dtypes=schema)
|
||||
```
|
||||
skills/polars/references/operations.md (new file, 602 lines)
# Polars Operations Reference
|
||||
|
||||
This reference covers all common Polars operations with comprehensive examples.
|
||||
|
||||
## Selection Operations
|
||||
|
||||
### Select Columns
|
||||
|
||||
**Basic selection:**
|
||||
```python
|
||||
# Select specific columns
|
||||
df.select("name", "age", "city")
|
||||
|
||||
# Using expressions
|
||||
df.select(pl.col("name"), pl.col("age"))
|
||||
```
|
||||
|
||||
**Pattern-based selection:**
|
||||
```python
|
||||
# All columns starting with "sales_"
|
||||
df.select(pl.col("^sales_.*$"))
|
||||
|
||||
# All numeric columns
|
||||
df.select(pl.col(pl.NUMERIC_DTYPES))
|
||||
|
||||
# All columns except specific ones
|
||||
df.select(pl.all().exclude("id", "timestamp"))
|
||||
```
|
||||
|
||||
**Computed columns:**
|
||||
```python
|
||||
df.select(
|
||||
"name",
|
||||
(pl.col("age") * 12).alias("age_in_months"),
|
||||
(pl.col("salary") * 1.1).alias("salary_after_raise")
|
||||
)
|
||||
```
|
||||
|
||||
### With Columns (Add/Modify)
|
||||
|
||||
Add new columns or modify existing ones while preserving all other columns:
|
||||
|
||||
```python
|
||||
# Add new columns
|
||||
df.with_columns(
|
||||
age_doubled=pl.col("age") * 2,
|
||||
full_name=pl.col("first_name") + " " + pl.col("last_name")
|
||||
)
|
||||
|
||||
# Modify existing columns
|
||||
df.with_columns(
|
||||
pl.col("name").str.to_uppercase().alias("name"),
|
||||
pl.col("salary").cast(pl.Float64).alias("salary")
|
||||
)
|
||||
|
||||
# Multiple operations in parallel
|
||||
df.with_columns(
|
||||
pl.col("value") * 10,
|
||||
pl.col("value") * 100,
|
||||
pl.col("value") * 1000,
|
||||
)
|
||||
```
|
||||
|
||||
## Filtering Operations
|
||||
|
||||
### Basic Filtering
|
||||
|
||||
```python
|
||||
# Single condition
|
||||
df.filter(pl.col("age") > 25)
|
||||
|
||||
# Multiple conditions (AND)
|
||||
df.filter(
|
||||
pl.col("age") > 25,
|
||||
pl.col("city") == "NY"
|
||||
)
|
||||
|
||||
# OR conditions
|
||||
df.filter(
|
||||
(pl.col("age") > 30) | (pl.col("salary") > 100000)
|
||||
)
|
||||
|
||||
# NOT condition
|
||||
df.filter(~pl.col("active"))
|
||||
df.filter(pl.col("city") != "NY")
|
||||
```
|
||||
|
||||
### Advanced Filtering
|
||||
|
||||
**String operations:**
|
||||
```python
|
||||
# Contains substring
|
||||
df.filter(pl.col("name").str.contains("John"))
|
||||
|
||||
# Starts with
|
||||
df.filter(pl.col("email").str.starts_with("admin"))
|
||||
|
||||
# Regex match
|
||||
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{3}-\d{4}$"))
|
||||
```
|
||||
|
||||
**Membership checks:**
|
||||
```python
|
||||
# In list
|
||||
df.filter(pl.col("city").is_in(["NY", "LA", "SF"]))
|
||||
|
||||
# Not in list
|
||||
df.filter(~pl.col("status").is_in(["inactive", "deleted"]))
|
||||
```
|
||||
|
||||
**Range filters:**
|
||||
```python
|
||||
# Between values
|
||||
df.filter(pl.col("age").is_between(25, 35))
|
||||
|
||||
# Date range
|
||||
df.filter(
|
||||
pl.col("date") >= pl.date(2023, 1, 1),
|
||||
pl.col("date") <= pl.date(2023, 12, 31)
|
||||
)
|
||||
```
|
||||
|
||||
**Null filtering:**
|
||||
```python
|
||||
# Filter out nulls
|
||||
df.filter(pl.col("value").is_not_null())
|
||||
|
||||
# Keep only nulls
|
||||
df.filter(pl.col("value").is_null())
|
||||
```
|
||||
|
||||
## Grouping and Aggregation
|
||||
|
||||
### Basic Group By
|
||||
|
||||
```python
|
||||
# Group by single column
|
||||
df.group_by("department").agg(
|
||||
pl.col("salary").mean().alias("avg_salary"),
|
||||
pl.len().alias("employee_count")
|
||||
)
|
||||
|
||||
# Group by multiple columns
|
||||
df.group_by("department", "location").agg(
|
||||
pl.col("salary").sum()
|
||||
)
|
||||
|
||||
# Maintain order
|
||||
df.group_by("category", maintain_order=True).agg(
|
||||
pl.col("value").sum()
|
||||
)
|
||||
```
|
||||
|
||||
### Aggregation Functions
|
||||
|
||||
**Count and length:**
|
||||
```python
|
||||
df.group_by("category").agg(
|
||||
pl.len().alias("count"),
|
||||
pl.col("id").count().alias("non_null_count"),
|
||||
pl.col("id").n_unique().alias("unique_count")
|
||||
)
|
||||
```
|
||||
|
||||
**Statistical aggregations:**
|
||||
```python
|
||||
df.group_by("group").agg(
|
||||
pl.col("value").sum().alias("total"),
|
||||
pl.col("value").mean().alias("average"),
|
||||
pl.col("value").median().alias("median"),
|
||||
pl.col("value").std().alias("std_dev"),
|
||||
pl.col("value").var().alias("variance"),
|
||||
pl.col("value").min().alias("minimum"),
|
||||
pl.col("value").max().alias("maximum"),
|
||||
pl.col("value").quantile(0.95).alias("p95")
|
||||
)
|
||||
```
|
||||
|
||||
**First and last:**
|
||||
```python
|
||||
df.group_by("user_id").agg(
|
||||
pl.col("timestamp").first().alias("first_seen"),
|
||||
pl.col("timestamp").last().alias("last_seen"),
|
||||
pl.col("event").first().alias("first_event")
|
||||
)
|
||||
```
|
||||
|
||||
**List aggregation:**
|
||||
```python
|
||||
# Collect values into lists
|
||||
df.group_by("category").agg(
|
||||
pl.col("item").alias("all_items") # Creates list column
|
||||
)
|
||||
```
|
||||
|
||||
### Conditional Aggregations
|
||||
|
||||
Filter within aggregations:
|
||||
|
||||
```python
|
||||
df.group_by("department").agg(
|
||||
# Count high earners
|
||||
(pl.col("salary") > 100000).sum().alias("high_earners"),
|
||||
|
||||
# Average of filtered values
|
||||
pl.col("salary").filter(pl.col("bonus") > 0).mean().alias("avg_with_bonus"),
|
||||
|
||||
# Conditional sum
|
||||
pl.when(pl.col("active"))
|
||||
.then(pl.col("sales"))
|
||||
.otherwise(0)
|
||||
.sum()
|
||||
.alias("active_sales")
|
||||
)
|
||||
```
|
||||
|
||||
### Multiple Aggregations
|
||||
|
||||
Combine multiple aggregations efficiently:
|
||||
|
||||
```python
|
||||
df.group_by("store_id").agg(
|
||||
pl.col("transaction_id").count().alias("num_transactions"),
|
||||
pl.col("amount").sum().alias("total_sales"),
|
||||
pl.col("amount").mean().alias("avg_transaction"),
|
||||
pl.col("customer_id").n_unique().alias("unique_customers"),
|
||||
pl.col("amount").max().alias("largest_transaction"),
|
||||
pl.col("timestamp").min().alias("first_transaction_date"),
|
||||
pl.col("timestamp").max().alias("last_transaction_date")
|
||||
)
|
||||
```
|
||||
|
||||
## Window Functions
|
||||
|
||||
Window functions apply aggregations while preserving the original row count.
|
||||
|
||||
### Basic Window Operations
|
||||
|
||||
**Group statistics:**
|
||||
```python
|
||||
# Add group mean to each row
|
||||
df.with_columns(
|
||||
avg_age_by_dept=pl.col("age").mean().over("department")
|
||||
)
|
||||
|
||||
# Multiple group columns
|
||||
df.with_columns(
|
||||
group_avg=pl.col("value").mean().over("category", "region")
|
||||
)
|
||||
```
|
||||
|
||||
**Ranking:**
|
||||
```python
|
||||
df.with_columns(
|
||||
# Rank within groups
|
||||
rank=pl.col("score").rank().over("team"),
|
||||
|
||||
# Dense rank (no gaps)
|
||||
dense_rank=pl.col("score").rank(method="dense").over("team"),
|
||||
|
||||
# Row number
|
||||
row_num=pl.col("timestamp").sort().rank(method="ordinal").over("user_id")
|
||||
)
|
||||
```
|
||||
|
||||
### Window Mapping Strategies
|
||||
|
||||
**group_to_rows (default):**
|
||||
Preserves original row order:
|
||||
```python
|
||||
df.with_columns(
|
||||
group_mean=pl.col("value").mean().over("category", mapping_strategy="group_to_rows")
|
||||
)
|
||||
```
|
||||
|
||||
**explode:**
|
||||
Faster, groups rows together:
|
||||
```python
|
||||
df.with_columns(
|
||||
group_mean=pl.col("value").mean().over("category", mapping_strategy="explode")
|
||||
)
|
||||
```
|
||||
|
||||
**join:**
|
||||
Creates list columns:
|
||||
```python
|
||||
df.with_columns(
|
||||
group_values=pl.col("value").over("category", mapping_strategy="join")
|
||||
)
|
||||
```
|
||||
|
||||
### Rolling Windows
|
||||
|
||||
**Time-based rolling:**
|
||||
```python
|
||||
df.with_columns(
|
||||
rolling_avg=pl.col("value").rolling_mean(
|
||||
window_size="7d",
|
||||
by="date"
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
**Row-based rolling:**
|
||||
```python
|
||||
df.with_columns(
|
||||
rolling_sum=pl.col("value").rolling_sum(window_size=3),
|
||||
rolling_max=pl.col("value").rolling_max(window_size=5)
|
||||
)
|
||||
```
|
||||
|
||||
### Cumulative Operations
|
||||
|
||||
```python
|
||||
df.with_columns(
|
||||
cumsum=pl.col("value").cum_sum().over("group"),
|
||||
cummax=pl.col("value").cum_max().over("group"),
|
||||
cummin=pl.col("value").cum_min().over("group"),
|
||||
cumprod=pl.col("value").cum_prod().over("group")
|
||||
)
|
||||
```
|
||||
|
||||
### Shift and Lag/Lead
|
||||
|
||||
```python
|
||||
df.with_columns(
|
||||
# Previous value (lag)
|
||||
prev_value=pl.col("value").shift(1).over("user_id"),
|
||||
|
||||
# Next value (lead)
|
||||
next_value=pl.col("value").shift(-1).over("user_id"),
|
||||
|
||||
# Calculate difference from previous
|
||||
diff=pl.col("value") - pl.col("value").shift(1).over("user_id")
|
||||
)
|
||||
```
|
||||
|
||||
## Sorting
|
||||
|
||||
### Basic Sorting
|
||||
|
||||
```python
|
||||
# Sort by single column
|
||||
df.sort("age")
|
||||
|
||||
# Sort descending
|
||||
df.sort("age", descending=True)
|
||||
|
||||
# Sort by multiple columns
|
||||
df.sort("department", "age")
|
||||
|
||||
# Mixed sorting order
|
||||
df.sort(["department", "salary"], descending=[False, True])
|
||||
```
|
||||
|
||||
### Advanced Sorting
|
||||
|
||||
**Null handling:**
|
||||
```python
|
||||
# Nulls first
|
||||
df.sort("value", nulls_last=False)
|
||||
|
||||
# Nulls last
|
||||
df.sort("value", nulls_last=True)
|
||||
```
|
||||
|
||||
**Sort by expression:**
|
||||
```python
|
||||
# Sort by computed value
|
||||
df.sort(pl.col("first_name").str.len())
|
||||
|
||||
# Sort by multiple expressions
|
||||
df.sort(
|
||||
pl.col("last_name").str.to_lowercase(),
|
||||
pl.col("age").abs()
|
||||
)
|
||||
```
|
||||
|
||||
## Conditional Operations
|
||||
|
||||
### When/Then/Otherwise
|
||||
|
||||
```python
|
||||
# Basic conditional (wrap string literals in pl.lit; bare strings are read as column names)
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("minor"))
)

# Multiple conditions
df.with_columns(
    category=pl.when(pl.col("score") >= 90)
    .then(pl.lit("A"))
    .when(pl.col("score") >= 80)
    .then(pl.lit("B"))
    .when(pl.col("score") >= 70)
    .then(pl.lit("C"))
    .otherwise(pl.lit("F"))
)

# Conditional computation
df.with_columns(
    adjusted_price=pl.when(pl.col("is_member"))
    .then(pl.col("price") * 0.9)
    .otherwise(pl.col("price"))
)
|
||||
```
|
||||
|
||||
## String Operations
|
||||
|
||||
### Common String Methods
|
||||
|
||||
```python
|
||||
df.with_columns(
|
||||
# Case conversion
|
||||
upper=pl.col("name").str.to_uppercase(),
|
||||
lower=pl.col("name").str.to_lowercase(),
|
||||
title=pl.col("name").str.to_titlecase(),
|
||||
|
||||
# Trimming
|
||||
trimmed=pl.col("text").str.strip_chars(),
|
||||
|
||||
# Substring
|
||||
first_3=pl.col("name").str.slice(0, 3),
|
||||
|
||||
# Replace
|
||||
cleaned=pl.col("text").str.replace("old", "new"),
|
||||
cleaned_all=pl.col("text").str.replace_all("old", "new"),
|
||||
|
||||
# Split
|
||||
parts=pl.col("full_name").str.split(" "),
|
||||
|
||||
# Length
|
||||
name_length=pl.col("name").str.len_chars()
|
||||
)
|
||||
```
|
||||
|
||||
### String Filtering
|
||||
|
||||
```python
|
||||
# Contains
|
||||
df.filter(pl.col("email").str.contains("@gmail.com"))
|
||||
|
||||
# Starts/ends with
|
||||
df.filter(pl.col("name").str.starts_with("A"))
|
||||
df.filter(pl.col("file").str.ends_with(".csv"))
|
||||
|
||||
# Regex matching
|
||||
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{4}$"))
|
||||
```
|
||||
|
||||
## Date and Time Operations
|
||||
|
||||
### Date Parsing
|
||||
|
||||
```python
|
||||
# Parse strings to dates
|
||||
df.with_columns(
|
||||
date=pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d"),
|
||||
datetime=pl.col("dt_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
|
||||
)
|
||||
```
|
||||
|
||||
### Date Components
|
||||
|
||||
```python
|
||||
df.with_columns(
|
||||
year=pl.col("date").dt.year(),
|
||||
month=pl.col("date").dt.month(),
|
||||
day=pl.col("date").dt.day(),
|
||||
weekday=pl.col("date").dt.weekday(),
|
||||
hour=pl.col("datetime").dt.hour(),
|
||||
minute=pl.col("datetime").dt.minute()
|
||||
)
|
||||
```
|
||||
|
||||
### Date Arithmetic
|
||||
|
||||
```python
|
||||
# Add fixed durations and calendar offsets
df.with_columns(
    next_week=pl.col("date") + pl.duration(weeks=1),
    next_month=pl.col("date").dt.offset_by("1mo")  # pl.duration has no calendar months
)
|
||||
|
||||
# Difference between dates
|
||||
df.with_columns(
|
||||
days_diff=(pl.col("end_date") - pl.col("start_date")).dt.total_days()
|
||||
)
|
||||
```
|
||||
|
||||
### Date Filtering
|
||||
|
||||
```python
|
||||
# Filter by date range
|
||||
df.filter(
|
||||
pl.col("date").is_between(pl.date(2023, 1, 1), pl.date(2023, 12, 31))
|
||||
)
|
||||
|
||||
# Filter by year
|
||||
df.filter(pl.col("date").dt.year() == 2023)
|
||||
|
||||
# Filter by month
|
||||
df.filter(pl.col("date").dt.month().is_in([6, 7, 8])) # Summer months
|
||||
```
|
||||
|
||||

## List Operations

### Working with List Columns

```python
# Create list column from several columns
df.with_columns(
    items_list=pl.concat_list(["item1", "item2", "item3"])
)

# List operations
df.with_columns(
    list_len=pl.col("items").list.len(),
    first_item=pl.col("items").list.first(),
    last_item=pl.col("items").list.last(),
    unique_items=pl.col("items").list.unique(),
    sorted_items=pl.col("items").list.sort()
)

# Explode lists to rows
df.explode("items")

# Filter list elements (keep only elements greater than 10)
df.with_columns(
    filtered=pl.col("items").list.eval(pl.element().filter(pl.element() > 10))
)
```

## Struct Operations

### Working with Nested Structures

```python
# Create struct column
df.with_columns(
    address=pl.struct(["street", "city", "zip"])
)

# Access struct fields
df.with_columns(
    city=pl.col("address").struct.field("city")
)

# Unnest struct to columns
df.unnest("address")
```

## Unique and Duplicate Operations

```python
# Get unique rows
df.unique()

# Unique on specific columns
df.unique(subset=["name", "email"])

# Keep first/last duplicate
df.unique(subset=["id"], keep="first")
df.unique(subset=["id"], keep="last")

# Identify duplicates
df.with_columns(
    is_duplicate=pl.col("id").is_duplicated()
)

# Count duplicates
df.group_by("email").agg(
    pl.len().alias("count")
).filter(pl.col("count") > 1)
```

## Sampling

```python
# Random sample
df.sample(n=100)

# Sample fraction
df.sample(fraction=0.1)

# Sample with seed for reproducibility
df.sample(n=100, seed=42)
```

## Column Renaming

```python
# Rename specific columns
df.rename({"old_name": "new_name", "age": "years"})

# Rename with expression
df.select(pl.col("*").name.suffix("_renamed"))
df.select(pl.col("*").name.prefix("data_"))
df.select(pl.col("*").name.to_uppercase())
```

417
skills/polars/references/pandas_migration.md
Normal file
@@ -0,0 +1,417 @@

# Pandas to Polars Migration Guide

This guide helps you migrate from pandas to Polars with comprehensive operation mappings and key differences.

## Core Conceptual Differences

### 1. No Index System

**Pandas:** Uses a label-based row index
```python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")
```

**Polars:** Uses integer positions only
```python
df[0, "column"]  # Row position, column name
df[0:5]  # Row slice
# No set_index equivalent - use group_by instead
```

### 2. Memory Format

**Pandas:** NumPy-backed blocks (grouped by dtype)
**Polars:** Columnar Apache Arrow format

**Implications:**
- Polars is faster for column operations
- Polars uses less memory
- Polars has better data sharing capabilities
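
Because Polars data is already Arrow-backed, moving it in and out of the Arrow ecosystem is cheap. A minimal sketch (assumes `pyarrow` is installed; the frame is illustrative):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Export to an Arrow table; existing Arrow buffers are reused where possible
arrow_table = df.to_arrow()
print(arrow_table.schema)

# Round-trip back into Polars
df2 = pl.from_arrow(arrow_table)
print(df2.equals(df))  # True
```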

### 3. Parallelization

**Pandas:** Primarily single-threaded (requires Dask for parallelism)
**Polars:** Parallel by default using Rust's concurrency

### 4. Lazy Evaluation

**Pandas:** Only eager evaluation
**Polars:** Both eager (DataFrame) and lazy (LazyFrame) with query optimization
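
Any eager DataFrame can be switched into lazy mode with `.lazy()`, and the optimized plan can be inspected before anything runs. A small sketch (column names are illustrative):

```python
import polars as pl

df = pl.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})

query = (
    df.lazy()                     # switch to lazy mode
    .filter(pl.col("value") > 1)  # the optimizer can push this down
    .group_by("category")
    .agg(pl.col("value").sum())
)

print(query.explain())    # inspect the optimized plan
result = query.collect()  # nothing executes until collect()
```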

### 5. Type Strictness

**Pandas:** Allows silent type conversions
**Polars:** Strict typing, explicit casts required

**Example:**
```python
# Pandas: Silently converts to float
pd_df["int_col"] = [1, 2, None, 4]  # dtype: float64

# Polars: Keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]})  # dtype: Int64
```

## Operation Mappings

### Data Selection

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |

### Data Filtering

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |

### Adding/Modifying Columns

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |

**Important difference - Parallel execution:**

```python
# Pandas: Sequential (lambda sees previous results)
df.assign(
    a=lambda df_: df_.value * 10,
    b=lambda df_: df_.value * 100
)

# Polars: Parallel (all computed together)
df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100
)
```

### Grouping and Aggregation

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |

### Window Functions

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |
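
A worked example of the `.over()` equivalents above, on an illustrative frame (column names are assumptions):

```python
import polars as pl

df = pl.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [10, 20, 5, 15],
})

df = df.with_columns(
    store_mean=pl.col("sales").mean().over("store"),        # groupby(...).transform("mean")
    store_rank=pl.col("sales").rank().over("store"),        # groupby(...)["sales"].rank()
    prev_sales=pl.col("sales").shift(1).over("store"),      # groupby(...)["sales"].shift(1)
    running_total=pl.col("sales").cum_sum().over("store"),  # groupby(...)["sales"].cumsum()
)
```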

### Joins

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |

### Concatenation

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |

### Sorting

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |

### Reshaping

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(on="b", index="a", values="c")` |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |

### I/O Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |

### String Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |

### Datetime Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |

### Missing Data

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` | `df.select(pl.col("col").fill_null(strategy="forward"))` |
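
A short sketch combining the calls above (the frame is illustrative):

```python
import polars as pl

df = pl.DataFrame({"day": [1, 2, 3, 4], "temp": [20.5, None, None, 23.0]})

cleaned = df.with_columns(
    temp_filled=pl.col("temp").fill_null(0),                  # fillna(0)
    temp_ffill=pl.col("temp").fill_null(strategy="forward"),  # ffill()
    temp_missing=pl.col("temp").is_null(),                    # isna()
)
dropped = df.drop_nulls()  # dropna()
```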

### Other Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |

## Common Migration Patterns

### Pattern 1: Chained Operations

**Pandas:**
```python
result = (df
    .assign(new_col=lambda x: x["old_col"] * 2)
    .query("new_col > 10")
    .groupby("category")
    .agg({"value": "sum"})
    .reset_index()
)
```

**Polars:**
```python
result = (df
    .with_columns(new_col=pl.col("old_col") * 2)
    .filter(pl.col("new_col") > 10)
    .group_by("category")
    .agg(pl.col("value").sum())
)
# No reset_index needed - Polars doesn't have an index
```

### Pattern 2: Apply Functions

**Pandas:**
```python
# Avoid this pattern in Polars - it breaks parallelization
df["result"] = df["value"].apply(lambda x: x * 2)
```

**Polars:**
```python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)

# If a custom function is truly needed
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
```

### Pattern 3: Conditional Column Creation

**Pandas:**
```python
df["category"] = np.where(
    df["value"] > 100,
    "high",
    np.where(df["value"] > 50, "medium", "low")
)
```

**Polars:**
```python
df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
    .then(pl.lit("high"))
    .when(pl.col("value") > 50)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
)
```

### Pattern 4: Group Transform

**Pandas:**
```python
df["group_mean"] = df.groupby("category")["value"].transform("mean")
```

**Polars:**
```python
df = df.with_columns(
    group_mean=pl.col("value").mean().over("category")
)
```

### Pattern 5: Multiple Aggregations

**Pandas:**
```python
result = df.groupby("category").agg({
    "value": ["mean", "sum", "count"],
    "price": ["min", "max"]
})
```

**Polars:**
```python
result = df.group_by("category").agg(
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count"),
    pl.col("price").min().alias("price_min"),
    pl.col("price").max().alias("price_max")
)
```

## Performance Anti-Patterns to Avoid

### Anti-Pattern 1: Sequential Pipe Operations

**Bad (disables parallelization):**
```python
df = df.pipe(function1).pipe(function2).pipe(function3)
```

**Good (enables parallelization):**
```python
# Combine the work into one with_columns call so the expressions run in parallel
df = df.with_columns(
    function1_result(),
    function2_result(),
    function3_result()
)
```

### Anti-Pattern 2: Python Functions in Hot Paths

**Bad:**
```python
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2)
)
```

**Good:**
```python
df = df.with_columns(result=pl.col("value") * 2)
```

### Anti-Pattern 3: Using Eager Reading for Large Files

**Bad:**
```python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
```

**Good:**
```python
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```

### Anti-Pattern 4: Row Iteration

**Bad:**
```python
for row in df.iter_rows():
    # Process row
    pass
```

**Good:**
```python
# Use vectorized expressions instead of looping over rows
df = df.with_columns(result=pl.col("value") * 2)
```

## Migration Checklist

When migrating from pandas to Polars (a before/after sketch follows the list):

1. **Remove index operations** - Use integer positions or group_by
2. **Replace apply/map with expressions** - Use Polars native operations
3. **Update column assignment** - Use `with_columns()` instead of direct assignment
4. **Change groupby.transform to .over()** - Window functions work differently
5. **Update string operations** - Use `.str.to_uppercase()` instead of `.str.upper()`
6. **Add explicit type casts** - Polars won't silently convert types
7. **Consider lazy evaluation** - Use `scan_*` instead of `read_*` for large data
8. **Update aggregation syntax** - More explicit in Polars
9. **Remove reset_index calls** - Not needed in Polars
10. **Update conditional logic** - Use `when().then().otherwise()` pattern
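
Putting several of these items together, a hypothetical pandas snippet and its Polars equivalent (the file name and column names are illustrative, not from the original guide):

```python
import polars as pl

# pandas (for reference):
# df = pd.read_csv("scores.csv")
# df["flag"] = np.where(df["score"] > 0.5, "hi", "lo")
# out = df[df["region"] == "EU"].groupby("flag")["score"].mean().reset_index()

# Polars: lazy scan, with_columns + when/then, filter, group_by/agg, no reset_index
out = (
    pl.scan_csv("scores.csv")
    .with_columns(
        flag=pl.when(pl.col("score") > 0.5)
        .then(pl.lit("hi"))
        .otherwise(pl.lit("lo"))
    )
    .filter(pl.col("region") == "EU")
    .group_by("flag")
    .agg(pl.col("score").mean())
    .collect()
)
```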

## Compatibility Layer

For gradual migration, you can use both libraries:

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)

# Convert Polars to pandas
pd_df = pl_df.to_pandas()

# Use Arrow as the interchange format (zero-copy when possible)
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()
```

## When to Stick with Pandas

Consider staying with pandas when:
- Working with time series requiring complex index operations
- Need extensive ecosystem support (some libraries only support pandas)
- Team lacks Rust/Polars expertise
- Data is small and performance isn't critical
- Using advanced pandas features without Polars equivalents

## When to Switch to Polars

Switch to Polars when:
- Performance is critical
- Working with large datasets (>1GB)
- Need lazy evaluation and query optimization
- Want better type safety
- Need parallel execution by default
- Starting a new project

549
skills/polars/references/transformations.md
Normal file
@@ -0,0 +1,549 @@

# Polars Data Transformations

Comprehensive guide to joins, concatenation, and reshaping operations in Polars.

## Joins

Joins combine data from multiple DataFrames based on common columns.

### Basic Join Types

**Inner Join (intersection):**
```python
# Keep only matching rows from both DataFrames
result = df1.join(df2, on="id", how="inner")
```

**Left Join (all left + matches from right):**
```python
# Keep all rows from left, add matching rows from right
result = df1.join(df2, on="id", how="left")
```

**Full (Outer) Join (union):**
```python
# Keep all rows from both DataFrames ("full" is the outer join)
result = df1.join(df2, on="id", how="full")
```

**Cross Join (Cartesian product):**
```python
# Every row from left with every row from right
result = df1.join(df2, how="cross")
```

**Semi Join (filtered left):**
```python
# Keep only left rows that have a match in right
result = df1.join(df2, on="id", how="semi")
```

**Anti Join (non-matching left):**
```python
# Keep only left rows that DON'T have a match in right
result = df1.join(df2, on="id", how="anti")
```
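
To make the semi/anti distinction concrete, a small sketch with illustrative frames:

```python
import polars as pl

users = pl.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = pl.DataFrame({"id": [1, 1, 3], "amount": [10, 20, 5]})

# Semi join: users with at least one order; no columns from `orders` are added
active = users.join(orders, on="id", how="semi")    # ids 1 and 3, one row each

# Anti join: users with no orders at all
inactive = users.join(orders, on="id", how="anti")  # id 2
```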

### Join Syntax Variations

**Single column join:**
```python
df1.join(df2, on="id")
```

**Multiple columns join:**
```python
df1.join(df2, on=["id", "date"])
```

**Different column names:**
```python
df1.join(df2, left_on="user_id", right_on="id")
```

**Multiple different columns:**
```python
df1.join(
    df2,
    left_on=["user_id", "date"],
    right_on=["id", "timestamp"]
)
```

### Suffix Handling

When both DataFrames have columns with the same name (other than join keys):

```python
# Add a suffix to distinguish the right-hand columns
result = df1.join(df2, on="id", suffix="_right")

# Results in: value, value_right (if both had a "value" column)
```

### Join Examples

**Example 1: Customer Orders**
```python
customers = pl.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "David"]
})

orders = pl.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 1],
    "amount": [100, 200, 150]
})

# Inner join - only customers with orders
result = customers.join(orders, on="customer_id", how="inner")

# Left join - all customers, even without orders
result = customers.join(orders, on="customer_id", how="left")
```

**Example 2: Time-series data**
```python
prices = pl.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "stock": ["AAPL", "AAPL", "AAPL"],
    "price": [150, 152, 151]
})

volumes = pl.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "stock": ["AAPL", "AAPL"],
    "volume": [1000000, 1100000]
})

result = prices.join(
    volumes,
    on=["date", "stock"],
    how="left"
)
```

### Asof Joins (Nearest Match)

For time-series data, join to the nearest timestamp:

```python
# Join each trade to the nearest earlier quote
# (both frames must be sorted by the key, and the key dtypes must match)
quotes = pl.DataFrame({
    "timestamp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "stock": ["A", "A", "A", "A", "A"],
    "quote": [100, 101, 102, 103, 104]
})

trades = pl.DataFrame({
    "timestamp": [1.5, 3.5, 4.2],
    "stock": ["A", "A", "A"],
    "trade": [50, 75, 100]
})

result = trades.join_asof(
    quotes,
    on="timestamp",
    by="stock",
    strategy="backward"  # or "forward", "nearest"
)
```

## Concatenation

Concatenation stacks DataFrames together.

### Vertical Concatenation (Stack Rows)

```python
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})

# Stack rows
result = pl.concat([df1, df2], how="vertical")
# Result: 4 rows, same columns
```

**Handling mismatched schemas:**
```python
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "c": [7, 8]})

# Diagonal concat - fills missing columns with nulls
result = pl.concat([df1, df2], how="diagonal")
# Result: columns a, b, c (with nulls where not present)
```

### Horizontal Concatenation (Stack Columns)

```python
df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"b": [4, 5, 6]})

# Stack columns
result = pl.concat([df1, df2], how="horizontal")
# Result: 3 rows, columns a and b
```

**Note:** Horizontal concat requires same number of rows.

### Concatenation Options

```python
# Rechunk after concatenation (better performance for subsequent operations)
result = pl.concat([df1, df2], rechunk=True)

# Parallel execution
result = pl.concat([df1, df2], parallel=True)
```

### Use Cases

**Combining data from multiple sources:**
```python
# Read multiple files and concatenate
files = ["data_2023.csv", "data_2024.csv", "data_2025.csv"]
dfs = [pl.read_csv(f) for f in files]
combined = pl.concat(dfs, how="vertical")
```

**Adding computed columns:**
```python
base = pl.DataFrame({"value": [1, 2, 3]})
computed = pl.DataFrame({"doubled": [2, 4, 6]})
result = pl.concat([base, computed], how="horizontal")
```

## Pivoting (Wide Format)

Convert unique values from one column into multiple columns.

### Basic Pivot

```python
df = pl.DataFrame({
    "date": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 120, 160]
})

# Pivot: products become columns
pivoted = df.pivot(
    values="sales",
    index="date",
    on="product"
)
# Result:
# date    | A   | B
# 2023-01 | 100 | 150
# 2023-02 | 120 | 160
```

### Pivot with Aggregation

When there are duplicate combinations, aggregate:

```python
df = pl.DataFrame({
    "date": ["2023-01", "2023-01", "2023-01"],
    "product": ["A", "A", "B"],
    "sales": [100, 110, 150]
})

# Aggregate duplicates
pivoted = df.pivot(
    values="sales",
    index="date",
    on="product",
    aggregate_function="sum"  # or "mean", "max", "min", etc.
)
```

### Multiple Index Columns

```python
df = pl.DataFrame({
    "region": ["North", "North", "South", "South"],
    "date": ["2023-01", "2023-01", "2023-01", "2023-01"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 120, 160]
})

pivoted = df.pivot(
    values="sales",
    index=["region", "date"],
    on="product"
)
```

## Unpivoting/Melting (Long Format)

Convert multiple columns into rows (opposite of pivot).

### Basic Unpivot

```python
df = pl.DataFrame({
    "date": ["2023-01", "2023-02"],
    "product_A": [100, 120],
    "product_B": [150, 160]
})

# Unpivot: convert columns to rows
unpivoted = df.unpivot(
    index="date",
    on=["product_A", "product_B"]
)
# Result:
# date    | variable  | value
# 2023-01 | product_A | 100
# 2023-01 | product_B | 150
# 2023-02 | product_A | 120
# 2023-02 | product_B | 160
```

### Custom Column Names

```python
unpivoted = df.unpivot(
    index="date",
    on=["product_A", "product_B"],
    variable_name="product",
    value_name="sales"
)
```

### Unpivot by Pattern

```python
# Unpivot all columns matching pattern
df = pl.DataFrame({
    "id": [1, 2],
    "sales_Q1": [100, 200],
    "sales_Q2": [150, 250],
    "sales_Q3": [120, 220],
    "revenue_Q1": [1000, 2000]
})

# Unpivot all sales columns
unpivoted = df.unpivot(
    index="id",
    on=pl.col("^sales_.*$")
)
```

## Exploding (Unnesting Lists)

Convert list columns into multiple rows.

### Basic Explode

```python
df = pl.DataFrame({
    "id": [1, 2],
    "values": [[1, 2, 3], [4, 5]]
})

# Explode list into rows
exploded = df.explode("values")
# Result:
# id | values
# 1  | 1
# 1  | 2
# 1  | 3
# 2  | 4
# 2  | 5
```

### Multiple Column Explode

```python
df = pl.DataFrame({
    "id": [1, 2],
    "letters": [["a", "b"], ["c", "d"]],
    "numbers": [[1, 2], [3, 4]]
})

# Explode multiple columns (must be same length)
exploded = df.explode("letters", "numbers")
```

## Transposing

Swap rows and columns:

```python
df = pl.DataFrame({
    "metric": ["sales", "costs", "profit"],
    "Q1": [100, 60, 40],
    "Q2": [150, 80, 70]
})

# Transpose
transposed = df.transpose(
    include_header=True,
    header_name="quarter",
    column_names="metric"
)
# Result: quarters as rows, metrics as columns
```

## Reshaping Patterns

### Pattern 1: Wide to Long to Wide

```python
# Start wide
wide = pl.DataFrame({
    "id": [1, 2],
    "A": [10, 20],
    "B": [30, 40]
})

# To long
long = wide.unpivot(index="id", on=["A", "B"])

# Back to wide (maybe with transformations)
wide_again = long.pivot(values="value", index="id", on="variable")
```

### Pattern 2: Nested to Flat

```python
# Nested data
df = pl.DataFrame({
    "user": [1, 2],
    "purchases": [
        [{"item": "A", "qty": 2}, {"item": "B", "qty": 1}],
        [{"item": "C", "qty": 3}]
    ]
})

# Explode and unnest
flat = (
    df.explode("purchases")
    .unnest("purchases")
)
```

### Pattern 3: Aggregation to Pivot

```python
# Raw data
sales = pl.DataFrame({
    "date": ["2023-01", "2023-01", "2023-02"],
    "product": ["A", "B", "A"],
    "sales": [100, 150, 120]
})

# Aggregate then pivot
result = (
    sales
    .group_by("date", "product")
    .agg(pl.col("sales").sum())
    .pivot(values="sales", index="date", on="product")
)
```

## Advanced Transformations

### Conditional Reshaping

```python
# Pivot only certain values
df.filter(pl.col("year") >= 2020).pivot(...)

# Unpivot with filtering
df.unpivot(index="id", on=pl.col("^sales.*$"))
```

### Multi-level Transformations

```python
# Complex reshaping pipeline
result = (
    df
    .unpivot(index="id", on=pl.col("^Q[0-9]_.*$"))
    .with_columns(
        quarter=pl.col("variable").str.extract(r"Q([0-9])", 1),
        metric=pl.col("variable").str.extract(r"Q[0-9]_(.*)", 1)
    )
    .drop("variable")
    .pivot(values="value", index=["id", "quarter"], on="metric")
)
```

## Performance Considerations

### Join Performance

```python
# 1. Sort the join keys when possible
df1_sorted = df1.sort("id")
df2_sorted = df2.sort("id")
result = df1_sorted.join(df2_sorted, on="id")

# 2. Use the appropriate join type
# semi/anti are faster than inner + filter
matches = df1.join(df2, on="id", how="semi")  # Better than filtering after an inner join

# 3. Filter before joining
df1_filtered = df1.filter(pl.col("active"))
result = df1_filtered.join(df2, on="id")  # Smaller join
```

### Concatenation Performance

```python
# 1. Rechunk after concatenation
result = pl.concat(dfs, rechunk=True)

# 2. Use lazy mode for large concatenations
lf1 = pl.scan_parquet("file1.parquet")
lf2 = pl.scan_parquet("file2.parquet")
result = pl.concat([lf1, lf2]).collect()
```

### Pivot Performance

```python
# 1. Filter before pivoting
pivoted = df.filter(pl.col("year") == 2023).pivot(...)

# 2. Specify the aggregate function explicitly
pivoted = df.pivot(..., aggregate_function="first")  # Faster than "sum" if only one value per cell
```

## Common Use Cases

### Time Series Alignment

```python
# Align two time series with different timestamps
ts1.join_asof(ts2, on="timestamp", strategy="backward")
```

### Feature Engineering

```python
# Create lag features
df.with_columns(
    pl.col("value").shift(1).over("user_id").alias("prev_value"),
    pl.col("value").shift(2).over("user_id").alias("prev_prev_value")
)
```

### Data Denormalization

```python
# Combine normalized tables
orders.join(customers, on="customer_id").join(products, on="product_id")
```

### Report Generation

```python
# Pivot for reporting
sales.pivot(values="amount", index="month", on="product")
```