Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/polars/SKILL.md Normal file

@@ -0,0 +1,381 @@
---
name: polars
description: "Fast DataFrame library (Apache Arrow). Select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, expression API, for high-performance data analysis workflows."
---
# Polars
## Overview
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Use its expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
## Quick Start
### Installation and Basic Usage
Install Polars:
```bash
uv pip install polars
```
Basic DataFrame creation and operations:
```python
import polars as pl
# Create DataFrame
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["NY", "LA", "SF"]
})
# Select columns
df.select("name", "age")
# Filter rows
df.filter(pl.col("age") > 25)
# Add computed columns
df.with_columns(
age_plus_10=pl.col("age") + 10
)
```
## Core Concepts
### Expressions
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
**Key principles:**
- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)
**Example:**
```python
# Expression-based computation
df.select(
pl.col("name"),
(pl.col("age") * 12).alias("age_in_months")
)
```
### Lazy vs Eager Evaluation
**Eager (DataFrame):** Operations execute immediately
```python
df = pl.read_csv("file.csv") # Reads immediately
result = df.filter(pl.col("age") > 25) # Executes immediately
```
**Lazy (LazyFrame):** Operations build a query plan, optimized before execution
```python
lf = pl.scan_csv("file.csv") # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect() # Now executes optimized query
```
**When to use lazy:**
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- Performance is critical
**Benefits of lazy evaluation** (visible in the optimized plan, sketched after this list):
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
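A minimal sketch of inspecting that plan, assuming the same hypothetical `file.csv` with `name` and `age` columns used above; `explain()` prints the optimized query so the pushdowns are visible:
```python
import polars as pl

# Build a lazy query; nothing is read yet
lf = pl.scan_csv("file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# The printed plan shows the filter (predicate pushdown) and the
# column selection (projection pushdown) folded into the CSV scan
print(query.explain())
```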
For detailed concepts, load `references/core_concepts.md`.
## Common Operations
### Select
Select and manipulate columns:
```python
# Select specific columns
df.select("name", "age")
# Select with expressions
df.select(
pl.col("name"),
(pl.col("age") * 2).alias("double_age")
)
# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```
### Filter
Filter rows by conditions:
```python
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (cleaner than using &)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# Complex conditions
df.filter(
(pl.col("age") > 25) | (pl.col("city") == "LA")
)
```
### With Columns
Add or modify columns while preserving existing ones:
```python
# Add new columns
df.with_columns(
age_plus_10=pl.col("age") + 10,
name_upper=pl.col("name").str.to_uppercase()
)
# Parallel computation (all columns computed in parallel)
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
)
```
### Group By and Aggregations
Group data and compute aggregations:
```python
# Basic grouping
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.len().alias("count")
)
# Multiple group keys
df.group_by("city", "department").agg(
pl.col("salary").sum()
)
# Conditional aggregations
df.group_by("city").agg(
(pl.col("age") > 30).sum().alias("over_30")
)
```
For detailed operation patterns, load `references/operations.md`.
## Aggregations and Window Functions
### Aggregation Functions
Common aggregations within the `group_by` context (combined in the example below):
- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values
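As a rough sketch, these can all be combined in a single `agg` call; the `sales` DataFrame and its `region`/`amount`/`date` columns here are illustrative:
```python
import polars as pl

sales = pl.DataFrame({
    "region": ["east", "east", "west"],
    "amount": [100, 250, 300],
    "date": ["2023-01-01", "2023-02-01", "2023-01-15"],
})

sales.group_by("region").agg(
    pl.len().alias("n_orders"),
    pl.col("amount").sum().alias("total"),
    pl.col("amount").mean().alias("avg"),
    pl.col("amount").min().alias("smallest"),
    pl.col("amount").max().alias("largest"),
    pl.first("date").alias("first_date"),
    pl.last("date").alias("last_date"),
)
```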
### Window Functions with `over()`
Apply aggregations while preserving row count:
```python
# Add group statistics to each row
df.with_columns(
avg_age_by_city=pl.col("age").mean().over("city"),
rank_in_city=pl.col("salary").rank().over("city")
)
# Multiple grouping columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
```
**Mapping strategies** (syntax shown in the sketch below):
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster but groups rows together
- `join`: Creates list columns
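A small sketch of the `mapping_strategy` parameter on hypothetical `category`/`value` columns; the default broadcasts group results back to the original rows, while `join` gathers each group's values into a list column:
```python
import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})

df.with_columns(
    # group_to_rows (default): one aggregated value broadcast per row
    group_mean=pl.col("value").mean().over("category"),
    # join: the group's values collected into a list on every row
    group_values=pl.col("value").over("category", mapping_strategy="join"),
)
```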
## Data I/O
### Supported Formats
Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
### Common I/O Operations
**CSV:**
```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")
# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```
**Parquet (recommended for performance):**
```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```
**JSON:**
```python
df = pl.read_json("file.json")
df.write_json("output.json")
```
For comprehensive I/O documentation, load `references/io_guide.md`.
## Transformations
### Joins
Combine DataFrames:
```python
# Inner join
df1.join(df2, on="id", how="inner")
# Left join
df1.join(df2, on="id", how="left")
# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```
### Concatenation
Stack DataFrames:
```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")
# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")
# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```
### Pivot and Unpivot
Reshape data:
```python
# Pivot (wide format)
df.pivot(values="sales", index="date", columns="product")
# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])
```
For detailed transformation examples, load `references/transformations.md`.
## Pandas Migration
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
### Conceptual Differences
- **No index**: Polars uses integer positions only
- **Strict typing**: No silent type conversions
- **Lazy evaluation**: Available via LazyFrame
- **Parallel by default**: Operations parallelized automatically
### Common Operation Mappings
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col").transform(...)` | `df.with_columns(pl.col("x").mean().over("col"))` |
### Key Syntax Patterns
**Pandas sequential (slow):**
```python
df.assign(
col_a=lambda df_: df_.value * 10,
col_b=lambda df_: df_.value * 100
)
```
**Polars parallel (fast):**
```python
df.with_columns(
col_a=pl.col("value") * 10,
col_b=pl.col("value") * 100,
)
```
For comprehensive migration guide, load `references/pandas_migration.md`.
## Best Practices
### Performance Optimization
1. **Use lazy evaluation for large datasets:**
```python
lf = pl.scan_csv("large.csv") # Don't use read_csv
result = lf.filter(...).select(...).collect()
```
2. **Avoid Python functions in hot paths:**
- Stay within expression API for parallelization
- Use `.map_elements()` only when necessary
- Prefer native Polars operations
3. **Use streaming for very large data:**
```python
lf.collect(streaming=True)
```
4. **Select only needed columns early:**
```python
# Good: Select columns early
lf.select("col1", "col2").filter(...)
# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
```
5. **Use appropriate data types** (see the sketch after this list):
- Categorical for low-cardinality strings
- Appropriate integer sizes (i32 vs i64)
- Date types for temporal data
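A minimal sketch of narrowing types; the DataFrame and column names here are illustrative:
```python
import polars as pl

df = pl.DataFrame({
    "status": ["open", "closed", "open"],
    "count": [1, 2, 3],
    "day": ["2023-01-01", "2023-01-02", "2023-01-03"],
})

df = df.with_columns(
    pl.col("status").cast(pl.Categorical),            # low-cardinality strings
    pl.col("count").cast(pl.Int32),                   # values fit in 32 bits
    pl.col("day").str.strptime(pl.Date, "%Y-%m-%d"),  # proper temporal type
)
```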
### Expression Patterns
**Conditional operations:**
```python
pl.when(condition).then(value).otherwise(other_value)
```
**Column operations across multiple columns:**
```python
df.select(pl.col("^.*_value$") * 2) # Regex pattern
```
**Null handling:**
```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```
For additional best practices and patterns, load `references/best_practices.md`.
## Resources
This skill includes comprehensive reference documentation:
### references/
- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and type system
- `operations.md` - Comprehensive guide to all common operations with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns
Load these references as needed when users require detailed information about specific topics.

skills/polars/references/best_practices.md Normal file

@@ -0,0 +1,649 @@
# Polars Best Practices and Performance Guide
Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.
## Performance Optimization
### 1. Use Lazy Evaluation
**Always prefer lazy mode for large datasets:**
```python
# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
**Benefits of lazy evaluation:**
- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning
### 2. Filter and Select Early
Push filters and column selection as early as possible in the pipeline:
```python
# Bad: Process all data, then filter and select
result = (
lf.group_by("category")
.agg(pl.col("value").mean())
.join(other, on="category")
.filter(pl.col("value") > 100)
.select("category", "value")
)
# Good: Filter and select early
result = (
lf.select("category", "value") # Only needed columns
.filter(pl.col("value") > 100) # Filter early
.group_by("category")
.agg(pl.col("value").mean())
.join(other.select("category", "other_col"), on="category")
)
```
### 3. Avoid Python Functions
Stay within the expression API to maintain parallelization:
```python
# Bad: Python function disables parallelization
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```
**When you must use custom functions:**
```python
# If truly needed, be explicit
df = df.with_columns(
result=pl.col("value").map_elements(
custom_function,
return_dtype=pl.Float64,
skip_nulls=True # Optimize null handling
)
)
```
### 4. Use Streaming for Very Large Data
Enable streaming for datasets larger than RAM:
```python
# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```
### 5. Optimize Data Types
Choose appropriate data types to reduce memory and improve performance:
```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")
# Good: Specify optimal types
df = pl.read_csv(
"data.csv",
dtypes={
"id": pl.UInt32, # Instead of Int64 if values fit
"category": pl.Categorical, # For low-cardinality strings
"date": pl.Date, # Instead of String
"small_int": pl.Int16, # Instead of Int64
}
)
```
**Type optimization guidelines** (a quick size check follows the list):
- Use smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags
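One way to sanity-check the effect, sketched with illustrative column names, is to compare `estimated_size()` before and after casting:
```python
import polars as pl

df = pl.DataFrame({
    "category": ["a", "b", "a"] * 1000,
    "flag": [0, 1, 1] * 1000,
})

slim = df.with_columns(
    pl.col("category").cast(pl.Categorical),
    pl.col("flag").cast(pl.Boolean),
)

print(f"before: {df.estimated_size('kb'):.1f} KB")
print(f"after:  {slim.estimated_size('kb'):.1f} KB")
```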
### 6. Parallel Operations
Structure code to maximize parallelization:
```python
# Bad: Sequential pipe operations disable parallelization
df = (
df.pipe(operation1)
.pipe(operation2)
.pipe(operation3)
)
# Good: Combined operations enable parallelization
df = df.with_columns(
result1=operation1_expr(),
result2=operation2_expr(),
result3=operation3_expr()
)
```
### 7. Rechunk After Concatenation
```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])
# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```
## Expression Patterns
### Conditional Logic
**Simple conditions:**
```python
df.with_columns(
status=pl.when(pl.col("age") >= 18)
.then("adult")
.otherwise("minor")
)
```
**Multiple conditions:**
```python
df.with_columns(
grade=pl.when(pl.col("score") >= 90)
.then("A")
.when(pl.col("score") >= 80)
.then("B")
.when(pl.col("score") >= 70)
.then("C")
.when(pl.col("score") >= 60)
.then("D")
.otherwise("F")
)
```
**Complex conditions:**
```python
df.with_columns(
category=pl.when(
(pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
)
.then("enterprise")
.when(
(pl.col("revenue") > 100000) | (pl.col("customers") > 50)
)
.then("business")
.otherwise("starter")
)
```
### Null Handling
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
# Constant value
df.with_columns(pl.col("value").fill_null(0))
# Forward fill
df.with_columns(pl.col("value").fill_null(strategy="forward"))
# Backward fill
df.with_columns(pl.col("value").fill_null(strategy="backward"))
# Mean
df.with_columns(pl.col("value").fill_null(strategy="mean"))
# Per-group fill
df.with_columns(
pl.col("value").fill_null(pl.col("value").mean()).over("group")
)
```
**Coalesce (first non-null):**
```python
df.with_columns(
combined=pl.coalesce(["col1", "col2", "col3"])
)
```
### Column Selection Patterns
**By name:**
```python
df.select("col1", "col2", "col3")
```
**By pattern:**
```python
# Regex
df.select(pl.col("^sales_.*$"))
# Starts with (pl.col regex patterns must be anchored with ^ and $)
df.select(pl.col("^sales.*$"))
# Ends with
df.select(pl.col("^.*_total$"))
# Contains
df.select(pl.col("^.*revenue.*$"))
```
**By type:**
```python
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All string columns
df.select(pl.col(pl.Utf8))
# Multiple types
df.select(pl.col(pl.NUMERIC_DTYPES, pl.Boolean))
```
**Exclude columns:**
```python
df.select(pl.all().exclude("id", "timestamp"))
```
**Transform multiple columns:**
```python
# Apply same operation to multiple columns
df.select(
pl.col("^sales_.*$") * 1.1 # 10% increase to all sales columns
)
```
### Aggregation Patterns
**Multiple aggregations:**
```python
df.group_by("category").agg(
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").std().alias("std_dev"),
pl.col("id").count().alias("count"),
pl.col("id").n_unique().alias("unique_count"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").quantile(0.5).alias("median"),
pl.col("value").quantile(0.95).alias("p95")
)
```
**Conditional aggregations:**
```python
df.group_by("category").agg(
# Count high values
(pl.col("value") > 100).sum().alias("high_count"),
# Average of filtered values
pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),
# Conditional sum
pl.when(pl.col("status") == "completed")
.then(pl.col("amount"))
.otherwise(0)
.sum()
.alias("completed_total")
)
```
**Grouped transformations:**
```python
df.with_columns(
# Group statistics
group_mean=pl.col("value").mean().over("category"),
group_std=pl.col("value").std().over("category"),
# Rank within groups
rank=pl.col("value").rank().over("category"),
# Percentage of group total
pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
)
```
## Common Pitfalls and Anti-Patterns
### Pitfall 1: Row Iteration
```python
# Bad: Never iterate rows
for row in df.iter_rows():
    # Process row
    result = row[0] * 2
# Good: Use vectorized operations
df = df.with_columns(result=pl.col("value") * 2)
```
### Pitfall 2: Modifying in Place
```python
# Bad: Polars DataFrames don't support item assignment
df["new_col"] = df["old_col"] * 2  # Raises an error; use with_columns instead
# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)
```
### Pitfall 3: Not Using Expressions
```python
# Bad: String-based operations
df.select("value * 2") # Won't work
# Good: Expression-based
df.select(pl.col("value") * 2)
```
### Pitfall 4: Inefficient Joins
```python
# Bad: Join large tables without filtering
result = large_df1.join(large_df2, on="id")
# Good: Filter before joining
result = (
large_df1.filter(pl.col("active"))
.join(
large_df2.filter(pl.col("status") == "valid"),
on="id"
)
)
```
### Pitfall 5: Not Specifying Types
```python
# Bad: Let Polars infer everything
df = pl.read_csv("data.csv")
# Good: Specify types for correctness and performance
df = pl.read_csv(
"data.csv",
dtypes={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
)
```
### Pitfall 6: Creating Many Small DataFrames
```python
# Bad: Many operations creating intermediate DataFrames
df1 = df.filter(pl.col("age") > 25)
df2 = df1.select("name", "age")
df3 = df2.sort("age")
result = df3.head(10)
# Good: Chain operations
result = (
df.filter(pl.col("age") > 25)
.select("name", "age")
.sort("age")
.head(10)
)
# Better: Use lazy mode
result = (
df.lazy()
.filter(pl.col("age") > 25)
.select("name", "age")
.sort("age")
.head(10)
.collect()
)
```
## Memory Management
### Monitor Memory Usage
```python
# Check DataFrame size
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")
# Profile memory during operations
lf = pl.scan_csv("large.csv")
print(lf.explain()) # See query plan
```
### Reduce Memory Footprint
```python
# 1. Use lazy mode
lf = pl.scan_parquet("data.parquet")
# 2. Stream results
result = lf.collect(streaming=True)
# 3. Select only needed columns
lf = lf.select("col1", "col2")
# 4. Optimize data types
df = df.with_columns(
pl.col("int_col").cast(pl.Int32), # Downcast if possible
pl.col("category").cast(pl.Categorical) # For low cardinality
)
# 5. Drop columns not needed
df = df.drop("large_text_col", "unused_col")
```
## Testing and Debugging
### Inspect Query Plans
```python
lf = pl.scan_csv("data.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")
# View the optimized query plan
print(query.explain())
# View the unoptimized plan for comparison
print(query.explain(optimized=False))
```
### Sample Data for Development
```python
# Use n_rows for testing
df = pl.read_csv("large.csv", n_rows=1000)
# Or sample after reading
df_sample = df.sample(n=1000, seed=42)
```
### Validate Schemas
```python
# Check schema
print(df.schema)
# Ensure schema matches expectation
expected_schema = {
"id": pl.Int64,
"name": pl.Utf8,
"date": pl.Date
}
assert df.schema == expected_schema
```
### Profile Performance
```python
import time
# Time operations
start = time.time()
result = lf.collect()
print(f"Execution time: {time.time() - start:.2f}s")
# Compare eager vs lazy
start = time.time()
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
eager_time = time.time() - start
start = time.time()
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
lazy_time = time.time() - start
print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")
```
## File Format Best Practices
### Choose the Right Format
**Parquet:**
- Best for: Large datasets, archival, data lakes
- Pros: Excellent compression, columnar, fast reads
- Cons: Not human-readable
**CSV:**
- Best for: Small datasets, human inspection, legacy systems
- Pros: Universal, human-readable
- Cons: Slow, large file size, no type preservation
**Arrow IPC:**
- Best for: Inter-process communication, temporary storage
- Pros: Fastest, zero-copy, preserves all types
- Cons: Less compression than Parquet
### File Reading Best Practices
```python
# 1. Use lazy reading
lf = pl.scan_parquet("data.parquet") # Not read_parquet
# 2. Read multiple files efficiently
lf = pl.scan_parquet("data/*.parquet") # Parallel reading
# 3. Specify schema when known
lf = pl.scan_csv(
"data.csv",
dtypes={"id": pl.Int64, "date": pl.Date}
)
# 4. Use predicate pushdown
result = lf.filter(pl.col("date") >= "2023-01-01").collect()
```
### File Writing Best Practices
```python
# 1. Use Parquet for large data
df.write_parquet("output.parquet", compression="zstd")
# 2. Partition large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Use streaming for very large writes
lf.sink_parquet("output.parquet") # Streaming write
# 4. Optimize compression
df.write_parquet(
"output.parquet",
compression="snappy", # Fast compression
statistics=True # Enable predicate pushdown on read
)
```
## Code Organization
### Reusable Expressions
```python
# Define reusable expressions
age_group = (
pl.when(pl.col("age") < 18)
.then("minor")
.when(pl.col("age") < 65)
.then("adult")
.otherwise("senior")
)
revenue_per_customer = pl.col("revenue") / pl.col("customer_count")
# Use in multiple contexts
df = df.with_columns(
age_group=age_group,
rpc=revenue_per_customer
)
# Reuse in filtering
df = df.filter(revenue_per_customer > 100)
```
### Pipeline Functions
```python
def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Clean and standardize data."""
    return lf.with_columns(
        pl.col("name").str.to_uppercase(),
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("amount").fill_null(0)
    )

def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Add computed features."""
    return lf.with_columns(
        month=pl.col("date").dt.month(),
        year=pl.col("date").dt.year(),
        amount_log=pl.col("amount").log()
    )
# Compose pipeline
result = (
pl.scan_csv("data.csv")
.pipe(clean_data)
.pipe(add_features)
.filter(pl.col("year") == 2023)
.collect()
)
```
## Documentation
Always document complex expressions and transformations:
```python
# Good: Document intent
df = df.with_columns(
# Calculate customer lifetime value as sum of purchases
# divided by months since first purchase
clv=(
pl.col("total_purchases") /
((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
.dt.total_days() / 30)
)
)
```
## Version Compatibility
```python
# Check Polars version
import polars as pl
print(pl.__version__)
# Feature availability varies by version
# Document version requirements for production code
```

skills/polars/references/core_concepts.md Normal file

@@ -0,0 +1,378 @@
# Polars Core Concepts
## Expressions
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
### What are Expressions?
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
- `select()` - Select and transform columns
- `with_columns()` - Add or modify columns
- `filter()` - Filter rows
- `group_by().agg()` - Aggregate data
### Expression Syntax
**Basic column reference:**
```python
pl.col("column_name")
```
**Computed expressions:**
```python
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")
# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)
```
### Expression Contexts
**Select context:**
```python
df.select(
"name", # Simple column name
pl.col("age"), # Expression
(pl.col("age") * 12).alias("age_in_months") # Computed expression
)
```
**With_columns context:**
```python
df.with_columns(
age_doubled=pl.col("age") * 2,
name_upper=pl.col("name").str.to_uppercase()
)
```
**Filter context:**
```python
df.filter(
pl.col("age") > 25,
pl.col("city").is_in(["NY", "LA", "SF"])
)
```
**Group_by context:**
```python
df.group_by("department").agg(
pl.col("salary").mean(),
pl.col("employee_id").count()
)
```
### Expression Expansion
Apply operations to multiple columns at once:
**All columns:**
```python
df.select(pl.all() * 2)
```
**Pattern matching:**
```python
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES) + 1)
```
**Exclude patterns:**
```python
df.select(pl.all().exclude("id", "name"))
```
### Expression Composition
Expressions can be stored and reused:
```python
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()
# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)
```
## Data Types
Polars has a strict type system based on Apache Arrow. The main types are listed below, followed by a short construction sketch.
### Core Data Types
**Numeric:**
- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
- `Float32`, `Float64` - Floating point numbers
**Text:**
- `Utf8` / `String` - UTF-8 encoded strings
- `Categorical` - Categorized strings (low cardinality)
- `Enum` - Fixed set of string values
**Temporal:**
- `Date` - Calendar date (no time)
- `Datetime` - Date and time with optional timezone
- `Time` - Time of day
- `Duration` - Time duration/difference
**Boolean:**
- `Boolean` - True/False values
**Nested:**
- `List` - Variable-length lists
- `Array` - Fixed-length arrays
- `Struct` - Nested record structures
**Other:**
- `Binary` - Binary data
- `Object` - Python objects (avoid in production)
- `Null` - Null type
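A small construction sketch with an explicit schema covering a few of these types; the data and column names are hypothetical:
```python
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["a", "b", "c"],
        "signup": [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)],
        "active": [True, False, True],
        "tags": [["x"], ["x", "y"], []],
    },
    schema={
        "id": pl.UInt32,
        "name": pl.Utf8,
        "signup": pl.Date,
        "active": pl.Boolean,
        "tags": pl.List(pl.Utf8),
    },
)
print(df.schema)
```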
### Type Casting
Convert between types explicitly:
```python
# Cast to different type
df.select(
pl.col("age").cast(pl.Float64),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("id").cast(pl.Utf8)
)
```
### Null Handling
Polars uses consistent null handling across all types:
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")
```
**Drop nulls:**
```python
df.drop_nulls() # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
```
### Categorical Data
Use categorical types for string columns with low cardinality (repeated values):
```python
# Cast to categorical
df.with_columns(
pl.col("category").cast(pl.Categorical)
)
# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Maintains order information
```
## Lazy vs Eager Evaluation
Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).
### Eager Evaluation (DataFrame)
Operations execute immediately:
```python
import polars as pl
# DataFrame operations execute right away
df = pl.read_csv("data.csv") # Reads file immediately
result = df.filter(pl.col("age") > 25) # Filters immediately
final = result.select("name", "age") # Selects immediately
```
**When to use eager:**
- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback needed
### Lazy Evaluation (LazyFrame)
Operations build a query plan, optimized before execution:
```python
import polars as pl
# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv") # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25) # Adds to plan
lf3 = lf2.select("name", "age") # Adds to plan
df = lf3.collect() # NOW executes optimized plan
```
**When to use lazy:**
- Large datasets
- Complex query pipelines
- Only need subset of data
- Performance is critical
- Streaming required
### Query Optimization
Polars automatically optimizes lazy queries:
**Predicate Pushdown:**
Filter operations pushed to data source when possible:
```python
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
```
**Projection Pushdown:**
Only read needed columns from data source:
```python
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
```
**Query Plan Inspection:**
```python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain()) # Shows optimized plan
```
### Streaming Mode
Process data larger than memory:
```python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
```
**Streaming benefits:**
- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management
**Streaming limitations:**
- Not all operations support streaming
- May be slower for small data
- Some operations require materializing entire dataset
### Converting Between Eager and Lazy
**Eager to Lazy:**
```python
df = pl.read_csv("data.csv")
lf = df.lazy() # Convert to LazyFrame
```
**Lazy to Eager:**
```python
lf = pl.scan_csv("data.csv")
df = lf.collect() # Execute and return DataFrame
```
## Memory Format
Polars uses the Apache Arrow columnar memory format (a zero-copy round trip is sketched after the lists):
**Benefits:**
- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization
**Implications:**
- Data stored column-wise, not row-wise
- Column operations very fast
- Random row access slower than pandas
- Best for analytical workloads
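A minimal sketch of the zero-copy hand-off to and from Arrow:
```python
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"x": [1, 2, 3]})

# Hand the columns to any Arrow-compatible library without copying
arrow_table = df.to_arrow()
assert isinstance(arrow_table, pa.Table)

# And rebuild a DataFrame from an Arrow table
df_again = pl.from_arrow(arrow_table)
print(df_again.shape)
```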
## Parallelization
Polars parallelizes operations automatically using Rust's concurrency:
**What gets parallelized:**
- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations
**What to avoid for parallelization:**
- Python user-defined functions (UDFs)
- Lambda functions in `.map_elements()`
- Sequential `.pipe()` chains
**Best practice:**
```python
# Good: Stays in expression API (parallelized)
df.with_columns(
pl.col("value") * 10,
pl.col("value").log(),
pl.col("value").sqrt()
)
# Bad: Uses Python function (sequential)
df.with_columns(
pl.col("value").map_elements(lambda x: x * 10)
)
```
## Strict Type System
Polars enforces strict typing:
**No silent conversions:**
```python
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")
# Must cast explicitly
df.with_columns(
pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
```
**Benefits:**
- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent
**Integer nulls:**
Unlike pandas, integer columns can have nulls without converting to float:
```python
# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)
```

skills/polars/references/io_guide.md Normal file

@@ -0,0 +1,557 @@
# Polars Data I/O Guide
Comprehensive guide to reading and writing data in various formats with Polars.
## CSV Files
### Reading CSV
**Eager mode (loads into memory):**
```python
import polars as pl
# Basic read
df = pl.read_csv("data.csv")
# With options
df = pl.read_csv(
"data.csv",
separator=",",
has_header=True,
columns=["col1", "col2"], # Select specific columns
n_rows=1000, # Read only first 1000 rows
skip_rows=10, # Skip first 10 rows
dtypes={"col1": pl.Int64, "col2": pl.Utf8}, # Specify types
null_values=["NA", "null", ""], # Define null values
encoding="utf-8",
ignore_errors=False
)
```
**Lazy mode (scans without loading - recommended for large files):**
```python
# Scan CSV (builds query plan)
lf = pl.scan_csv("data.csv")
# Apply operations
result = lf.filter(pl.col("age") > 25).select("name", "age")
# Execute and load
df = result.collect()
```
### Writing CSV
```python
# Basic write
df.write_csv("output.csv")
# With options
df.write_csv(
"output.csv",
separator=",",
include_header=True,
null_value="", # How to represent nulls
quote_char='"',
line_terminator="\n"
)
```
### Multiple CSV Files
**Read multiple files:**
```python
# Read all CSVs in directory
lf = pl.scan_csv("data/*.csv")
# Read specific files
lf = pl.scan_csv(["file1.csv", "file2.csv", "file3.csv"])
```
## Parquet Files
Parquet is the recommended format for performance and compression.
### Reading Parquet
**Eager:**
```python
df = pl.read_parquet("data.parquet")
# With options
df = pl.read_parquet(
"data.parquet",
columns=["col1", "col2"], # Select specific columns
n_rows=1000, # Read first N rows
parallel="auto" # Control parallelization
)
```
**Lazy (recommended):**
```python
lf = pl.scan_parquet("data.parquet")
# Automatic predicate and projection pushdown
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Writing Parquet
```python
# Basic write
df.write_parquet("output.parquet")
# With compression
df.write_parquet(
"output.parquet",
compression="snappy", # Options: "snappy", "gzip", "brotli", "lz4", "zstd"
statistics=True, # Write statistics (enables predicate pushdown)
use_pyarrow=False # Use Rust writer (faster)
)
```
### Partitioned Parquet (Hive-style)
**Write partitioned:**
```python
# Write with partitioning
df.write_parquet(
"output_dir",
partition_by=["year", "month"] # Creates directory structure
)
# Creates: output_dir/year=2023/month=01/data.parquet
```
**Read partitioned:**
```python
lf = pl.scan_parquet("output_dir/**/*.parquet")
# Hive partitioning columns are automatically added
result = lf.filter(pl.col("year") == 2023).collect()
```
## JSON Files
### Reading JSON
**NDJSON (newline-delimited JSON) - recommended:**
```python
df = pl.read_ndjson("data.ndjson")
# Lazy
lf = pl.scan_ndjson("data.ndjson")
```
**Standard JSON:**
```python
df = pl.read_json("data.json")
# From a JSON string (read_json expects a path, file-like object, or bytes)
df = pl.read_json(b'{"col1": [1, 2], "col2": ["a", "b"]}')
```
### Writing JSON
```python
# Write NDJSON
df.write_ndjson("output.ndjson")
# Write standard JSON
df.write_json("output.json")
# Pretty printed
df.write_json("output.json", pretty=True, row_oriented=False)
```
## Excel Files
### Reading Excel
```python
# Read first sheet
df = pl.read_excel("data.xlsx")
# Specific sheet
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
# Or by 1-based index (sheet_id=0 loads all sheets)
df = pl.read_excel("data.xlsx", sheet_id=1)
# With options
df = pl.read_excel(
"data.xlsx",
sheet_name="Sheet1",
columns=["A", "B", "C"], # Excel columns
n_rows=100,
skip_rows=5,
has_header=True
)
```
### Writing Excel
```python
# Write to Excel
df.write_excel("output.xlsx")
# Multiple sheets via an xlsxwriter Workbook
import xlsxwriter
with xlsxwriter.Workbook("output.xlsx") as workbook:
    df1.write_excel(workbook, worksheet="Sheet1")
    df2.write_excel(workbook, worksheet="Sheet2")
```
## Database Connectivity
### Read from Database
```python
import polars as pl
# Read entire table
df = pl.read_database("SELECT * FROM users", connection_uri="postgresql://...")
# Using connectorx for better performance
df = pl.read_database_uri(
"SELECT * FROM users WHERE age > 25",
uri="postgresql://user:pass@localhost/db"
)
```
### Write to Database
```python
# Using SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/db")
df.write_database("table_name", connection=engine)
# With options
df.write_database(
"table_name",
connection=engine,
if_exists="replace", # or "append", "fail"
)
```
### Common Database Connectors
**PostgreSQL:**
```python
uri = "postgresql://username:password@localhost:5432/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
**MySQL:**
```python
uri = "mysql://username:password@localhost:3306/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
**SQLite:**
```python
uri = "sqlite:///path/to/database.db"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)
```
## Cloud Storage
### AWS S3
```python
# Read from S3
df = pl.read_parquet("s3://bucket/path/to/file.parquet")
lf = pl.scan_parquet("s3://bucket/path/*.parquet")
# Write to S3
df.write_parquet("s3://bucket/path/output.parquet")
# With credentials
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret"
os.environ["AWS_REGION"] = "us-west-2"
df = pl.read_parquet("s3://bucket/file.parquet")
```
### Azure Blob Storage
```python
# Read from Azure
df = pl.read_parquet("az://container/path/file.parquet")
# Write to Azure
df.write_parquet("az://container/path/output.parquet")
# With credentials
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "account"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "key"
```
### Google Cloud Storage (GCS)
```python
# Read from GCS
df = pl.read_parquet("gs://bucket/path/file.parquet")
# Write to GCS
df.write_parquet("gs://bucket/path/output.parquet")
# With credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"
```
## Google BigQuery
```python
# Read from BigQuery
df = pl.read_database(
"SELECT * FROM project.dataset.table",
connection_uri="bigquery://project"
)
# Or using Google Cloud SDK
from google.cloud import bigquery
client = bigquery.Client()
query = "SELECT * FROM project.dataset.table WHERE date > '2023-01-01'"
df = pl.from_pandas(client.query(query).to_dataframe())
```
## Apache Arrow
### IPC/Feather Format
**Read:**
```python
df = pl.read_ipc("data.arrow")
lf = pl.scan_ipc("data.arrow")
```
**Write:**
```python
df.write_ipc("output.arrow")
# Compressed
df.write_ipc("output.arrow", compression="zstd")
```
### Arrow Streaming
```python
# Write the Arrow streaming format
df.write_ipc_stream("output.arrows", compression="zstd")
# Read streaming
df = pl.read_ipc_stream("output.arrows")
```
### From/To Arrow
```python
import pyarrow as pa
# From Arrow Table
arrow_table = pa.table({"col": [1, 2, 3]})
df = pl.from_arrow(arrow_table)
# To Arrow Table
arrow_table = df.to_arrow()
```
## In-Memory Formats
### Python Dictionaries
```python
# From dict
df = pl.DataFrame({
"col1": [1, 2, 3],
"col2": ["a", "b", "c"]
})
# To dict
data_dict = df.to_dict() # Column-oriented
data_dict = df.to_dict(as_series=False) # Lists instead of Series
```
### NumPy Arrays
```python
import numpy as np
# From NumPy
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pl.DataFrame(arr, schema=["col1", "col2"])
# To NumPy
arr = df.to_numpy()
```
### Pandas DataFrames
```python
import pandas as pd
# From Pandas
pd_df = pd.DataFrame({"col": [1, 2, 3]})
pl_df = pl.from_pandas(pd_df)
# To Pandas
pd_df = pl_df.to_pandas()
# Conversion goes through Arrow; zero-copy when possible
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
```
### Lists of Rows
```python
# From list of dicts
data = [
{"name": "Alice", "age": 25},
{"name": "Bob", "age": 30}
]
df = pl.DataFrame(data)
# To list of dicts
rows = df.to_dicts()
# From list of tuples
data = [("Alice", 25), ("Bob", 30)]
df = pl.DataFrame(data, schema=["name", "age"])
```
## Streaming Large Files
For datasets larger than memory, use lazy mode with streaming:
```python
# Streaming mode
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)
# Streaming with multiple files
lf = pl.scan_parquet("data/*.parquet")
result = lf.group_by("category").agg(pl.col("value").sum()).collect(streaming=True)
```
## Best Practices
### Format Selection
**Use Parquet when:**
- Need compression (up to 10x smaller than CSV)
- Want fast reads/writes
- Need to preserve data types
- Working with large datasets
- Need predicate pushdown
**Use CSV when:**
- Need human-readable format
- Interfacing with legacy systems
- Data is small
- Need universal compatibility
**Use JSON when:**
- Working with nested/hierarchical data
- Need web API compatibility
- Data has flexible schema
**Use Arrow IPC when:**
- Need zero-copy data sharing
- Fastest serialization required
- Working between Arrow-compatible systems (the matching write calls are sketched below)
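As a rough sketch, the same frame written in each format; the file names are illustrative:
```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

df.write_parquet("data.parquet", compression="zstd")  # compact, typed, fast scans
df.write_csv("data.csv")                              # human-readable interchange
df.write_ndjson("data.ndjson")                        # line-delimited JSON records
df.write_ipc("data.arrow", compression="zstd")        # Arrow IPC, fastest round-trip
```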
### Reading Large Files
```python
# 1. Always use lazy mode
lf = pl.scan_csv("large.csv") # NOT read_csv
# 2. Filter and select early (pushdown optimization)
result = (
lf
.select("col1", "col2", "col3") # Only needed columns
.filter(pl.col("date") > "2023-01-01") # Filter early
.collect()
)
# 3. Use streaming for very large data
result = lf.filter(...).select(...).collect(streaming=True)
# 4. Read only needed rows during development
df = pl.read_csv("large.csv", n_rows=10000) # Sample for testing
```
### Writing Large Files
```python
# 1. Use Parquet with compression
df.write_parquet("output.parquet", compression="zstd")
# 2. Use partitioning for very large datasets
df.write_parquet("output", partition_by=["year", "month"])
# 3. Write streaming
lf = pl.scan_csv("input.csv")
lf.sink_parquet("output.parquet") # Streaming write
```
### Performance Tips
```python
# 1. Specify dtypes when reading CSV
df = pl.read_csv(
"data.csv",
dtypes={"id": pl.Int64, "name": pl.Utf8} # Avoids inference
)
# 2. Use appropriate compression
df.write_parquet("output.parquet", compression="snappy") # Fast
df.write_parquet("output.parquet", compression="zstd") # Better compression
# 3. CSV parsing is multi-threaded by default; cap it with n_threads if needed
df = pl.read_csv("data.csv", n_threads=4)
# 4. Read multiple files in parallel
lf = pl.scan_parquet("data/*.parquet") # Automatic parallel read
```
## Error Handling
```python
try:
    df = pl.read_csv("data.csv")
except pl.exceptions.ComputeError as e:
    print(f"Error reading CSV: {e}")
# Ignore errors during parsing
df = pl.read_csv("messy.csv", ignore_errors=True)
# Handle missing files
from pathlib import Path
if Path("data.csv").exists():
    df = pl.read_csv("data.csv")
else:
    print("File not found")
```
## Schema Management
```python
# Infer schema from sample
schema = pl.read_csv("data.csv", n_rows=1000).schema
# Use inferred schema for full read
df = pl.read_csv("data.csv", dtypes=schema)
# Define schema explicitly
schema = {
"id": pl.Int64,
"name": pl.Utf8,
"date": pl.Date,
"value": pl.Float64
}
df = pl.read_csv("data.csv", dtypes=schema)
```

skills/polars/references/operations.md Normal file

@@ -0,0 +1,602 @@
# Polars Operations Reference
This reference covers all common Polars operations with comprehensive examples.
## Selection Operations
### Select Columns
**Basic selection:**
```python
# Select specific columns
df.select("name", "age", "city")
# Using expressions
df.select(pl.col("name"), pl.col("age"))
```
**Pattern-based selection:**
```python
# All columns starting with "sales_"
df.select(pl.col("^sales_.*$"))
# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES))
# All columns except specific ones
df.select(pl.all().exclude("id", "timestamp"))
```
**Computed columns:**
```python
df.select(
"name",
(pl.col("age") * 12).alias("age_in_months"),
(pl.col("salary") * 1.1).alias("salary_after_raise")
)
```
### With Columns (Add/Modify)
Add new columns or modify existing ones while preserving all other columns:
```python
# Add new columns
df.with_columns(
age_doubled=pl.col("age") * 2,
full_name=pl.col("first_name") + " " + pl.col("last_name")
)
# Modify existing columns
df.with_columns(
pl.col("name").str.to_uppercase().alias("name"),
pl.col("salary").cast(pl.Float64).alias("salary")
)
# Multiple operations in parallel
df.with_columns(
pl.col("value") * 10,
pl.col("value") * 100,
pl.col("value") * 1000,
)
```
## Filtering Operations
### Basic Filtering
```python
# Single condition
df.filter(pl.col("age") > 25)
# Multiple conditions (AND)
df.filter(
pl.col("age") > 25,
pl.col("city") == "NY"
)
# OR conditions
df.filter(
(pl.col("age") > 30) | (pl.col("salary") > 100000)
)
# NOT condition
df.filter(~pl.col("active"))
df.filter(pl.col("city") != "NY")
```
### Advanced Filtering
**String operations:**
```python
# Contains substring
df.filter(pl.col("name").str.contains("John"))
# Starts with
df.filter(pl.col("email").str.starts_with("admin"))
# Regex match
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{3}-\d{4}$"))
```
**Membership checks:**
```python
# In list
df.filter(pl.col("city").is_in(["NY", "LA", "SF"]))
# Not in list
df.filter(~pl.col("status").is_in(["inactive", "deleted"]))
```
**Range filters:**
```python
# Between values
df.filter(pl.col("age").is_between(25, 35))
# Date range
df.filter(
pl.col("date") >= pl.date(2023, 1, 1),
pl.col("date") <= pl.date(2023, 12, 31)
)
```
**Null filtering:**
```python
# Filter out nulls
df.filter(pl.col("value").is_not_null())
# Keep only nulls
df.filter(pl.col("value").is_null())
```
## Grouping and Aggregation
### Basic Group By
```python
# Group by single column
df.group_by("department").agg(
pl.col("salary").mean().alias("avg_salary"),
pl.len().alias("employee_count")
)
# Group by multiple columns
df.group_by("department", "location").agg(
pl.col("salary").sum()
)
# Maintain order
df.group_by("category", maintain_order=True).agg(
pl.col("value").sum()
)
```
### Aggregation Functions
**Count and length:**
```python
df.group_by("category").agg(
pl.len().alias("count"),
pl.col("id").count().alias("non_null_count"),
pl.col("id").n_unique().alias("unique_count")
)
```
**Statistical aggregations:**
```python
df.group_by("group").agg(
pl.col("value").sum().alias("total"),
pl.col("value").mean().alias("average"),
pl.col("value").median().alias("median"),
pl.col("value").std().alias("std_dev"),
pl.col("value").var().alias("variance"),
pl.col("value").min().alias("minimum"),
pl.col("value").max().alias("maximum"),
pl.col("value").quantile(0.95).alias("p95")
)
```
**First and last:**
```python
df.group_by("user_id").agg(
pl.col("timestamp").first().alias("first_seen"),
pl.col("timestamp").last().alias("last_seen"),
pl.col("event").first().alias("first_event")
)
```
**List aggregation:**
```python
# Collect values into lists
df.group_by("category").agg(
pl.col("item").alias("all_items") # Creates list column
)
```
### Conditional Aggregations
Filter within aggregations:
```python
df.group_by("department").agg(
# Count high earners
(pl.col("salary") > 100000).sum().alias("high_earners"),
# Average of filtered values
pl.col("salary").filter(pl.col("bonus") > 0).mean().alias("avg_with_bonus"),
# Conditional sum
pl.when(pl.col("active"))
.then(pl.col("sales"))
.otherwise(0)
.sum()
.alias("active_sales")
)
```
### Multiple Aggregations
Combine multiple aggregations efficiently:
```python
df.group_by("store_id").agg(
pl.col("transaction_id").count().alias("num_transactions"),
pl.col("amount").sum().alias("total_sales"),
pl.col("amount").mean().alias("avg_transaction"),
pl.col("customer_id").n_unique().alias("unique_customers"),
pl.col("amount").max().alias("largest_transaction"),
pl.col("timestamp").min().alias("first_transaction_date"),
pl.col("timestamp").max().alias("last_transaction_date")
)
```
## Window Functions
Window functions apply aggregations while preserving the original row count.
### Basic Window Operations
**Group statistics:**
```python
# Add group mean to each row
df.with_columns(
avg_age_by_dept=pl.col("age").mean().over("department")
)
# Multiple group columns
df.with_columns(
group_avg=pl.col("value").mean().over("category", "region")
)
```
**Ranking:**
```python
df.with_columns(
# Rank within groups
rank=pl.col("score").rank().over("team"),
# Dense rank (no gaps)
dense_rank=pl.col("score").rank(method="dense").over("team"),
# Row number ordered by timestamp
row_num=pl.col("timestamp").rank(method="ordinal").over("user_id")
)
```
### Window Mapping Strategies
**group_to_rows (default):**
Preserves original row order:
```python
df.with_columns(
group_mean=pl.col("value").mean().over("category", mapping_strategy="group_to_rows")
)
```
**explode:**
Faster, groups rows together:
```python
df.with_columns(
group_mean=pl.col("value").mean().over("category", mapping_strategy="explode")
)
```
**join:**
Creates list columns:
```python
df.with_columns(
group_values=pl.col("value").over("category", mapping_strategy="join")
)
```
### Rolling Windows
**Time-based rolling:**
```python
df.with_columns(
rolling_avg=pl.col("value").rolling_mean(
window_size="7d",
by="date"
)
)
```
**Row-based rolling:**
```python
df.with_columns(
rolling_sum=pl.col("value").rolling_sum(window_size=3),
rolling_max=pl.col("value").rolling_max(window_size=5)
)
```
### Cumulative Operations
```python
df.with_columns(
cumsum=pl.col("value").cum_sum().over("group"),
cummax=pl.col("value").cum_max().over("group"),
cummin=pl.col("value").cum_min().over("group"),
cumprod=pl.col("value").cum_prod().over("group")
)
```
### Shift and Lag/Lead
```python
df.with_columns(
# Previous value (lag)
prev_value=pl.col("value").shift(1).over("user_id"),
# Next value (lead)
next_value=pl.col("value").shift(-1).over("user_id"),
# Calculate difference from previous
diff=pl.col("value") - pl.col("value").shift(1).over("user_id")
)
```
## Sorting
### Basic Sorting
```python
# Sort by single column
df.sort("age")
# Sort descending
df.sort("age", descending=True)
# Sort by multiple columns
df.sort("department", "age")
# Mixed sorting order
df.sort(["department", "salary"], descending=[False, True])
```
### Advanced Sorting
**Null handling:**
```python
# Nulls first
df.sort("value", nulls_last=False)
# Nulls last
df.sort("value", nulls_last=True)
```
**Sort by expression:**
```python
# Sort by computed value
df.sort(pl.col("first_name").str.len())
# Sort by multiple expressions
df.sort(
pl.col("last_name").str.to_lowercase(),
pl.col("age").abs()
)
```
## Conditional Operations
### When/Then/Otherwise
```python
# Basic conditional
df.with_columns(
status=pl.when(pl.col("age") >= 18)
.then("adult")
.otherwise("minor")
)
# Multiple conditions
df.with_columns(
category=pl.when(pl.col("score") >= 90)
.then("A")
.when(pl.col("score") >= 80)
.then("B")
.when(pl.col("score") >= 70)
.then("C")
.otherwise("F")
)
# Conditional computation
df.with_columns(
adjusted_price=pl.when(pl.col("is_member"))
.then(pl.col("price") * 0.9)
.otherwise(pl.col("price"))
)
```
## String Operations
### Common String Methods
```python
df.with_columns(
# Case conversion
upper=pl.col("name").str.to_uppercase(),
lower=pl.col("name").str.to_lowercase(),
title=pl.col("name").str.to_titlecase(),
# Trimming
trimmed=pl.col("text").str.strip_chars(),
# Substring
first_3=pl.col("name").str.slice(0, 3),
# Replace
cleaned=pl.col("text").str.replace("old", "new"),
cleaned_all=pl.col("text").str.replace_all("old", "new"),
# Split
parts=pl.col("full_name").str.split(" "),
# Length
name_length=pl.col("name").str.len_chars()
)
```
### String Filtering
```python
# Contains
df.filter(pl.col("email").str.contains("@gmail.com"))
# Starts/ends with
df.filter(pl.col("name").str.starts_with("A"))
df.filter(pl.col("file").str.ends_with(".csv"))
# Regex matching
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{4}$"))
```
## Date and Time Operations
### Date Parsing
```python
# Parse strings to dates
df.with_columns(
date=pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d"),
datetime=pl.col("dt_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
)
```
### Date Components
```python
df.with_columns(
year=pl.col("date").dt.year(),
month=pl.col("date").dt.month(),
day=pl.col("date").dt.day(),
weekday=pl.col("date").dt.weekday(),
hour=pl.col("datetime").dt.hour(),
minute=pl.col("datetime").dt.minute()
)
```
### Date Arithmetic
```python
# Add duration
df.with_columns(
next_week=pl.col("date") + pl.duration(weeks=1),
next_month=pl.col("date") + pl.duration(months=1)
)
# Difference between dates
df.with_columns(
days_diff=(pl.col("end_date") - pl.col("start_date")).dt.total_days()
)
```
### Date Filtering
```python
# Filter by date range
df.filter(
pl.col("date").is_between(pl.date(2023, 1, 1), pl.date(2023, 12, 31))
)
# Filter by year
df.filter(pl.col("date").dt.year() == 2023)
# Filter by month
df.filter(pl.col("date").dt.month().is_in([6, 7, 8])) # Summer months
```
## List Operations
### Working with List Columns
```python
# Create list column
df.with_columns(
items_list=pl.col("item1", "item2", "item3").to_list()
)
# List operations
df.with_columns(
list_len=pl.col("items").list.len(),
first_item=pl.col("items").list.first(),
last_item=pl.col("items").list.last(),
unique_items=pl.col("items").list.unique(),
sorted_items=pl.col("items").list.sort()
)
# Explode lists to rows
df.explode("items")
# Filter list elements
df.with_columns(
filtered=pl.col("items").list.eval(pl.element() > 10)
)
```
## Struct Operations
### Working with Nested Structures
```python
# Create struct column
df.with_columns(
address=pl.struct(["street", "city", "zip"])
)
# Access struct fields
df.with_columns(
city=pl.col("address").struct.field("city")
)
# Unnest struct to columns
df.unnest("address")
```
## Unique and Duplicate Operations
```python
# Get unique rows
df.unique()
# Unique on specific columns
df.unique(subset=["name", "email"])
# Keep first/last duplicate
df.unique(subset=["id"], keep="first")
df.unique(subset=["id"], keep="last")
# Identify duplicates
df.with_columns(
is_duplicate=pl.col("id").is_duplicated()
)
# Count duplicates
df.group_by("email").agg(
pl.len().alias("count")
).filter(pl.col("count") > 1)
```
## Sampling
```python
# Random sample
df.sample(n=100)
# Sample fraction
df.sample(fraction=0.1)
# Sample with seed for reproducibility
df.sample(n=100, seed=42)
```
## Column Renaming
```python
# Rename specific columns
df.rename({"old_name": "new_name", "age": "years"})
# Rename with expression
df.select(pl.col("*").name.suffix("_renamed"))
df.select(pl.col("*").name.prefix("data_"))
df.select(pl.col("*").name.to_uppercase())
```

skills/polars/references/pandas_migration.md Normal file

@@ -0,0 +1,417 @@
# Pandas to Polars Migration Guide
This guide helps you migrate from pandas to Polars with comprehensive operation mappings and key differences.
## Core Conceptual Differences
### 1. No Index System
**Pandas:** Uses row-based indexing system
```python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")
```
**Polars:** Uses integer positions only
```python
df[0, "column"] # Row position, column name
df[0:5] # Row slice
# No set_index equivalent - use group_by instead
```
### 2. Memory Format
**Pandas:** Row-oriented NumPy arrays
**Polars:** Columnar Apache Arrow format
**Implications:**
- Polars is faster for column operations
- Polars uses less memory
- Polars has better data sharing capabilities
### 3. Parallelization
**Pandas:** Primarily single-threaded (requires Dask for parallelism)
**Polars:** Parallel by default using Rust's concurrency
### 4. Lazy Evaluation
**Pandas:** Only eager evaluation
**Polars:** Both eager (DataFrame) and lazy (LazyFrame) with query optimization; a minimal lazy pipeline is sketched below
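A minimal side-by-side sketch, assuming a hypothetical `data.csv` with `name` and `age` columns:
```python
import pandas as pd
import polars as pl

# Pandas: each step executes eagerly, materializing intermediates
pdf = pd.read_csv("data.csv")
p_result = pdf[pdf["age"] > 25][["name", "age"]]

# Polars: the lazy query is optimized as a whole before running
result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .collect()
)
```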
### 5. Type Strictness
**Pandas:** Allows silent type conversions
**Polars:** Strict typing, explicit casts required
**Example:**
```python
# Pandas: Silently converts to float
pd_df["int_col"] = [1, 2, None, 4] # dtype: float64
# Polars: Keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]}) # dtype: Int64
```
## Operation Mappings
### Data Selection
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
### Data Filtering
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |
### Adding/Modifying Columns
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |
**Important difference - Parallel execution:**
```python
# Pandas: Sequential (lambda sees previous results)
df.assign(
a=lambda df_: df_.value * 10,
b=lambda df_: df_.value * 100
)
# Polars: Parallel (all computed together)
df.with_columns(
a=pl.col("value") * 10,
b=pl.col("value") * 100
)
```
### Grouping and Aggregation
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |
### Window Functions
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |
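A short worked example of the Polars side of these window mappings (column names are illustrative):
```python
df = pl.DataFrame({
    "cat": ["a", "a", "b", "b"],
    "val": [1, 2, 3, 4],
})

df.with_columns(
    group_mean=pl.col("val").mean().over("cat"),
    rank_in_group=pl.col("val").rank().over("cat"),
    running_total=pl.col("val").cum_sum().over("cat"),
)
```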
### Joins
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |
### Concatenation
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |
### Sorting
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |
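Polars also accepts one direction per sort key; a small sketch (column names are illustrative):
```python
# Mixed directions: city ascending, age descending
df.sort(["city", "age"], descending=[False, True])

# Push nulls to the end of the ordering
df.sort("age", nulls_last=True)
```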
### Reshaping
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(values="c", index="a", columns="b")` |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |
### I/O Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |
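For large files, the lazy `scan_*` readers pair with `sink_*` writers to convert formats without materializing the data in memory; a sketch with illustrative paths:
```python
lf = pl.scan_csv("large_file.csv")
lf.sink_parquet("large_file.parquet")
```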
### String Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |
### Datetime Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |
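A short worked example chaining the parse and the component accessors (values are illustrative):
```python
df = pl.DataFrame({"col": ["2023-01-15", "2023-02-20"]})

df.with_columns(
    date=pl.col("col").str.strptime(pl.Date, "%Y-%m-%d")
).with_columns(
    year=pl.col("date").dt.year(),
    month=pl.col("date").dt.month(),
    day=pl.col("date").dt.day(),
)
```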
### Missing Data
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` | `df.select(pl.col("col").fill_null(strategy="forward"))` |
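`fill_null` accepts a literal, a strategy, or an expression; a small sketch:
```python
df = pl.DataFrame({"value": [1.0, None, 3.0, None]})

df.with_columns(
    filled_zero=pl.col("value").fill_null(0),
    filled_forward=pl.col("value").fill_null(strategy="forward"),
    filled_median=pl.col("value").fill_null(pl.col("value").median()),
)
```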
### Other Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |
## Common Migration Patterns
### Pattern 1: Chained Operations
**Pandas:**
```python
result = (df
.assign(new_col=lambda x: x["old_col"] * 2)
.query("new_col > 10")
.groupby("category")
.agg({"value": "sum"})
.reset_index()
)
```
**Polars:**
```python
result = (df
.with_columns(new_col=pl.col("old_col") * 2)
.filter(pl.col("new_col") > 10)
.group_by("category")
.agg(pl.col("value").sum())
)
# No reset_index needed - Polars doesn't have index
```
### Pattern 2: Apply Functions
**Pandas:**
```python
# Row-wise apply: avoid translating this literally into Polars (Python UDFs break parallelization)
df["result"] = df["value"].apply(lambda x: x * 2)
```
**Polars:**
```python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)
# If custom function needed
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
```
### Pattern 3: Conditional Column Creation
**Pandas:**
```python
df["category"] = np.where(
df["value"] > 100,
"high",
np.where(df["value"] > 50, "medium", "low")
)
```
**Polars:**
```python
df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
    .then(pl.lit("high"))
    .when(pl.col("value") > 50)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
)
# Use pl.lit() for string literals: bare strings in then()/otherwise() are parsed as column names
```
### Pattern 4: Group Transform
**Pandas:**
```python
df["group_mean"] = df.groupby("category")["value"].transform("mean")
```
**Polars:**
```python
df = df.with_columns(
group_mean=pl.col("value").mean().over("category")
)
```
### Pattern 5: Multiple Aggregations
**Pandas:**
```python
result = df.groupby("category").agg({
"value": ["mean", "sum", "count"],
"price": ["min", "max"]
})
```
**Polars:**
```python
result = df.group_by("category").agg(
pl.col("value").mean().alias("value_mean"),
pl.col("value").sum().alias("value_sum"),
pl.col("value").count().alias("value_count"),
pl.col("price").min().alias("price_min"),
pl.col("price").max().alias("price_max")
)
```
## Performance Anti-Patterns to Avoid
### Anti-Pattern 1: Sequential Pipe Operations
**Bad (disables parallelization):**
```python
# Each pipe is a separate pass over the data; its expressions cannot be fused
df = (
    df.pipe(lambda d: d.with_columns(a=pl.col("value") * 10))
      .pipe(lambda d: d.with_columns(b=pl.col("value") * 100))
)
```
**Good (enables parallelization):**
```python
# One context: both expressions are computed in parallel
df = df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100,
)
```
### Anti-Pattern 2: Python Functions in Hot Paths
**Bad:**
```python
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2)
)
```
**Good:**
```python
df = df.with_columns(result=pl.col("value") * 2)
```
### Anti-Pattern 3: Using Eager Reading for Large Files
**Bad:**
```python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
```
**Good:**
```python
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Anti-Pattern 4: Row Iteration
**Bad:**
```python
for row in df.iter_rows():
# Process row
pass
```
**Good:**
```python
# Use vectorized expressions instead of a Python-level loop
df = df.with_columns(
    doubled=pl.col("value") * 2
)
```
## Migration Checklist
When migrating from pandas to Polars:
1. **Remove index operations** - Use integer positions or group_by
2. **Replace apply/map with expressions** - Use Polars native operations
3. **Update column assignment** - Use `with_columns()` instead of direct assignment
4. **Change groupby.transform to .over()** - Window functions work differently
5. **Update string operations** - Use `.str.to_uppercase()` instead of `.str.upper()`
6. **Add explicit type casts** - Polars won't silently convert types
7. **Consider lazy evaluation** - Use `scan_*` instead of `read_*` for large data
8. **Update aggregation syntax** - More explicit in Polars
9. **Remove reset_index calls** - Not needed in Polars
10. **Update conditional logic** - Use `when().then().otherwise()` pattern
## Compatibility Layer
For gradual migration, you can use both libraries:
```python
import pandas as pd
import polars as pl
# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)
# Convert Polars to pandas
pd_df = pl_df.to_pandas()
# Use Arrow for (near) zero-copy interchange; from_arrow expects Arrow data, not a pandas DataFrame
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()
```
## When to Stick with Pandas
Consider staying with pandas when:
- Working with time series requiring complex index operations
- Need extensive ecosystem support (some libraries only support pandas)
- Team lacks Rust/Polars expertise
- Data is small and performance isn't critical
- Using advanced pandas features without Polars equivalents
## When to Switch to Polars
Switch to Polars when:
- Performance is critical
- Working with large datasets (>1GB)
- Need lazy evaluation and query optimization
- Want better type safety
- Need parallel execution by default
- Starting a new project

# Polars Data Transformations
Comprehensive guide to joins, concatenation, and reshaping operations in Polars.
## Joins
Joins combine data from multiple DataFrames based on common columns.
### Basic Join Types
**Inner Join (intersection):**
```python
# Keep only matching rows from both DataFrames
result = df1.join(df2, on="id", how="inner")
```
**Left Join (all left + matches from right):**
```python
# Keep all rows from left, add matching rows from right
result = df1.join(df2, on="id", how="left")
```
**Full (Outer) Join (union):**
```python
# Keep all rows from both DataFrames ("outer" in older Polars versions)
result = df1.join(df2, on="id", how="full")
```
**Cross Join (Cartesian product):**
```python
# Every row from left with every row from right
result = df1.join(df2, how="cross")
```
**Semi Join (filtered left):**
```python
# Keep only left rows that have a match in right
result = df1.join(df2, on="id", how="semi")
```
**Anti Join (non-matching left):**
```python
# Keep only left rows that DON'T have a match in right
result = df1.join(df2, on="id", how="anti")
```
### Join Syntax Variations
**Single column join:**
```python
df1.join(df2, on="id")
```
**Multiple columns join:**
```python
df1.join(df2, on=["id", "date"])
```
**Different column names:**
```python
df1.join(df2, left_on="user_id", right_on="id")
```
**Multiple different columns:**
```python
df1.join(
df2,
left_on=["user_id", "date"],
right_on=["id", "timestamp"]
)
```
### Suffix Handling
When both DataFrames have columns with the same name (other than join keys):
```python
# Add suffixes to distinguish columns
result = df1.join(df2, on="id", suffix="_right")
# Results in: value, value_right (if both had "value" column)
```
### Join Examples
**Example 1: Customer Orders**
```python
customers = pl.DataFrame({
"customer_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Charlie", "David"]
})
orders = pl.DataFrame({
"order_id": [101, 102, 103],
"customer_id": [1, 2, 1],
"amount": [100, 200, 150]
})
# Inner join - only customers with orders
result = customers.join(orders, on="customer_id", how="inner")
# Left join - all customers, even without orders
result = customers.join(orders, on="customer_id", how="left")
```
**Example 2: Time-series data**
```python
prices = pl.DataFrame({
"date": ["2023-01-01", "2023-01-02", "2023-01-03"],
"stock": ["AAPL", "AAPL", "AAPL"],
"price": [150, 152, 151]
})
volumes = pl.DataFrame({
"date": ["2023-01-01", "2023-01-02"],
"stock": ["AAPL", "AAPL"],
"volume": [1000000, 1100000]
})
result = prices.join(
volumes,
on=["date", "stock"],
how="left"
)
```
### Asof Joins (Nearest Match)
For time-series data, join to nearest timestamp:
```python
# Join to nearest earlier timestamp
quotes = pl.DataFrame({
"timestamp": [1, 2, 3, 4, 5],
"stock": ["A", "A", "A", "A", "A"],
"quote": [100, 101, 102, 103, 104]
})
trades = pl.DataFrame({
"timestamp": [1.5, 3.5, 4.2],
"stock": ["A", "A", "A"],
"trade": [50, 75, 100]
})
result = trades.join_asof(
quotes,
on="timestamp",
by="stock",
strategy="backward" # or "forward", "nearest"
)
```
## Concatenation
Concatenation stacks DataFrames together.
### Vertical Concatenation (Stack Rows)
```python
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})
# Stack rows
result = pl.concat([df1, df2], how="vertical")
# Result: 4 rows, same columns
```
**Handling mismatched schemas:**
```python
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "c": [7, 8]})
# Diagonal concat - fills missing columns with nulls
result = pl.concat([df1, df2], how="diagonal")
# Result: columns a, b, c (with nulls where not present)
```
### Horizontal Concatenation (Stack Columns)
```python
df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"b": [4, 5, 6]})
# Stack columns
result = pl.concat([df1, df2], how="horizontal")
# Result: 3 rows, columns a and b
```
**Note:** Horizontal concat requires same number of rows.
### Concatenation Options
```python
# Rechunk after concatenation (better performance for subsequent operations)
result = pl.concat([df1, df2], rechunk=True)
# Parallel execution (the parallel flag only takes effect when the inputs are LazyFrames)
result = pl.concat([df1, df2], parallel=True)
```
### Use Cases
**Combining data from multiple sources:**
```python
# Read multiple files and concatenate
files = ["data_2023.csv", "data_2024.csv", "data_2025.csv"]
dfs = [pl.read_csv(f) for f in files]
combined = pl.concat(dfs, how="vertical")
```
**Adding computed columns:**
```python
base = pl.DataFrame({"value": [1, 2, 3]})
computed = pl.DataFrame({"doubled": [2, 4, 6]})
result = pl.concat([base, computed], how="horizontal")
```
## Pivoting (Wide Format)
Convert unique values from one column into multiple columns.
### Basic Pivot
```python
df = pl.DataFrame({
"date": ["2023-01", "2023-01", "2023-02", "2023-02"],
"product": ["A", "B", "A", "B"],
"sales": [100, 150, 120, 160]
})
# Pivot: products become columns
pivoted = df.pivot(
values="sales",
index="date",
columns="product"
)
# Result:
# date | A | B
# 2023-01 | 100 | 150
# 2023-02 | 120 | 160
```
### Pivot with Aggregation
When there are duplicate combinations, aggregate:
```python
df = pl.DataFrame({
"date": ["2023-01", "2023-01", "2023-01"],
"product": ["A", "A", "B"],
"sales": [100, 110, 150]
})
# Aggregate duplicates
pivoted = df.pivot(
values="sales",
index="date",
columns="product",
aggregate_function="sum" # or "mean", "max", "min", etc.
)
```
### Multiple Index Columns
```python
df = pl.DataFrame({
"region": ["North", "North", "South", "South"],
"date": ["2023-01", "2023-01", "2023-01", "2023-01"],
"product": ["A", "B", "A", "B"],
"sales": [100, 150, 120, 160]
})
pivoted = df.pivot(
values="sales",
index=["region", "date"],
columns="product"
)
```
## Unpivoting/Melting (Long Format)
Convert multiple columns into rows (opposite of pivot).
### Basic Unpivot
```python
df = pl.DataFrame({
"date": ["2023-01", "2023-02"],
"product_A": [100, 120],
"product_B": [150, 160]
})
# Unpivot: convert columns to rows
unpivoted = df.unpivot(
index="date",
on=["product_A", "product_B"]
)
# Result:
# date | variable | value
# 2023-01 | product_A | 100
# 2023-01 | product_B | 150
# 2023-02 | product_A | 120
# 2023-02 | product_B | 160
```
### Custom Column Names
```python
unpivoted = df.unpivot(
index="date",
on=["product_A", "product_B"],
variable_name="product",
value_name="sales"
)
```
### Unpivot by Pattern
```python
# Unpivot all columns matching pattern
df = pl.DataFrame({
"id": [1, 2],
"sales_Q1": [100, 200],
"sales_Q2": [150, 250],
"sales_Q3": [120, 220],
"revenue_Q1": [1000, 2000]
})
# Unpivot all sales columns
unpivoted = df.unpivot(
index="id",
on=pl.col("^sales_.*$")
)
```
## Exploding (Unnesting Lists)
Convert list columns into multiple rows.
### Basic Explode
```python
df = pl.DataFrame({
"id": [1, 2],
"values": [[1, 2, 3], [4, 5]]
})
# Explode list into rows
exploded = df.explode("values")
# Result:
# id | values
# 1 | 1
# 1 | 2
# 1 | 3
# 2 | 4
# 2 | 5
```
### Multiple Column Explode
```python
df = pl.DataFrame({
"id": [1, 2],
"letters": [["a", "b"], ["c", "d"]],
"numbers": [[1, 2], [3, 4]]
})
# Explode multiple columns (must be same length)
exploded = df.explode("letters", "numbers")
```
## Transposing
Swap rows and columns:
```python
df = pl.DataFrame({
"metric": ["sales", "costs", "profit"],
"Q1": [100, 60, 40],
"Q2": [150, 80, 70]
})
# Transpose
transposed = df.transpose(
include_header=True,
header_name="quarter",
column_names="metric"
)
# Result: quarters as rows, metrics as columns
```
## Reshaping Patterns
### Pattern 1: Wide to Long to Wide
```python
# Start wide
wide = pl.DataFrame({
"id": [1, 2],
"A": [10, 20],
"B": [30, 40]
})
# To long
long = wide.unpivot(index="id", on=["A", "B"])
# Back to wide (maybe with transformations)
wide_again = long.pivot(values="value", index="id", columns="variable")
```
### Pattern 2: Nested to Flat
```python
# Nested data
df = pl.DataFrame({
"user": [1, 2],
"purchases": [
[{"item": "A", "qty": 2}, {"item": "B", "qty": 1}],
[{"item": "C", "qty": 3}]
]
})
# Explode and unnest
flat = (
df.explode("purchases")
.unnest("purchases")
)
```
### Pattern 3: Aggregation to Pivot
```python
# Raw data
sales = pl.DataFrame({
"date": ["2023-01", "2023-01", "2023-02"],
"product": ["A", "B", "A"],
"sales": [100, 150, 120]
})
# Aggregate then pivot
result = (
sales
.group_by("date", "product")
.agg(pl.col("sales").sum())
.pivot(values="sales", index="date", columns="product")
)
```
## Advanced Transformations
### Conditional Reshaping
```python
# Pivot only certain values
df.filter(pl.col("year") >= 2020).pivot(...)
# Unpivot with filtering
df.unpivot(index="id", on=pl.col("^sales.*$"))
```
### Multi-level Transformations
```python
# Complex reshaping pipeline
result = (
df
.unpivot(index="id", on=pl.col("^Q[0-9]_.*$"))
.with_columns(
quarter=pl.col("variable").str.extract(r"Q([0-9])", 1),
metric=pl.col("variable").str.extract(r"Q[0-9]_(.*)", 1)
)
.drop("variable")
.pivot(values="value", index=["id", "quarter"], columns="metric")
)
```
## Performance Considerations
### Join Performance
```python
# 1. Join on sorted columns when possible (Polars has no index)
df1_sorted = df1.sort("id")
df2_sorted = df2.sort("id")
result = df1_sorted.join(df2_sorted, on="id")
# 2. Use appropriate join type
# semi/anti are faster than inner+filter
matches = df1.join(df2, on="id", how="semi") # Better than filtering after inner join
# 3. Filter before joining
df1_filtered = df1.filter(pl.col("active"))
result = df1_filtered.join(df2, on="id") # Smaller join
```
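In lazy mode the third point happens largely automatically: the optimizer can push a filter that touches only one input below the join. A sketch with hypothetical file paths, assuming the `active` column comes from the orders side:
```python
result = (
    pl.scan_parquet("orders.parquet")
    .join(pl.scan_parquet("users.parquet"), on="id")
    .filter(pl.col("active"))
    .collect()
)
```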
### Concatenation Performance
```python
# 1. Rechunk after concatenation
result = pl.concat(dfs, rechunk=True)
# 2. Use lazy mode for large concatenations
lf1 = pl.scan_parquet("file1.parquet")
lf2 = pl.scan_parquet("file2.parquet")
result = pl.concat([lf1, lf2]).collect()
```
### Pivot Performance
```python
# 1. Filter before pivoting
pivoted = df.filter(pl.col("year") == 2023).pivot(...)
# 2. Specify aggregate function explicitly
pivoted = df.pivot(..., aggregate_function="first") # Faster than "sum" if only one value
```
## Common Use Cases
### Time Series Alignment
```python
# Align two time series with different timestamps
ts1.join_asof(ts2, on="timestamp", strategy="backward")
```
### Feature Engineering
```python
# Create lag features
df.with_columns(
pl.col("value").shift(1).over("user_id").alias("prev_value"),
pl.col("value").shift(2).over("user_id").alias("prev_prev_value")
)
```
### Data Denormalization
```python
# Combine normalized tables
orders.join(customers, on="customer_id").join(products, on="product_id")
```
### Report Generation
```python
# Pivot for reporting
sales.pivot(values="amount", index="month", columns="product")
```