# Pandas to Polars Migration Guide
This guide maps common pandas operations to their Polars equivalents and explains the key conceptual differences to keep in mind when migrating.
## Core Conceptual Differences
### 1. No Index System
**Pandas:** Uses a label-based row index
```python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")
```
**Polars:** Uses integer positions only
```python
df[0, "column"] # Row position, column name
df[0:5] # Row slice
# No set_index equivalent - there is no index; use filter, join, or group_by on regular columns
```
### 2. Memory Format
**Pandas:** NumPy-backed arrays (columns consolidated into per-dtype blocks)
**Polars:** Columnar Apache Arrow format
**Implications:**
- Polars is faster for column operations
- Polars uses less memory
- Polars has better data sharing capabilities
### 3. Parallelization
**Pandas:** Primarily single-threaded (parallelism requires extra tooling such as Dask or Modin)
**Polars:** Parallel by default using Rust's concurrency
### 4. Lazy Evaluation
**Pandas:** Only eager evaluation
**Polars:** Both eager (DataFrame) and lazy (LazyFrame) with query optimization
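A minimal sketch of both modes, assuming a local `data.csv` with `name` and `age` columns:
```python
import polars as pl

# Eager: work happens immediately on a DataFrame
df = pl.read_csv("data.csv")
adults = df.filter(pl.col("age") > 25)

# Lazy: build a query plan on a LazyFrame; nothing runs until collect()
lf = pl.scan_csv("data.csv")
adults = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```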
### 5. Type Strictness
**Pandas:** Allows silent type conversions
**Polars:** Strict typing, explicit casts required
**Example:**
```python
# Pandas: Silently converts to float
pd_df["int_col"] = [1, 2, None, 4] # dtype: float64
# Polars: Keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]}) # dtype: Int64
```
## Operation Mappings
### Data Selection
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
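A minimal side-by-side sketch of these selections, using hypothetical `name`/`age`/`city` columns:
```python
import pandas as pd
import polars as pl

data = {"name": ["Ann", "Bob"], "age": [30, 22], "city": ["NY", "LA"]}
pd_df = pd.DataFrame(data)
pl_df = pl.DataFrame(data)

pd_col = pd_df["name"]                  # pandas Series
pl_col = pl_df.select("name")           # Polars DataFrame with one column

pd_sel = pd_df.loc[pd_df["age"] > 25, ["name", "age"]]
pl_sel = pl_df.filter(pl.col("age") > 25).select("name", "age")
```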
### Data Filtering
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |
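A small sketch of combined predicates and null checks (multiple positional predicates in `filter` are AND-ed in recent Polars versions); the data is hypothetical:
```python
import polars as pl

df = pl.DataFrame({"age": [30, None, 40], "city": ["NY", "LA", "NY"]})

# Multiple predicates are combined with AND
adults_in_ny = df.filter(pl.col("age") > 25, pl.col("city") == "NY")

# Explicit boolean operators need parentheses, as in pandas
either = df.filter((pl.col("age") > 35) | (pl.col("city") == "LA"))

# Null checks replace isna/notna
known_age = df.filter(pl.col("age").is_not_null())
```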
### Adding/Modifying Columns
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |
**Important difference - Parallel execution:**
```python
# Pandas: Sequential (lambda sees previous results)
df.assign(
a=lambda df_: df_.value * 10,
b=lambda df_: df_.value * 100
)
# Polars: Parallel (all expressions computed together; b cannot reference a within the same call)
df.with_columns(
a=pl.col("value") * 10,
b=pl.col("value") * 100
)
```
### Grouping and Aggregation
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |
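A short sketch of the aggregation syntax with hypothetical `category`/`value` columns:
```python
import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

result = df.group_by("category").agg(
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").sum().alias("value_sum"),
    pl.len().alias("n_rows"),  # group size, like groupby(...).size()
)
# group_by does not sort group keys by default; chain .sort("category") if order matters
```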
### Window Functions
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |
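A minimal sketch of `.over()` window expressions on hypothetical `category`/`value` columns:
```python
import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b"], "value": [10, 20, 5]})

df = df.with_columns(
    group_mean=pl.col("value").mean().over("category"),   # like groupby(...).transform("mean")
    group_rank=pl.col("value").rank().over("category"),
    prev_value=pl.col("value").shift(1).over("category"), # like groupby(...)["value"].shift(1)
)
```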
### Joins
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |
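A small join sketch with hypothetical `users`/`orders` frames:
```python
import polars as pl

users = pl.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
orders = pl.DataFrame({"user_id": [1, 1, 3], "amount": [10.0, 20.0, 5.0]})

# Left join on differently named keys
joined = users.join(orders, left_on="id", right_on="user_id", how="left")

# how defaults to "inner"
inner = users.join(orders, left_on="id", right_on="user_id")
```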
### Concatenation
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |
### Sorting
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |
### Reshaping
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(on="b", index="a", values="c")` (Polars ≥ 1.0; older versions used `columns="b"`) |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |
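A minimal reshape sketch (keyword names follow Polars ≥ 1.0; the `id`/`metric` data is hypothetical, with unique key pairs so `pivot` needs no aggregation):
```python
import polars as pl

long = pl.DataFrame({
    "id": [1, 1, 2, 2],
    "metric": ["height", "weight", "height", "weight"],
    "value": [180, 75, 165, 60],
})

# Long -> wide: one column per metric
wide = long.pivot(on="metric", index="id", values="value")

# Wide -> long: unpivot replaces pandas' melt
back = wide.unpivot(index="id", variable_name="metric", value_name="value")
```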
### I/O Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |
### String Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |
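A brief sketch of chaining string expressions inside `with_columns`, on a hypothetical `name` column:
```python
import polars as pl

df = pl.DataFrame({"name": ["Alice Smith", "bob jones"]})

df = df.with_columns(
    upper=pl.col("name").str.to_uppercase(),
    has_smith=pl.col("name").str.contains("Smith"),
    first_name=pl.col("name").str.split(" ").list.first(),  # split returns a List column
)
```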
### Datetime Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |
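A small sketch of parsing and decomposing dates, assuming a hypothetical `date_str` column; note the two `with_columns` calls, since expressions in one call cannot see columns created in that same call:
```python
import polars as pl

df = pl.DataFrame({"date_str": ["2024-01-15", "2024-02-20"]})

df = df.with_columns(
    date=pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d")
).with_columns(
    year=pl.col("date").dt.year(),
    month=pl.col("date").dt.month(),
)
```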
### Missing Data
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` (or `df.fillna(method="ffill")` in older pandas) | `df.select(pl.col("col").fill_null(strategy="forward"))` |
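A short sketch of the null-handling equivalents on a hypothetical `value` column:
```python
import polars as pl

df = pl.DataFrame({"value": [1.0, None, 3.0, None]})

df = df.with_columns(
    filled_zero=pl.col("value").fill_null(0),
    filled_forward=pl.col("value").fill_null(strategy="forward"),
    is_missing=pl.col("value").is_null(),
)
no_nulls = df.drop_nulls("value")
```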
### Other Operations
| Operation | Pandas | Polars |
|-----------|--------|--------|
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |
## Common Migration Patterns
### Pattern 1: Chained Operations
**Pandas:**
```python
result = (df
.assign(new_col=lambda x: x["old_col"] * 2)
.query("new_col > 10")
.groupby("category")
.agg({"value": "sum"})
.reset_index()
)
```
**Polars:**
```python
result = (df
.with_columns(new_col=pl.col("old_col") * 2)
.filter(pl.col("new_col") > 10)
.group_by("category")
.agg(pl.col("value").sum())
)
# No reset_index needed - Polars doesn't have index
```
### Pattern 2: Apply Functions
**Pandas:**
```python
# Row-wise apply - do not port this 1:1 to Polars; it breaks parallelization
df["result"] = df["value"].apply(lambda x: x * 2)
```
**Polars:**
```python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)
# If custom function needed
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
```
### Pattern 3: Conditional Column Creation
**Pandas:**
```python
df["category"] = np.where(
df["value"] > 100,
"high",
np.where(df["value"] > 50, "medium", "low")
)
```
**Polars:**
```python
df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
    .then(pl.lit("high"))      # use pl.lit: bare strings in then/otherwise are read as column names
    .when(pl.col("value") > 50)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
)
```
### Pattern 4: Group Transform
**Pandas:**
```python
df["group_mean"] = df.groupby("category")["value"].transform("mean")
```
**Polars:**
```python
df = df.with_columns(
group_mean=pl.col("value").mean().over("category")
)
```
### Pattern 5: Multiple Aggregations
**Pandas:**
```python
result = df.groupby("category").agg({
"value": ["mean", "sum", "count"],
"price": ["min", "max"]
})
```
**Polars:**
```python
result = df.group_by("category").agg(
pl.col("value").mean().alias("value_mean"),
pl.col("value").sum().alias("value_sum"),
pl.col("value").count().alias("value_count"),
pl.col("price").min().alias("price_min"),
pl.col("price").max().alias("price_max")
)
```
## Performance Anti-Patterns to Avoid
### Anti-Pattern 1: Sequential Pipe Operations
**Bad (disables parallelization):**
```python
df = df.pipe(function1).pipe(function2).pipe(function3)
```
**Good (enables parallelization):**
```python
df = df.with_columns(
function1_result(),
function2_result(),
function3_result()
)
```
### Anti-Pattern 2: Python Functions in Hot Paths
**Bad:**
```python
df = df.with_columns(
result=pl.col("value").map_elements(lambda x: x * 2)
)
```
**Good:**
```python
df = df.with_columns(result=pl.col("value") * 2)
```
### Anti-Pattern 3: Using Eager Reading for Large Files
**Bad:**
```python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
```
**Good:**
```python
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Anti-Pattern 4: Row Iteration
**Bad:**
```python
for row in df.iter_rows():
# Process row
pass
```
**Good:**
```python
# Use vectorized expressions instead (hypothetical price/quantity columns for illustration)
df = df.with_columns(
    total=pl.col("price") * pl.col("quantity")
)
```
## Migration Checklist
When migrating from pandas to Polars:
1. **Remove index operations** - Use integer positions or group_by
2. **Replace apply/map with expressions** - Use Polars native operations
3. **Update column assignment** - Use `with_columns()` instead of direct assignment
4. **Change groupby.transform to .over()** - Window functions work differently
5. **Update string operations** - Use `.str.to_uppercase()` instead of `.str.upper()`
6. **Add explicit type casts** - Polars won't silently convert types
7. **Consider lazy evaluation** - Use `scan_*` instead of `read_*` for large data
8. **Update aggregation syntax** - More explicit in Polars
9. **Remove reset_index calls** - Not needed in Polars
10. **Update conditional logic** - Use `when().then().otherwise()` pattern
## Compatibility Layer
For gradual migration, you can use both libraries:
```python
import pandas as pd
import polars as pl
# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)
# Convert Polars to pandas
pd_df = pl_df.to_pandas()
# Use Arrow for (near) zero-copy conversion when dtypes allow
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()
```
## When to Stick with Pandas
Consider staying with pandas when:
- Working with time series requiring complex index operations
- Need extensive ecosystem support (some libraries only support pandas)
- Team lacks Polars experience
- Data is small and performance isn't critical
- Using advanced pandas features without Polars equivalents
## When to Switch to Polars
Switch to Polars when:
- Performance is critical
- Working with large datasets (>1GB)
- Need lazy evaluation and query optimization
- Want better type safety
- Need parallel execution by default
- Starting a new project