
# Pandas to Polars Migration Guide

This guide maps common pandas operations to their Polars equivalents and explains the key conceptual differences between the two libraries.

## Core Conceptual Differences

### 1. No Index System

**Pandas:** uses a label-based row index

```python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")
```

**Polars:** uses integer positions only

```python
df[0, "column"]  # row position, column name
df[0:5]          # row slice
# No set_index equivalent - filter by value or aggregate with group_by
```
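For label-based lookups, the usual Polars replacement is a filter on the key column. A minimal sketch (the `id`/`value` columns are illustrative):

```python
import polars as pl

df = pl.DataFrame({"id": ["a", "b", "c"], "value": [1, 2, 3]})

# pandas: df.set_index("id").loc["b", "value"]
# Polars: filter on the key column, then take the value
value = df.filter(pl.col("id") == "b")["value"][0]  # -> 2
```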

### 2. Memory Format

**Pandas:** NumPy-backed memory (grouped into per-dtype blocks)

**Polars:** columnar Apache Arrow format

Implications:

- Polars is typically faster for column-wise operations
- Polars generally uses less memory
- Arrow enables zero-copy data sharing with other Arrow-aware tools

### 3. Parallelization

**Pandas:** primarily single-threaded (parallelism typically requires add-ons such as Dask)

**Polars:** parallel by default, using Rust's concurrency

### 4. Lazy Evaluation

**Pandas:** eager evaluation only

**Polars:** both eager (`DataFrame`) and lazy (`LazyFrame`) with query optimization
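A minimal sketch of the two modes (the data and expressions are illustrative):

```python
import polars as pl

# Eager: each step executes immediately
eager = pl.DataFrame({"x": [1, 2, 3]}).filter(pl.col("x") > 1)

# Lazy: build a query plan; nothing runs until .collect()
lazy = (
    pl.LazyFrame({"x": [1, 2, 3]})
    .filter(pl.col("x") > 1)
    .select(pl.col("x") * 2)
)
print(lazy.explain())    # inspect the optimized plan
result = lazy.collect()  # execute
```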

### 5. Type Strictness

**Pandas:** allows silent type conversions

**Polars:** strict typing; explicit casts required

Example:

```python
# Pandas: silently converts to float
pd_df["int_col"] = [1, 2, None, 4]  # dtype: float64

# Polars: keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]})  # dtype: Int64
```
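When you do want the pandas behavior, make the conversion explicit with a cast:

```python
# Explicitly cast the integer column to float
pl_df = pl_df.with_columns(pl.col("int_col").cast(pl.Float64))
```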

## Operation Mappings

### Data Selection

| Operation | Pandas | Polars |
|---|---|---|
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |

### Data Filtering

| Operation | Pandas | Polars |
|---|---|---|
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |
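A small worked example combining several of these (data is illustrative); note that multiple predicates passed to `filter` are combined with AND:

```python
df = pl.DataFrame({"age": [22, 30, 41], "city": ["NY", "LA", "SF"]})

result = df.filter(
    pl.col("age") > 25,                  # predicates are ANDed together
    pl.col("city").is_in(["NY", "LA"]),
)
```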

### Adding/Modifying Columns

| Operation | Pandas | Polars |
|---|---|---|
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |

**Important difference - parallel execution.** In pandas, `assign` evaluates its lambdas sequentially, so each can see earlier results; in Polars, all expressions in a single `with_columns` call are computed in parallel against the input frame.

```python
# Pandas: sequential (each lambda sees previous results)
df.assign(
    a=lambda df_: df_.value * 10,
    b=lambda df_: df_.value * 100
)

# Polars: parallel (all expressions computed together)
df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100
)
```

### Grouping and Aggregation

| Operation | Pandas | Polars |
|---|---|---|
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |
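A short example tying these together (illustrative data):

```python
df = pl.DataFrame({"col": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

result = df.group_by("col").agg(
    pl.col("val").mean().alias("val_mean"),
    pl.len().alias("n_rows"),
)
# group_by does not guarantee output order; pass maintain_order=True if needed
```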

### Window Functions

| Operation | Pandas | Polars |
|---|---|---|
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |
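A quick sketch of `.over()` in practice (column names are illustrative):

```python
df = pl.DataFrame({"grp": ["a", "a", "b"], "val": [10, 20, 5]})

df = df.with_columns(
    grp_mean=pl.col("val").mean().over("grp"),    # like groupby().transform("mean")
    prev_val=pl.col("val").shift(1).over("grp"),  # like groupby().shift(1)
)
```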

### Joins

| Operation | Pandas | Polars |
|---|---|---|
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |
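A minimal runnable example (illustrative frames); `join` defaults to an inner join, but spelling out `how=` keeps intent explicit:

```python
left = pl.DataFrame({"id": [1, 2], "x": ["a", "b"]})
right = pl.DataFrame({"id": [2, 3], "y": ["c", "d"]})

inner = left.join(right, on="id", how="inner")  # keeps only id == 2
left_join = left.join(right, on="id", how="left")
```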

### Concatenation

| Operation | Pandas | Polars |
|---|---|---|
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |

### Sorting

| Operation | Pandas | Polars |
|---|---|---|
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |
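For mixed-order sorts, `descending` accepts one flag per sort key (data is illustrative):

```python
df = pl.DataFrame({"a": [1, 1, 2], "b": [3, 1, 2]})

# Sort "a" ascending and "b" descending in one call
df = df.sort(["a", "b"], descending=[False, True])
```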

### Reshaping

| Operation | Pandas | Polars |
|---|---|---|
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(on="b", index="a", values="c")` |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |
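A small round-trip sketch (illustrative data), assuming Polars 1.x where `pivot` takes `on=` and `melt` is named `unpivot`:

```python
df = pl.DataFrame({"a": ["x", "x", "y"], "b": ["p", "q", "p"], "c": [1, 2, 3]})

wide = df.pivot(on="b", index="a", values="c")
long = wide.unpivot(index="a", variable_name="b", value_name="c")
```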

### I/O Operations

| Operation | Pandas | Polars |
|---|---|---|
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |

### String Operations

| Operation | Pandas | Polars |
|---|---|---|
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |

### Datetime Operations

| Operation | Pandas | Polars |
|---|---|---|
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |
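A short end-to-end sketch of parsing strings and extracting components (dates are illustrative):

```python
df = pl.DataFrame({"date": ["2024-01-15", "2024-02-20"]})

df = df.with_columns(
    date=pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")
).with_columns(
    year=pl.col("date").dt.year(),
    month=pl.col("date").dt.month(),
)
```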

### Missing Data

| Operation | Pandas | Polars |
|---|---|---|
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` | `df.select(pl.col("col").fill_null(strategy="forward"))` |
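Both fill styles side by side (illustrative data):

```python
df = pl.DataFrame({"col": [1.0, None, 3.0]})

df = df.with_columns(
    filled=pl.col("col").fill_null(0),                   # constant fill
    ffilled=pl.col("col").fill_null(strategy="forward"), # forward fill
)
```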

### Other Operations

| Operation | Pandas | Polars |
|---|---|---|
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |

## Common Migration Patterns

### Pattern 1: Chained Operations

**Pandas:**

```python
result = (df
    .assign(new_col=lambda x: x["old_col"] * 2)
    .query("new_col > 10")
    .groupby("category")
    .agg({"value": "sum"})
    .reset_index()
)
```

**Polars:**

```python
result = (df
    .with_columns(new_col=pl.col("old_col") * 2)
    .filter(pl.col("new_col") > 10)
    .group_by("category")
    .agg(pl.col("value").sum())
)
# No reset_index needed - Polars has no index
```

### Pattern 2: Apply Functions

**Pandas:**

```python
# Avoid translating this directly - row-wise apply breaks parallelization
df["result"] = df["value"].apply(lambda x: x * 2)
```

**Polars:**

```python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)

# If a custom Python function is truly needed
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
```

### Pattern 3: Conditional Column Creation

**Pandas:**

```python
df["category"] = np.where(
    df["value"] > 100,
    "high",
    np.where(df["value"] > 50, "medium", "low")
)
```

**Polars:**

```python
# Wrap literals in pl.lit() - bare strings in then()/otherwise()
# are parsed as column names
df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
        .then(pl.lit("high"))
        .when(pl.col("value") > 50)
        .then(pl.lit("medium"))
        .otherwise(pl.lit("low"))
)
```

### Pattern 4: Group Transform

**Pandas:**

```python
df["group_mean"] = df.groupby("category")["value"].transform("mean")
```

**Polars:**

```python
df = df.with_columns(
    group_mean=pl.col("value").mean().over("category")
)
```

### Pattern 5: Multiple Aggregations

**Pandas:**

```python
result = df.groupby("category").agg({
    "value": ["mean", "sum", "count"],
    "price": ["min", "max"]
})
```

**Polars:**

```python
result = df.group_by("category").agg(
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count"),
    pl.col("price").min().alias("price_min"),
    pl.col("price").max().alias("price_max")
)
```

## Performance Anti-Patterns to Avoid

### Anti-Pattern 1: Sequential Pipe Operations

**Bad (disables parallelization):**

```python
df = df.pipe(function1).pipe(function2).pipe(function3)
```

**Good (enables parallelization):**

```python
# Have each function return a pl.Expr, then apply them all in one call
df = df.with_columns(
    function1_expr(),
    function2_expr(),
    function3_expr(),
)
```

### Anti-Pattern 2: Python Functions in Hot Paths

**Bad:**

```python
# map_elements calls back into Python for every element
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2)
)
```

**Good:**

```python
df = df.with_columns(result=pl.col("value") * 2)
```

### Anti-Pattern 3: Using Eager Reading for Large Files

**Bad:**

```python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
```

**Good:**

```python
# scan_csv lets the optimizer push the filter and projection into the read
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```

### Anti-Pattern 4: Row Iteration

**Bad:**

```python
for row in df.iter_rows():
    # Process each row in Python - slow and single-threaded
    pass
```

**Good:**

```python
# Express the per-row logic as vectorized expressions
# ("value" is an illustrative column name)
df = df.with_columns(
    result=pl.col("value") * 2
)
```

## Migration Checklist

When migrating from pandas to Polars:

1. **Remove index operations** - use integer positions or `group_by`
2. **Replace apply/map with expressions** - use Polars-native operations
3. **Update column assignment** - use `with_columns()` instead of direct assignment
4. **Change `groupby.transform` to `.over()`** - window functions work differently
5. **Update string operations** - use `.str.to_uppercase()` instead of `.str.upper()`
6. **Add explicit type casts** - Polars won't silently convert types
7. **Consider lazy evaluation** - use `scan_*` instead of `read_*` for large data
8. **Update aggregation syntax** - more explicit in Polars
9. **Remove `reset_index` calls** - not needed in Polars
10. **Update conditional logic** - use the `when().then().otherwise()` pattern

## Compatibility Layer

For gradual migration, you can use both libraries side by side:

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# Convert pandas to Polars (goes through Arrow internally)
pl_df = pl.from_pandas(pd_df)

# Convert Polars to pandas
pd_df = pl_df.to_pandas()

# Go through Arrow explicitly - zero-copy when the dtypes allow it
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()
```

## When to Stick with Pandas

Consider staying with pandas when:

- Working with time series that rely on complex index operations
- You need broad ecosystem support (some libraries only accept pandas objects)
- The team lacks Polars experience
- Data is small and performance isn't critical
- You depend on advanced pandas features without Polars equivalents

## When to Switch to Polars

Switch to Polars when:

- Performance is critical
- Working with large datasets (>1GB)
- You need lazy evaluation and query optimization
- You want stricter type safety
- You need parallel execution by default
- Starting a new project