# Pandas to Polars Migration Guide

This guide helps you migrate from pandas to Polars, with comprehensive operation mappings and notes on the key differences between the libraries.
## Core Conceptual Differences
### 1. No Index System

**Pandas:** Uses a label-based index for row access

```python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")
```
**Polars:** Uses integer positions only

```python
df[0, "column"]  # Row position, column name
df[0:5]          # Row slice
# No set_index equivalent - use filter or group_by instead
```
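A label lookup therefore becomes an ordinary filter. A minimal sketch (the `id` column and lookup value are illustrative):

```python
import polars as pl

df = pl.DataFrame({"id": [10, 20, 30], "value": ["a", "b", "c"]})

# pandas: df.set_index("id").loc[20, "value"]
# Polars: filter on the key column, then take the scalar
value = df.filter(pl.col("id") == 20)["value"][0]  # "b"
```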
### 2. Memory Format

**Pandas:** NumPy-backed arrays (consolidated into 2D blocks internally)
**Polars:** Columnar Apache Arrow format
Implications:

- Polars is faster for column-wise operations
- Polars typically uses less memory
- Polars can share data with other Arrow-based tools with little or no copying
### 3. Parallelization

**Pandas:** Primarily single-threaded (requires Dask or similar for parallelism)
**Polars:** Parallel by default using Rust's concurrency
### 4. Lazy Evaluation

**Pandas:** Only eager evaluation
**Polars:** Both eager (`DataFrame`) and lazy (`LazyFrame`) with query optimization
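A minimal sketch of the lazy API (the file name is illustrative). Nothing is read or computed until `.collect()`, which gives the optimizer a chance to push the filter and column selection into the scan:

```python
import polars as pl

lf = pl.scan_csv("sales.csv")        # LazyFrame; no I/O happens yet
result = (
    lf.filter(pl.col("amount") > 0)  # recorded in the query plan
      .select("region", "amount")
      .collect()                     # optimize and execute here
)
```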
### 5. Type Strictness

**Pandas:** Allows silent type conversions
**Polars:** Strict typing, explicit casts required
Example:

```python
# Pandas: silently converts to float to hold the missing value
pd_df["int_col"] = [1, 2, None, 4]  # dtype: float64

# Polars: keeps the column as integer with a null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]})  # dtype: Int64
```
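When a different type is actually wanted, cast explicitly:

```python
# Explicit cast; invalid casts raise instead of converting silently
pl_df = pl_df.with_columns(pl.col("int_col").cast(pl.Float64))
```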
## Operation Mappings

### Data Selection
| Operation | Pandas | Polars |
|---|---|---|
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
### Data Filtering

| Operation | Pandas | Polars |
|---|---|---|
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |
### Adding/Modifying Columns

| Operation | Pandas | Polars |
|---|---|---|
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |
Important difference - parallel execution:

```python
# Pandas: sequential (each lambda sees the previous assignments)
df.assign(
    a=lambda df_: df_.value * 10,
    b=lambda df_: df_.value * 100,
)

# Polars: parallel (all expressions computed together)
df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100,
)
```
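A corollary: expressions within a single `with_columns` call cannot reference columns created in that same call. If `b` depends on `a`, chain two calls:

```python
# A sequential dependency requires two with_columns calls
df = (
    df.with_columns(a=pl.col("value") * 10)
      .with_columns(b=pl.col("a") + 1)
)
```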
### Grouping and Aggregation

| Operation | Pandas | Polars |
|---|---|---|
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |
### Window Functions

| Operation | Pandas | Polars |
|---|---|---|
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |
### Joins

| Operation | Pandas | Polars |
|---|---|---|
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |
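When non-key column names clash, pandas appends `_x`/`_y` by default while Polars appends `_right` to the columns from the right frame; this is configurable through the `suffix` parameter:

```python
# Control the suffix applied to clashing columns from df2
df1.join(df2, on="id", how="inner", suffix="_b")
```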
### Concatenation

| Operation | Pandas | Polars |
|---|---|---|
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |
### Sorting

| Operation | Pandas | Polars |
|---|---|---|
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |
### Reshaping

| Operation | Pandas | Polars |
|---|---|---|
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(on="b", index="a", values="c")` |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |

Note: Polars 1.0 renamed `melt` to `unpivot` and pivot's `columns` parameter to `on`; on older versions use `df.melt(id_vars="id")` and `df.pivot(values="c", index="a", columns="b")`.
### I/O Operations

| Operation | Pandas | Polars |
|---|---|---|
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |
### String Operations

| Operation | Pandas | Polars |
|---|---|---|
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |
### Datetime Operations

| Operation | Pandas | Polars |
|---|---|---|
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |
### Missing Data

| Operation | Pandas | Polars |
|---|---|---|
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` (`fillna(method="ffill")` is deprecated) | `df.select(pl.col("col").fill_null(strategy="forward"))` |
### Other Operations

| Operation | Pandas | Polars |
|---|---|---|
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |
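Even where the names match, return shapes can differ. For example, pandas `value_counts()` returns a Series indexed by value, while Polars returns a two-column DataFrame:

```python
# Polars: DataFrame with the value column plus a "count" column
df["col"].value_counts()  # shape: (n_unique, 2)
```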
## Common Migration Patterns

### Pattern 1: Chained Operations

Pandas:

```python
result = (df
    .assign(new_col=lambda x: x["old_col"] * 2)
    .query("new_col > 10")
    .groupby("category")
    .agg({"value": "sum"})
    .reset_index()
)
```

Polars:

```python
result = (df
    .with_columns(new_col=pl.col("old_col") * 2)
    .filter(pl.col("new_col") > 10)
    .group_by("category")
    .agg(pl.col("value").sum())
)
# No reset_index needed - Polars has no index
```
### Pattern 2: Apply Functions

Pandas:

```python
# Don't translate this literally - per-element Python calls break parallelization
df["result"] = df["value"].apply(lambda x: x * 2)
```

Polars:

```python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)

# If a custom Python function is truly needed
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)
```
### Pattern 3: Conditional Column Creation

Pandas:

```python
df["category"] = np.where(
    df["value"] > 100,
    "high",
    np.where(df["value"] > 50, "medium", "low")
)
```

Polars:

```python
# Wrap string literals in pl.lit(); bare strings passed to then()/otherwise()
# are interpreted as column names
df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
    .then(pl.lit("high"))
    .when(pl.col("value") > 50)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
)
```
### Pattern 4: Group Transform

Pandas:

```python
df["group_mean"] = df.groupby("category")["value"].transform("mean")
```

Polars:

```python
df = df.with_columns(
    group_mean=pl.col("value").mean().over("category")
)
```
### Pattern 5: Multiple Aggregations

Pandas:

```python
result = df.groupby("category").agg({
    "value": ["mean", "sum", "count"],
    "price": ["min", "max"]
})
```

Polars:

```python
result = df.group_by("category").agg(
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count"),
    pl.col("price").min().alias("price_min"),
    pl.col("price").max().alias("price_max"),
)
```
## Performance Anti-Patterns to Avoid

### Anti-Pattern 1: Sequential Pipe Operations

Bad (each `pipe` call runs to completion before the next starts, so nothing can be parallelized):

```python
df = df.pipe(function1).pipe(function2).pipe(function3)
```

Good (all expressions in one `with_columns` can run in parallel):

```python
# function1_result etc. are placeholders for functions returning Polars expressions
df = df.with_columns(
    function1_result(),
    function2_result(),
    function3_result()
)
```
### Anti-Pattern 2: Python Functions in Hot Paths

Bad (calls back into Python once per element):

```python
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2)
)
```

Good (stays in the native engine):

```python
df = df.with_columns(result=pl.col("value") * 2)
```
### Anti-Pattern 3: Eager Reading of Large Files

Bad (loads the entire file before filtering):

```python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")
```

Good (the scan lets the optimizer push the filter and column selection into the read):

```python
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```
### Anti-Pattern 4: Row Iteration

Bad:

```python
for row in df.iter_rows():
    # Process row by row in Python
    pass
```

Good - express the per-row logic as a vectorized expression (a minimal sketch; the `price` and `quantity` columns are illustrative):

```python
df = df.with_columns(total=pl.col("price") * pl.col("quantity"))
```
## Migration Checklist

When migrating from pandas to Polars:

- Remove index operations - use integer positions or `group_by`
- Replace apply/map with expressions - use Polars-native operations
- Update column assignment - use `with_columns()` instead of direct assignment
- Change `groupby.transform` to `.over()` - window functions work differently
- Update string operations - use `.str.to_uppercase()` instead of `.str.upper()`
- Add explicit type casts - Polars won't silently convert types
- Consider lazy evaluation - use `scan_*` instead of `read_*` for large data
- Update aggregation syntax - aggregations are more explicit in Polars
- Remove `reset_index` calls - not needed in Polars
- Update conditional logic - use the `when().then().otherwise()` pattern
## Compatibility Layer

For gradual migration, you can use both libraries:

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)

# Convert Polars to pandas
pd_df = pl_df.to_pandas()

# Go through Arrow explicitly (zero-copy when the dtypes allow it)
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()
```
## When to Stick with Pandas

Consider staying with pandas when:
- Working with time series requiring complex index operations
- Need extensive ecosystem support (some libraries only support pandas)
- Team lacks Rust/Polars expertise
- Data is small and performance isn't critical
- Using advanced pandas features without Polars equivalents
## When to Switch to Polars

Switch to Polars when:
- Performance is critical
- Working with large datasets (>1GB)
- Need lazy evaluation and query optimization
- Want better type safety
- Need parallel execution by default
- Starting a new project