
Polars Core Concepts

Expressions

Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.

What are Expressions?

An expression describes a transformation on data. It only materializes (executes) within specific contexts:

  • select() - Select and transform columns
  • with_columns() - Add or modify columns
  • filter() - Filter rows
  • group_by().agg() - Aggregate data
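
A minimal sketch of this deferred execution (the data is illustrative):

import polars as pl

expr = pl.col("age") * 2  # nothing runs yet; expr is just a description of a transform
df = pl.DataFrame({"age": [24, 30]})
print(df.select(expr))  # the expression materializes here, inside select()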

Expression Syntax

Basic column reference:

pl.col("column_name")

Computed expressions:

# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")

# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")

# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)

Expression Contexts

Select context:

df.select(
    "name",  # Simple column name
    pl.col("age"),  # Expression
    (pl.col("age") * 12).alias("age_in_months")  # Computed expression
)

With_columns context:

df.with_columns(
    age_doubled=pl.col("age") * 2,
    name_upper=pl.col("name").str.to_uppercase()
)

Filter context:

df.filter(
    pl.col("age") > 25,
    pl.col("city").is_in(["NY", "LA", "SF"])
)

Group_by context:

df.group_by("department").agg(
    pl.col("salary").mean(),
    pl.col("employee_id").count()
)

Expression Expansion

Apply operations to multiple columns at once:

All columns:

df.select(pl.all() * 2)

Pattern matching:

# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)

# All numeric columns (via selectors; pl.NUMERIC_DTYPES is deprecated in newer versions)
import polars.selectors as cs
df.select(cs.numeric() + 1)

Exclude patterns:

df.select(pl.all().exclude("id", "name"))

Expression Composition

Expressions can be stored and reused:

# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()

# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)

Data Types

Polars has a strict type system based on Apache Arrow.

Core Data Types

Numeric:

  • Int8, Int16, Int32, Int64 - Signed integers
  • UInt8, UInt16, UInt32, UInt64 - Unsigned integers
  • Float32, Float64 - Floating point numbers

Text:

  • String (alias Utf8) - UTF-8 encoded strings
  • Categorical - Categorized strings (low cardinality)
  • Enum - Fixed set of string values

Temporal:

  • Date - Calendar date (no time)
  • Datetime - Date and time with optional timezone
  • Time - Time of day
  • Duration - Time duration/difference

Boolean:

  • Boolean - True/False values

Nested:

  • List - Variable-length lists
  • Array - Fixed-length arrays
  • Struct - Nested record structures

Other:

  • Binary - Binary data
  • Object - Python objects (avoid in production)
  • Null - Null type
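
A small sketch constructing a frame with an explicit schema (column names and values are illustrative):

import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": [1, 2],
        "name": ["alice", "bob"],
        "joined": [date(2020, 1, 1), date(2021, 6, 15)],
        "active": [True, False],
    },
    schema={"id": pl.Int32, "name": pl.String, "joined": pl.Date, "active": pl.Boolean},
)
print(df.schema)  # maps each column name to its dtype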

Type Casting

Convert between types explicitly:

# Cast to different type
df.select(
    pl.col("age").cast(pl.Float64),
    pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
    pl.col("id").cast(pl.Utf8)
)

Null Handling

Polars uses consistent null handling across all types:

Check for nulls:

df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())

Fill nulls:

pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")

Drop nulls:

df.drop_nulls()  # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"])  # Drop rows with nulls in specific columns

Categorical Data

Use categorical types for string columns with low cardinality (few distinct values, each repeated across many rows):

# Cast to categorical
df.with_columns(
    pl.col("category").cast(pl.Categorical)
)

# Benefits:
# - Reduced memory usage (each distinct string stored once; rows hold integer codes)
# - Faster grouping and joining
# - Configurable ordering (physical or lexical)
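
A quick way to see the memory effect, using estimated_size (exact numbers vary by version and data):

import polars as pl

df = pl.DataFrame({"category": ["red", "green", "blue"] * 100_000})
print(df.estimated_size("mb"))  # plain String column
print(df.with_columns(pl.col("category").cast(pl.Categorical)).estimated_size("mb"))  # smaller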

Lazy vs Eager Evaluation

Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).

Eager Evaluation (DataFrame)

Operations execute immediately:

import polars as pl

# DataFrame operations execute right away
df = pl.read_csv("data.csv")  # Reads file immediately
result = df.filter(pl.col("age") > 25)  # Filters immediately
final = result.select("name", "age")  # Selects immediately

When to use eager:

  • Small datasets that fit in memory
  • Interactive exploration in notebooks
  • Simple one-off operations
  • Immediate feedback needed

Lazy Evaluation (LazyFrame)

Operations build a query plan that is optimized before execution:

import polars as pl

# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv")  # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25)  # Adds to plan
lf3 = lf2.select("name", "age")  # Adds to plan
df = lf3.collect()  # NOW executes optimized plan

When to use lazy:

  • Large datasets
  • Complex query pipelines
  • Only need subset of data
  • Performance is critical
  • Streaming required

Query Optimization

Polars automatically optimizes lazy queries:

Predicate Pushdown: Filters are pushed down to the data source when possible:

# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()

Projection Pushdown: Only the needed columns are read from the data source:

# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()

Query Plan Inspection:

# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain())  # Shows optimized plan

Streaming Mode

Process data larger than memory:

# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
# Note: newer Polars versions replace streaming=True with .collect(engine="streaming")

Streaming benefits:

  • Process data larger than RAM
  • Lower peak memory usage
  • Chunk-based processing
  • Automatic memory management

Streaming limitations:

  • Not all operations support streaming
  • May be slower for small data
  • Some operations require materializing entire dataset

Converting Between Eager and Lazy

Eager to Lazy:

df = pl.read_csv("data.csv")
lf = df.lazy()  # Convert to LazyFrame

Lazy to Eager:

lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute and return DataFrame

Memory Format

Polars uses the Apache Arrow columnar memory format:

Benefits:

  • Zero-copy data sharing with other Arrow libraries
  • Efficient columnar operations
  • SIMD vectorization
  • Reduced memory overhead
  • Fast serialization

Implications:

  • Data stored column-wise, not row-wise
  • Column operations very fast
  • Random row access is slow compared to column operations
  • Best for analytical workloads
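
Zero-copy interop sketched here via pyarrow (assumes pyarrow is installed):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
table = df.to_arrow()       # a pyarrow.Table backed by the same Arrow buffers
df2 = pl.from_arrow(table)  # and back to a DataFrame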

Parallelization

Polars automatically parallelizes work across CPU cores via its Rust backend:

What gets parallelized:

  • Aggregations within groups
  • Window functions
  • Most expression evaluations
  • File reading (multiple files; see the sketch after this list)
  • Join operations
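
For example, scanning multiple files with a glob pattern lets Polars read them in parallel (the path is hypothetical):

import polars as pl

lf = pl.scan_csv("logs/part_*.csv")  # each matching file can be read concurrently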

What prevents parallelization (avoid these):

  • Python user-defined functions (UDFs)
  • Lambda functions in .map_elements()
  • Sequential .pipe() chains

Best practice:

# Good: Stays in expression API (parallelized)
df.with_columns(
    pl.col("value") * 10,
    pl.col("value").log(),
    pl.col("value").sqrt()
)

# Bad: Uses Python function (sequential)
df.with_columns(
    pl.col("value").map_elements(lambda x: x * 10)
)

Strict Type System

Polars enforces strict typing:

No silent conversions:

# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")

# Must cast explicitly
df.with_columns(
    pl.col("int_col").cast(pl.Utf8) + "_suffix"
)

Benefits:

  • Prevents silent bugs
  • Predictable behavior
  • Better performance
  • Clearer code intent

Integer nulls: Unlike pandas, integer columns can have nulls without converting to float:

# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)