# Polars Core Concepts

## Expressions

Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.

### What are Expressions?

An expression describes a transformation on data. It only materializes (executes) within specific contexts:

- `select()` - Select and transform columns
- `with_columns()` - Add or modify columns
- `filter()` - Filter rows
- `group_by().agg()` - Aggregate data

### Expression Syntax

**Basic column reference:**

```python
pl.col("column_name")
```

**Computed expressions:**

```python
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")

# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")

# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)
```

### Expression Contexts

**Select context:**

```python
df.select(
    "name",                                      # Simple column name
    pl.col("age"),                               # Expression
    (pl.col("age") * 12).alias("age_in_months")  # Computed expression
)
```

**With_columns context:**

```python
df.with_columns(
    age_doubled=pl.col("age") * 2,
    name_upper=pl.col("name").str.to_uppercase()
)
```

**Filter context:**

```python
df.filter(
    pl.col("age") > 25,
    pl.col("city").is_in(["NY", "LA", "SF"])
)
```

**Group_by context:**

```python
df.group_by("department").agg(
    pl.col("salary").mean(),
    pl.col("employee_id").count()
)
```

### Expression Expansion

Apply operations to multiple columns at once:

**All columns:**

```python
df.select(pl.all() * 2)
```

**Pattern matching:**

```python
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)

# All numeric columns
df.select(pl.col(pl.NUMERIC_DTYPES) + 1)
```

**Exclude patterns:**

```python
df.select(pl.all().exclude("id", "name"))
```

### Expression Composition

Expressions can be stored and reused:

```python
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()

# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)
```
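To make composition concrete, here is a minimal, self-contained sketch; the sample data and the 300-month threshold are illustrative, not part of the API:

```python
import polars as pl

df = pl.DataFrame({
    "name": ["alice", "bob"],
    "age": [30, 25],
})

# Define the expression once...
age_in_months = (pl.col("age") * 12).alias("age_months")

# ...then reuse it across contexts; it only executes inside each one
print(df.select(age_in_months))
print(df.with_columns(age_in_months))
print(df.filter(age_in_months > 300))
```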
## Data Types

Polars has a strict type system based on Apache Arrow.

### Core Data Types

**Numeric:**

- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
- `Float32`, `Float64` - Floating point numbers

**Text:**

- `Utf8` / `String` - UTF-8 encoded strings
- `Categorical` - Categorized strings (low cardinality)
- `Enum` - Fixed set of string values

**Temporal:**

- `Date` - Calendar date (no time)
- `Datetime` - Date and time with optional timezone
- `Time` - Time of day
- `Duration` - Time duration/difference

**Boolean:**

- `Boolean` - True/False values

**Nested:**

- `List` - Variable-length lists
- `Array` - Fixed-length arrays
- `Struct` - Nested record structures

**Other:**

- `Binary` - Binary data
- `Object` - Python objects (avoid in production)
- `Null` - Null type

### Type Casting

Convert between types explicitly:

```python
# Cast to different types
df.select(
    pl.col("age").cast(pl.Float64),
    pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
    pl.col("id").cast(pl.Utf8)
)
```

### Null Handling

Polars uses consistent null handling across all types.

**Check for nulls:**

```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```

**Fill nulls:**

```python
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")
```

**Drop nulls:**

```python
df.drop_nulls()                         # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"])  # Drop rows with nulls in specific columns
```

### Categorical Data

Use categorical types for string columns with low cardinality (repeated values):

```python
# Cast to categorical
df.with_columns(
    pl.col("category").cast(pl.Categorical)
)

# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Maintains order information
```
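As a rough illustration of the memory benefit, this sketch compares the estimated size of a repetitive string column before and after casting to `Categorical`. The data is synthetic, and `estimated_size()` reports only an approximation:

```python
import polars as pl

# A low-cardinality column: three labels repeated 100,000 times each
df = pl.DataFrame({"category": ["a", "b", "c"] * 100_000})

as_utf8 = df.estimated_size()
as_cat = df.with_columns(pl.col("category").cast(pl.Categorical)).estimated_size()

# Categorical stores each distinct string once plus compact integer
# codes, so the encoded column should be considerably smaller
print(f"Utf8: {as_utf8} bytes, Categorical: {as_cat} bytes")
```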
## Lazy vs Eager Evaluation

Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).

### Eager Evaluation (DataFrame)

Operations execute immediately:

```python
import polars as pl

# DataFrame operations execute right away
df = pl.read_csv("data.csv")            # Reads file immediately
result = df.filter(pl.col("age") > 25)  # Filters immediately
final = result.select("name", "age")    # Selects immediately
```

**When to use eager:**

- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback needed

### Lazy Evaluation (LazyFrame)

Operations build a query plan that is optimized before execution:

```python
import polars as pl

# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv")         # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25)  # Adds to plan
lf3 = lf2.select("name", "age")      # Adds to plan
df = lf3.collect()                   # NOW executes the optimized plan
```

**When to use lazy:**

- Large datasets
- Complex query pipelines
- Only a subset of the data is needed
- Performance is critical
- Streaming is required

### Query Optimization

Polars automatically optimizes lazy queries.

**Predicate Pushdown:** Filter operations are pushed to the data source when possible:

```python
# Only reads rows where age > 25 from the CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
```

**Projection Pushdown:** Only the needed columns are read from the data source:

```python
# Only reads the "name" and "age" columns from the CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
```

**Query Plan Inspection:**

```python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain())  # Shows the optimized plan
```

### Streaming Mode

Process data larger than memory:

```python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
```

**Streaming benefits:**

- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management

**Streaming limitations:**

- Not all operations support streaming
- May be slower for small data
- Some operations require materializing the entire dataset

### Converting Between Eager and Lazy

**Eager to Lazy:**

```python
df = pl.read_csv("data.csv")
lf = df.lazy()  # Convert to LazyFrame
```

**Lazy to Eager:**

```python
lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute and return a DataFrame
```

## Memory Format

Polars uses the Apache Arrow columnar memory format.

**Benefits:**

- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization

**Implications:**

- Data is stored column-wise, not row-wise
- Column operations are very fast
- Random row access is slower than in pandas
- Best suited for analytical workloads

## Parallelization

Polars parallelizes operations automatically using Rust's concurrency.

**What gets parallelized:**

- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations

**What prevents parallelization:**

- Python user-defined functions (UDFs)
- Lambda functions in `.map_elements()`
- Sequential `.pipe()` chains

**Best practice:**

```python
# Good: stays in the expression API (parallelized)
# Each derived column needs a distinct name, hence the aliases
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    pl.col("value").log().alias("value_log"),
    pl.col("value").sqrt().alias("value_sqrt"),
)

# Bad: uses a Python function (sequential)
df.with_columns(
    pl.col("value").map_elements(lambda x: x * 10)
)
```
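Logic that seems to require a lambda can often be rewritten with native conditional expressions. A minimal sketch, assuming a numeric `value` column, that clips negatives to zero both ways and checks the results match:

```python
import polars as pl

df = pl.DataFrame({"value": [-2.0, 1.0, 3.0]})

# Sequential: the lambda runs row by row in the Python interpreter
slow = df.with_columns(
    pl.col("value")
    .map_elements(lambda x: x if x > 0 else 0.0, return_dtype=pl.Float64)
    .alias("clipped")
)

# Parallelizable: the same logic as a native when/then/otherwise expression
fast = df.with_columns(
    pl.when(pl.col("value") > 0)
    .then(pl.col("value"))
    .otherwise(0.0)
    .alias("clipped")
)

assert fast["clipped"].to_list() == slow["clipped"].to_list()
```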
## Strict Type System

Polars enforces strict typing.

**No silent conversions:**

```python
# This will error - types can't be mixed implicitly
# df.with_columns(pl.col("int_col") + "string")

# Must cast explicitly
df.with_columns(
    pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
```

**Benefits:**

- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent

**Integer nulls:** Unlike pandas, integer columns can hold nulls without being converted to float:

```python
# In pandas: an int column with a null becomes float
# In Polars: an int column with a null stays int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)
```
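A quick way to verify this behavior (a small sketch; the column name is arbitrary): the dtype survives both the null itself and a later `fill_null`:

```python
import polars as pl

df = pl.DataFrame({"int_col": [1, 2, None, 4]})
assert df.schema["int_col"] == pl.Int64  # stays integer despite the null
assert df["int_col"].null_count() == 1

# Filling the null keeps the integer dtype as well
filled = df.with_columns(pl.col("int_col").fill_null(0))
assert filled.schema["int_col"] == pl.Int64
```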