# Polars Core Concepts
## Expressions
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
### What are Expressions?
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
- `select()` - Select and transform columns
- `with_columns()` - Add or modify columns
- `filter()` - Filter rows
- `group_by().agg()` - Aggregate data
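Because an expression is only a description, building one performs no computation on its own; it runs when handed to one of the contexts above. A minimal sketch (the column name and data are illustrative):
```python
import polars as pl
df = pl.DataFrame({"age": [22, 31, 45]})
# Building the expression computes nothing; it is just a description
is_adult = pl.col("age") > 25
# It materializes only inside a context such as filter()
print(df.filter(is_adult))  # 2 rows: ages 31 and 45
```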
### Expression Syntax
**Basic column reference:**
```python
pl.col("column_name")
```
**Computed expressions:**
```python
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")
# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)
```
### Expression Contexts
**Select context:**
```python
df.select(
"name", # Simple column name
pl.col("age"), # Expression
(pl.col("age") * 12).alias("age_in_months") # Computed expression
)
```
**With_columns context** (unlike `select`, it keeps all existing columns):
```python
df.with_columns(
age_doubled=pl.col("age") * 2,
name_upper=pl.col("name").str.to_uppercase()
)
```
**Filter context** (multiple predicates are combined with logical AND):
```python
df.filter(
pl.col("age") > 25,
pl.col("city").is_in(["NY", "LA", "SF"])
)
```
**Group_by context:**
```python
df.group_by("department").agg(
pl.col("salary").mean(),
pl.col("employee_id").count()
)
```
### Expression Expansion
Apply operations to multiple columns at once:
**All columns:**
```python
df.select(pl.all() * 2)
```
**Pattern matching:**
```python
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)
# All numeric columns (via the selectors module in recent Polars;
# pl.NUMERIC_DTYPES was removed in newer releases)
import polars.selectors as cs
df.select(cs.numeric().as_expr() + 1)
```
**Exclude patterns:**
```python
df.select(pl.all().exclude("id", "name"))
```
### Expression Composition
Expressions can be stored and reused:
```python
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()
# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)
```
## Data Types
Polars has a strict type system based on Apache Arrow.
### Core Data Types
**Numeric:**
- `Int8`, `Int16`, `Int32`, `Int64` - Signed integers
- `UInt8`, `UInt16`, `UInt32`, `UInt64` - Unsigned integers
- `Float32`, `Float64` - Floating point numbers
**Text:**
- `Utf8` / `String` - UTF-8 encoded strings
- `Categorical` - Categorized strings (low cardinality)
- `Enum` - Fixed set of string values
**Temporal:**
- `Date` - Calendar date (no time)
- `Datetime` - Date and time with optional timezone
- `Time` - Time of day
- `Duration` - Time duration/difference
**Boolean:**
- `Boolean` - True/False values
**Nested:**
- `List` - Variable-length lists
- `Array` - Fixed-length arrays
- `Struct` - Nested record structures
**Other:**
- `Binary` - Binary data
- `Object` - Python objects (avoid in production)
- `Null` - Null type
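To make these concrete, the sketch below builds a small frame and inspects the schema Polars infers (names and values are illustrative; the exact `schema` repr varies by version):
```python
import datetime
import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 3],                          # Int64
    "name": ["a", "b", "c"],                  # String (Utf8)
    "score": [1.5, 2.0, 3.25],                # Float64
    "born": [datetime.date(2000, 1, 1)] * 3,  # Date
    "tags": [["x"], ["y", "z"], []],          # List(String)
})
print(df.schema)
# {'id': Int64, 'name': String, 'score': Float64, 'born': Date, 'tags': List(String)}
```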
### Type Casting
Convert between types explicitly:
```python
# Cast to different type
df.select(
pl.col("age").cast(pl.Float64),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("id").cast(pl.Utf8)
)
```
### Null Handling
Polars uses consistent null handling across all types:
**Check for nulls:**
```python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
```
**Fill nulls:**
```python
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")
```
**Drop nulls:**
```python
df.drop_nulls() # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
```
### Categorical Data
Use categorical types for string columns with low cardinality (repeated values):
```python
# Cast to categorical
df.with_columns(
pl.col("category").cast(pl.Categorical)
)
```
**Benefits:**
- Reduced memory usage
- Faster grouping and joining
- Optional ordered categories (by appearance or lexical order)
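One caveat worth knowing (stated here as an assumption about current Polars behavior): Categorical columns created independently get independent encodings, so joining or comparing them across frames generally requires a shared string cache. A minimal sketch:
```python
import polars as pl
# Build both frames under one string cache so their categorical
# encodings are compatible for the join
with pl.StringCache():
    left = pl.DataFrame({"cat": ["a", "b", "c"]}).with_columns(
        pl.col("cat").cast(pl.Categorical)
    )
    right = pl.DataFrame({"cat": ["b", "c"], "n": [1, 2]}).with_columns(
        pl.col("cat").cast(pl.Categorical)
    )
    joined = left.join(right, on="cat", how="inner")
print(joined)
```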
## Lazy vs Eager Evaluation
Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).
### Eager Evaluation (DataFrame)
Operations execute immediately:
```python
import polars as pl
# DataFrame operations execute right away
df = pl.read_csv("data.csv") # Reads file immediately
result = df.filter(pl.col("age") > 25) # Filters immediately
final = result.select("name", "age") # Selects immediately
```
**When to use eager:**
- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback is needed
### Lazy Evaluation (LazyFrame)
Operations build a query plan that is optimized before execution:
```python
import polars as pl
# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv") # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25) # Adds to plan
lf3 = lf2.select("name", "age") # Adds to plan
df = lf3.collect() # NOW executes optimized plan
```
**When to use lazy:**
- Large datasets
- Complex query pipelines
- Only a subset of the data is needed
- Performance is critical
- Streaming is required
### Query Optimization
Polars automatically optimizes lazy queries:
**Predicate Pushdown:**
Filter operations are pushed down to the data source when possible:
```python
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
```
**Projection Pushdown:**
Only the needed columns are read from the data source:
```python
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
```
**Query Plan Inspection:**
```python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain()) # Shows optimized plan
```
### Streaming Mode
Process data larger than memory:
```python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
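# Note: newer Polars releases expose this as .collect(engine="streaming")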
```
**Streaming benefits:**
- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management
**Streaming limitations:**
- Not all operations support streaming
- May be slower for small data
- Some operations require materializing entire dataset
### Converting Between Eager and Lazy
**Eager to Lazy:**
```python
df = pl.read_csv("data.csv")
lf = df.lazy() # Convert to LazyFrame
```
**Lazy to Eager:**
```python
lf = pl.scan_csv("data.csv")
df = lf.collect() # Execute and return DataFrame
```
## Memory Format
Polars uses the Apache Arrow columnar memory format:
**Benefits:**
- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization
**Implications:**
- Data is stored column-wise, not row-wise
- Column operations are very fast
- Random row access is slower than in pandas
- Best suited for analytical workloads
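Zero-copy sharing is easy to see with pyarrow (a sketch; assumes pyarrow is installed):
```python
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# Export to a pyarrow.Table; for most dtypes this shares the
# underlying Arrow buffers instead of copying
table = df.to_arrow()
# ...and convert back into a Polars DataFrame
df2 = pl.from_arrow(table)
print(df2.equals(df))  # True
```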
## Parallelization
Polars parallelizes operations automatically across CPU cores using its multithreaded Rust engine:
**What gets parallelized:**
- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations
**What prevents parallelization:**
- Python user-defined functions (UDFs)
- Lambda functions in `.map_elements()`
- Sequential `.pipe()` chains
**Best practice:**
```python
# Good: Stays in expression API (parallelized)
df.with_columns(
pl.col("value") * 10,
pl.col("value").log(),
pl.col("value").sqrt()
)
# Bad: Uses Python function (sequential)
df.with_columns(
pl.col("value").map_elements(lambda x: x * 10)
)
```
## Strict Type System
Polars enforces strict typing:
**No silent conversions:**
```python
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")
# Must cast explicitly
df.with_columns(
pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
```
**Benefits:**
- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent
**Integer nulls:**
Unlike pandas' default NumPy-backed dtypes, integer columns can hold nulls without being converted to float:
```python
# In pandas (NumPy-backed dtypes): an int column with a null is upcast to float
# In Polars: an Int column with a null stays Int (nulls are tracked separately)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)
```