---
name: polars
description: "Fast DataFrame library built on Apache Arrow. Covers select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, and the expression API for high-performance data analysis workflows."
---

# Polars

## Overview

Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.

## Quick Start

### Installation and Basic Usage

Install Polars:
```bash
uv pip install polars
```

Basic DataFrame creation and operations:
```python
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"]
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)
```

## Core Concepts

### Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

**Key principles:**
- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (`select`, `with_columns`, `filter`, `group_by`)

**Example:**
```python
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months")
)
```
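
Because expressions are plain Python objects until they hit a context, they can be stored and reused. A minimal sketch (the `is_adult` name is just for illustration):
```python
# Define an expression once, reuse it in several contexts
is_adult = pl.col("age") >= 18

df.filter(is_adult)                                   # use in filter
df.with_columns(adult=is_adult)                       # use in with_columns
df.group_by(is_adult.alias("adult")).agg(pl.len())    # use as a group key
```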

### Lazy vs Eager Evaluation

**Eager (DataFrame):** Operations execute immediately
```python
df = pl.read_csv("file.csv")  # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
```

**Lazy (LazyFrame):** Operations build a query plan, optimized before execution
```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()  # Now executes optimized query
```

**When to use lazy:**
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- When performance is critical

**Benefits of lazy evaluation** (see the plan-inspection sketch below):
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
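
To see these optimizations at work, inspect the query plan with `LazyFrame.explain()`. A minimal sketch (the file name is illustrative):
```python
lf = (
    pl.scan_csv("file.csv")
    .filter(pl.col("age") > 25)   # predicate pushdown candidate
    .select("name", "age")        # projection pushdown candidate
)
print(lf.explain())  # prints the optimized plan before any data is read
```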

For detailed concepts, load `references/core_concepts.md`.

## Common Operations

### Select
Select and manipulate columns:
```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age")
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```

### Filter
Filter rows by conditions:
```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (cleaner than using &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# Complex conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
```

### With Columns
Add or modify columns while preserving existing ones:
```python
# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase()
)

# All expressions in one with_columns call are computed in parallel
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)
```

### Group By and Aggregations
Group data and compute aggregations:
```python
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)
```

For detailed operation patterns, load `references/operations.md`.

## Aggregations and Window Functions

### Aggregation Functions
Common aggregations within the `group_by` context (combined in the sketch below):
- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values
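
A sketch combining several of these in one `agg` call (column names are illustrative):
```python
df.group_by("city").agg(
    pl.len().alias("n_rows"),
    pl.col("age").sum().alias("age_sum"),
    pl.col("age").mean().alias("age_mean"),
    pl.col("age").min().alias("age_min"),
    pl.col("age").max().alias("age_max"),
    pl.first("name").alias("first_name"),
)
```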

### Window Functions with `over()`
Apply aggregations while preserving row count:
```python
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city")
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)
```

**Mapping strategies** (passed as `mapping_strategy` to `over()`; see the sketch below):
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster, but rows come back grouped together rather than in the original order
- `join`: Repeats the aggregated result as a list column on every row of the group
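
A minimal sketch of the `join` strategy (assumes the same `df` with `city` and `age` columns):
```python
# "join": every row receives its city's full list of ages as a list column
df.with_columns(
    city_ages=pl.col("age").over("city", mapping_strategy="join")
)
```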

## Data I/O

### Supported Formats
Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files (see the glob sketch below)
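
Scan functions accept glob patterns, which is a simple way to treat a directory of partitioned files as one lazy query. A minimal sketch (the path and column names are illustrative):
```python
# Lazily scan every Parquet file in a directory and combine them
lf = pl.scan_parquet("data/events/*.parquet")
daily = lf.group_by("date").agg(pl.len()).collect()
```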

### Common I/O Operations

**CSV:**
```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```

**Parquet (recommended for performance):**
```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```

**JSON:**
```python
df = pl.read_json("file.json")
df.write_json("output.json")
```

For comprehensive I/O documentation, load `references/io_guide.md`.

## Transformations

### Joins
Combine DataFrames:
```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```

### Concatenation
Stack DataFrames:
```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```

### Pivot and Unpivot
Reshape data:
```python
# Pivot (long to wide)
df.pivot(on="product", index="date", values="sales")

# Unpivot (wide to long)
df.unpivot(index="id", on=["col1", "col2"])
```

For detailed transformation examples, load `references/transformations.md`.

## Pandas Migration

Polars offers significant performance improvements over pandas, along with a cleaner API. Key differences:

### Conceptual Differences
- **No index**: Polars uses integer positions only
- **Strict typing**: No silent type conversions
- **Lazy evaluation**: Available via LazyFrame
- **Parallel by default**: Operations parallelized automatically
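
When migrating incrementally, data often needs to move between the two libraries. A minimal sketch (assumes pandas and pyarrow are installed):
```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"x": [1, 2, 3]})

df = pl.from_pandas(pdf)   # pandas -> Polars
back = df.to_pandas()      # Polars -> pandas
```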

### Common Operation Mappings

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col")["x"].transform("mean")` | `df.with_columns(pl.col("x").mean().over("col"))` |

### Key Syntax Patterns

**Pandas sequential (slow):**
```python
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100
)
```

**Polars parallel (fast):**
```python
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
```

For a comprehensive migration guide, load `references/pandas_migration.md`.

## Best Practices

### Performance Optimization

1. **Use lazy evaluation for large datasets:**
```python
lf = pl.scan_csv("large.csv")  # Don't use read_csv
result = lf.filter(...).select(...).collect()
```

2. **Avoid Python functions in hot paths** (see the sketch after this list):
- Stay within the expression API for parallelization
- Use `.map_elements()` only when necessary
- Prefer native Polars operations
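
A sketch of the difference (the uppercase transformation is illustrative):
```python
# Slow: runs a Python lambda per element, bypassing the parallel engine
df.with_columns(
    pl.col("name").map_elements(lambda s: s.upper(), return_dtype=pl.String)
)

# Fast: native expression, executed in parallel by Polars
df.with_columns(pl.col("name").str.to_uppercase())
```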

3. **Use streaming for very large data:**
```python
lf.collect(streaming=True)
```

4. **Select only needed columns early:**
```python
# Good: Select columns early
lf.select("col1", "col2").filter(...)

# Bad: Filter on all columns first
lf.filter(...).select("col1", "col2")
```

5. **Use appropriate data types** (see the casting sketch below):
- Categorical for low-cardinality strings
- Appropriate integer sizes (e.g. `Int32` vs `Int64`)
- Date/Datetime types for temporal data
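
A casting sketch (column names are illustrative):
```python
df.with_columns(
    pl.col("city").cast(pl.Categorical),        # low-cardinality strings
    pl.col("age").cast(pl.Int32),               # smaller integer width
    pl.col("signup").str.to_date("%Y-%m-%d"),   # parse strings into a Date column
)
```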

### Expression Patterns

**Conditional operations:**
```python
pl.when(condition).then(value).otherwise(other_value)
```
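
For example, to bucket rows by age (the labels are illustrative):
```python
df.with_columns(
    age_group=pl.when(pl.col("age") > 30)
    .then(pl.lit("over_30"))
    .otherwise(pl.lit("30_or_under"))
)
```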

**Column operations across multiple columns:**
```python
df.select(pl.col("^.*_value$") * 2)  # Regex pattern
```

**Null handling:**
```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```

For additional best practices and patterns, load `references/best_practices.md`.

## Resources

This skill includes comprehensive reference documentation:

### references/
- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and the type system
- `operations.md` - Comprehensive guide to all common operations with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.