---
name: pyspark-patterns
description: PySpark best practices, TableUtilities methods, ETL patterns, logging standards, and DataFrame operations for this project. Use when writing or debugging PySpark code.
---

# PySpark Patterns & Best Practices

Comprehensive guide to PySpark patterns used in the Unify data migration project.

## Core Principle

**Always use DataFrame operations over raw SQL** when possible.
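
For example, prefer the DataFrame API over an equivalent `spark.sql` string; it composes better and is easier to unit test. The table name below is illustrative only:

```python
from pyspark.sql.functions import col

# Preferred: DataFrame API (table name is illustrative)
df_active = spark.table("silver_people.s_person").filter(col("status") == "active")

# Equivalent raw SQL - avoid when a DataFrame expression will do
df_active_sql = spark.sql("SELECT * FROM silver_people.s_person WHERE status = 'active'")
```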

## TableUtilities Class Methods

Central utility class providing standardized DataFrame operations.

### add_row_hash()
Add hash column for change detection and deduplication.

```python
table_utilities = TableUtilities()
df_with_hash = table_utilities.add_row_hash(df)
```
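
Conceptually the hash is a digest over the row's columns. A minimal sketch of that idea (the actual column selection and hash function inside `add_row_hash()` may differ):

```python
from pyspark.sql.functions import sha2, concat_ws, col

# Illustrative only: one SHA-256 digest over every column, used as a change-detection key
df_with_hash = df.withColumn(
    "row_hash",
    sha2(concat_ws("||", *[col(c).cast("string") for c in df.columns]), 256)
)
```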

### save_as_table()
Standard table save with timestamp conversion and automatic filtering.

```python
table_utilities.save_as_table(df, "database.table_name")
```

**Features**:
- Converts timestamp columns automatically
- Filters to the last N years when a `date_created` column exists (controlled by `NUMBER_OF_YEARS`; see the sketch below)
- Prevents full dataset processing in local development
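
A rough sketch of the date filter described above, to show the intent only; the real logic lives inside `save_as_table()`, and the cutoff arithmetic and the value of `NUMBER_OF_YEARS` here are assumptions:

```python
from pyspark.sql.functions import add_months, col, current_date

NUMBER_OF_YEARS = 3  # assumed configuration value
if "date_created" in df.columns:
    # Keep only rows created within the last NUMBER_OF_YEARS years
    df = df.filter(col("date_created") >= add_months(current_date(), -12 * NUMBER_OF_YEARS))
```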

### clean_date_time_columns()
Intelligent timestamp parsing for various date formats.

```python
df_cleaned = table_utilities.clean_date_time_columns(df)
```
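
The idea is to try several known formats and keep the first one that parses. A simplified sketch (the column name and format list are assumptions, not the utility's actual behaviour):

```python
from pyspark.sql.functions import coalesce, to_timestamp, col

# Illustrative: each to_timestamp returns null when its format does not match (ANSI mode off),
# so coalesce keeps the first successful parse
df_cleaned = df.withColumn(
    "date_created",
    coalesce(
        to_timestamp(col("date_created"), "yyyy-MM-dd HH:mm:ss"),
        to_timestamp(col("date_created"), "dd/MM/yyyy HH:mm:ss"),
        to_timestamp(col("date_created"), "yyyy-MM-dd")
    )
)
```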

### Deduplication Methods

**Simple deduplication** (all columns):
```python
df_deduped = table_utilities.drop_duplicates_simple(df)
```

**Advanced deduplication** (specific columns, ordering):
```python
df_deduped = table_utilities.drop_duplicates_advanced(
    df,
    partition_columns=["id"],
    order_columns=["date_created"]
)
```

### filter_and_drop_column()
Remove duplicate flags after processing.

```python
df_filtered = table_utilities.filter_and_drop_column(df, "is_duplicate")
```
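
This is roughly equivalent to filtering on the flag and then dropping it; a sketch only, since the helper's exact flag semantics are an assumption:

```python
from pyspark.sql.functions import col

# Illustrative equivalent: keep rows not flagged as duplicates, then drop the flag column
df_filtered = df.filter(~col("is_duplicate")).drop("is_duplicate")
```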

### generate_deduplicate()
Compare with existing table and identify new/changed records.

```python
df_new = table_utilities.generate_deduplicate(df, "database.existing_table")
```
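
One way to picture the comparison is a hash-based anti-join against the target table. This is only a sketch of the idea; the real method's key columns and hash column name are assumptions:

```python
# Illustrative: rows whose hash is not yet in the existing table are treated as new/changed
existing_hashes = spark.table("database.existing_table").select("row_hash")
df_new = table_utilities.add_row_hash(df).join(existing_hashes, on="row_hash", how="left_anti")
```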

### generate_unique_ids()
Generate auto-incrementing unique identifiers.

```python
df_with_id = table_utilities.generate_unique_ids(df, "unique_id_column_name")
```
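
A common way to produce such IDs is `row_number()` over a global window; whether the utility does exactly this is an assumption:

```python
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Illustrative: sequential IDs starting at 1; the global window pulls data onto one partition,
# so this suits small-to-medium DataFrames
window_spec = Window.orderBy(monotonically_increasing_id())
df_with_id = df.withColumn("unique_id_column_name", row_number().over(window_spec))
```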

## ETL Class Pattern

All silver and gold transformations follow this standardized pattern:

```python
class TableName:
    def __init__(self, bronze_table_name: str):
        self.bronze_table_name = bronze_table_name
        self.silver_database_name = f"silver_{self.bronze_table_name.split('.')[0].split('_')[-1]}"
        self.silver_table_name = self.bronze_table_name.split(".")[-1].replace("b_", "s_")

        # Execute ETL pipeline
        self.extract_sdf = self.extract()
        self.transform_sdf = self.transform()
        self.load()

    @synapse_error_print_handler
    def extract(self) -> DataFrame:
        """Extract data from source tables."""
        logger.info(f"Extracting from {self.bronze_table_name}")
        df = spark.table(self.bronze_table_name)
        logger.success(f"Extracted {df.count()} records")
        return df

    @synapse_error_print_handler
    def transform(self) -> DataFrame:
        """Transform data according to business rules."""
        logger.info("Starting transformation")
        # Apply transformations
        transformed_df = self.extract_sdf.filter(...).select(...)
        logger.success("Transformation complete")
        return transformed_df

    @synapse_error_print_handler
    def load(self) -> None:
        """Load data to target table."""
        logger.info(f"Loading to {self.silver_database_name}.{self.silver_table_name}")
        table_utilities.save_as_table(
            self.transform_sdf,
            f"{self.silver_database_name}.{self.silver_table_name}"
        )
        logger.success(f"Successfully loaded {self.silver_table_name}")


# Instantiate with exception handling
try:
    TableName("bronze_database.b_table_name")
except Exception as e:
    logger.error(f"Error processing TableName: {str(e)}")
    raise e
```

## Logging Standards

### Use NotebookLogger (Never print())

```python
from utilities.session_optimiser import NotebookLogger

logger = NotebookLogger()

# Log levels
logger.info("Starting process")  # Informational messages
logger.warning("Potential issue detected")  # Warnings
logger.error("Operation failed")  # Errors
logger.success("Process completed")  # Success messages
```

### Logging Best Practices

1. **Always include table/database names**:
   ```python
   logger.info(f"Processing table {database}.{table}")
   ```

2. **Log at key milestones**:
   ```python
   logger.info("Starting extraction")
   # ... extraction code
   logger.success("Extraction complete")
   ```

3. **Include counts and metrics**:
   ```python
   logger.info(f"Extracted {df.count()} records from {table}")
   ```

4. **Error context**:
   ```python
   logger.error(f"Failed to process {table}: {str(e)}")
   ```

## Error Handling Pattern

### @synapse_error_print_handler Decorator

Wrap ALL processing functions with this decorator:

```python
from utilities.session_optimiser import synapse_error_print_handler

@synapse_error_print_handler
def extract(self) -> DataFrame:
    # Your code here
    return df
```

**Benefits**:
- Consistent error handling across the codebase
- Automatic error logging
- Graceful error propagation
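
The decorator is provided by `utilities.session_optimiser`; for orientation, here is a minimal sketch of the shape such a decorator typically has (not the project's actual implementation):

```python
import functools

def synapse_error_print_handler(func):
    """Illustrative sketch: log the failing method, then re-raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            logger.error(f"{func.__name__} failed: {str(e)}")
            raise
    return wrapper
```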

### Exception Handling at Instantiation

```python
try:
    MyETLClass("source_table")
except Exception as e:
    logger.error(f"Error processing MyETLClass: {str(e)}")
    raise e
```

## DataFrame Operations Patterns

### Filtering
```python
# Use col() for clarity
from pyspark.sql.functions import col

df_filtered = df.filter(col("status") == "active")
df_filtered = df.filter((col("age") > 18) & (col("country") == "AU"))
```

### Selecting and Aliasing
```python
from pyspark.sql.functions import col, lit

df_selected = df.select(
    col("id"),
    col("name").alias("person_name"),
    lit("constant_value").alias("constant_column")
)
```

### Joins
```python
# Always use explicit join keys and type
df_joined = df1.join(
    df2,
    df1["id"] == df2["person_id"],
    "inner"  # inner, left, right, outer
)

# Drop duplicate columns after join
df_joined = df_joined.drop(df2["person_id"])
```

### Window Functions
```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank

window_spec = Window.partitionBy("category").orderBy(col("date").desc())

df_windowed = df.withColumn(
    "row_num",
    row_number().over(window_spec)
).filter(col("row_num") == 1)
```

### Aggregations
```python
# Note: sum, max and min here shadow the Python built-ins of the same name
from pyspark.sql.functions import sum, avg, count, max, min

df_agg = df.groupBy("category").agg(
    count("*").alias("total_count"),
    sum("amount").alias("total_amount"),
    avg("amount").alias("avg_amount")
)
```

## JDBC Connection Pattern

```python
import os

def get_connection_properties() -> dict:
    """Get JDBC connection properties."""
    return {
        "user": os.getenv("DB_USER"),
        "password": os.getenv("DB_PASSWORD"),
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    }

# Use for JDBC reads (jdbc_url is defined elsewhere)
df = spark.read.jdbc(
    url=jdbc_url,
    table="schema.table",
    properties=get_connection_properties()
)
```

## Session Management

### Get Optimized Spark Session
```python
from utilities.session_optimiser import SparkOptimiser

spark = SparkOptimiser.get_optimised_spark_session()
```

### Reset Spark Context
```python
table_utilities.reset_spark_context()
```

**When to use**:
- Memory issues
- Multiple Spark sessions
- After large operations

## Memory Management

### Caching
```python
# Cache frequently accessed DataFrames
df_cached = df.cache()

# Unpersist when done
df_cached.unpersist()
```

### Partitioning
```python
# Repartition for better parallelism
df_repartitioned = df.repartition(10)

# Coalesce to reduce partitions
df_coalesced = df.coalesce(1)
```

## Common Pitfalls to Avoid

1. **Don't use print() statements** - Use the logger methods instead
2. **Don't read entire tables without filtering** - Filter as early as possible
3. **Don't create DataFrames inside loops** - Collect and batch instead (see the sketch below)
4. **Don't use collect() on large DataFrames** - Keep processing distributed
5. **Don't forget to unpersist cached DataFrames** - Cached data otherwise leaks memory
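
For pitfall 3, the sketch below shows the batch-then-union approach: build all the inputs first and combine them in one step, rather than doing per-iteration work on each DataFrame (table names are illustrative):

```python
from functools import reduce

# Illustrative: read every source, then union once instead of acting inside the loop
source_tables = ["bronze_db.b_table_a", "bronze_db.b_table_b", "bronze_db.b_table_c"]
dfs = [spark.table(name) for name in source_tables]
df_all = reduce(lambda left, right: left.unionByName(right), dfs)
```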

## Performance Tips

1. **Filter early**: Reduce data volume as soon as possible
2. **Use broadcast for small tables**: Optimizes joins by avoiding a shuffle of the large side (see the example below)
3. **Partition strategically**: Balance parallelism against overhead
4. **Cache wisely**: Only cache DataFrames that are reused
5. **Use window functions**: Instead of self-joins
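
The broadcast hint from tip 2 in practice (DataFrame and column names are illustrative):

```python
from pyspark.sql.functions import broadcast

# Broadcasting the small lookup table lets Spark avoid shuffling the large side of the join
df_enriched = df_large.join(broadcast(df_country_lookup), on="country_code", how="left")
```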

## Code Quality Standards

### Type Hints
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def process_data(df: DataFrame, table_name: str) -> DataFrame:
    return df.filter(col("active") == True)
```

### Line Length
**Maximum: 240 characters** (not the standard 88/120)

### Blank Lines
**No blank lines inside functions** - Keep functions compact

### Imports
All imports go at the top of the file, never inside functions:
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit, when
from utilities.session_optimiser import TableUtilities, NotebookLogger
```