PySpark Error Fixing Command

Objective

Execute make gold_table and systematically fix all errors encountered in the PySpark gold layer file using specialized agents. Errors may be code-based (syntax, type, runtime) or logical (incorrect joins, missing data, business rule violations).

Agent Workflow (MANDATORY)

Phase 1: Error Fixing with pyspark-engineer

CRITICAL: All PySpark error fixing MUST be performed by the pyspark-engineer agent. Do NOT attempt to fix errors directly.

  1. Launch the pyspark-engineer agent with:

    • Full error stack trace and context
    • Target file path
    • All relevant schema information from MCP server
    • Data dictionary references
  2. The pyspark-engineer will:

    • Validate MCP server connectivity
    • Query schemas and foreign key relationships
    • Analyze and fix all errors systematically
    • Apply fixes following project coding standards
    • Run quality gates (py_compile, ruff check, ruff format)

Phase 2: Code Review with code-reviewer

CRITICAL: After pyspark-engineer completes fixes, MUST launch the code-reviewer agent.

  1. Launch the code-reviewer agent with:

    • Path to the fixed file(s)
    • Context: "PySpark gold layer error fixes"
    • Request comprehensive review focusing on:
      • PySpark best practices
      • Join logic correctness
      • Schema alignment
      • Business rule implementation
      • Code quality and standards adherence
  2. The code-reviewer will provide:

    • Detailed feedback on all issues found
    • Security vulnerabilities
    • Performance optimization opportunities
    • Code quality improvements needed

Phase 3: Iterative Refinement (MANDATORY LOOP)

CRITICAL: The review-refactor cycle MUST continue until code-reviewer is 100% satisfied.

  1. If code-reviewer identifies ANY issues:

    • Launch pyspark-engineer again with code-reviewer's feedback
    • pyspark-engineer implements all recommended changes
    • Launch code-reviewer again to re-validate
  2. Repeat Phase 1 → Phase 2 → Phase 3 until:

    • code-reviewer explicitly states: "✓ 100% SATISFIED - No further changes required"
    • Zero issues, warnings, or concerns remain
    • All quality gates pass
    • All business rules validated
  3. Only then is the error fixing task complete.

DO NOT PROCEED TO COMPLETION until code-reviewer gives explicit 100% satisfaction confirmation.

Pre-Execution Requirements

1. Python Coding Standards (CRITICAL - READ FIRST)

MANDATORY: All code MUST follow .claude/rules/python_rules.md standards:

  • Line 19: Use the DataFrame API, not Spark SQL
  • Line 20: Do NOT use DataFrame aliases (e.g., .alias("l")) or the col() function - use direct string references or df["column"] syntax
  • Line 8: Limit line length to 240 characters
  • Lines 9-10: One statement per line; no carriage returns mid-statement
  • Lines 10, 12: No blank lines inside functions
  • Line 11: Close parentheses on the last line of code
  • Line 5: Use type hints for all function parameters and return values
  • Line 18: Import statements only at the start of the file, never inside functions
  • Line 16: Run ruff check and ruff format before finalizing
  • Import only the necessary PySpark functions: from pyspark.sql.functions import when, coalesce, lit (NO col() usage - use direct references instead)
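
A minimal sketch of a function written to these standards. The column names (reported_date_time, date_created, amount) are illustrative assumptions borrowed from examples later in this document, not verified schema:

from pyspark.sql import DataFrame
from pyspark.sql.functions import coalesce, lit, when


def add_final_timestamp(df: DataFrame) -> DataFrame:
    # Assumed column names - replace with MCP-verified names for the real table
    df = df.withColumn("final_timestamp", when(df["reported_date_time"].isNotNull(), df["reported_date_time"]).otherwise(df["date_created"]))
    df = df.withColumn("amount_clean", coalesce(df["amount"], lit(0)))
    return df

Note the single statement per line, type hints, direct df["column"] references, and the absence of blank lines inside the function body.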

2. Identify Target File

  • Default target: python_files/gold/<INSERT FILE NAME>.py
  • Override via Makefile: G_RUN_FILE_NAME variable (line 63)
  • Verify file exists before execution

3. Environment Context

  • Runtime Environment: Local development (not Azure Synapse)
  • Working Directory: /workspaces/unify_2_1_dm_synapse_env_d10
  • Python Version: 3.11+
  • Spark Mode: Local cluster (local[*])
  • Data Location: /workspaces/data (parquet files)

4. Available Resources

  • Data Dictionary: .claude/data_dictionary/*.md - schema definitions for all CMS, FVMS, NicheRMS tables
  • Configuration: configuration.yaml - database lists, null replacements, Azure settings
  • MCP Schema Server: mcp-server-motherduck - live schema access via MCP (REQUIRED for schema verification)
  • Utilities Module: python_files/utilities/session_optimiser.py - TableUtilities, NotebookLogger, decorators
  • Example Files: Other python_files/gold/g_*.py files for reference patterns

5. MCP Server Validation (CRITICAL)

BEFORE PROCEEDING, verify MCP server connectivity:

  1. Test MCP Server Connection:

    • Attempt to query any known table schema via MCP
    • Example test: Query schema for a common table (e.g., silver_cms.s_cms_offence_report)
  2. Validation Criteria:

    • MCP server must respond with valid schema data
    • Schema must include column names, data types, and nullability
    • Response must be recent (not cached/stale data)
  3. Failure Handling:

    ⚠️  STOP: MCP Server Not Available
    
    The MCP server (mcp-server-motherduck) is not responding or not providing valid schema data.
    
    This command requires live schema access to:
    - Verify column names and data types
    - Validate join key compatibility
    - Check foreign key relationships
    - Ensure accurate schema matching
    
    Actions Required:
    1. Check MCP server status and configuration
    2. Verify MotherDuck connection credentials
    3. Ensure schema database is accessible
    4. Restart MCP server if necessary
    
    Cannot proceed with error fixing without verified schema access.
    Use data dictionary files as fallback, but warn user of potential schema drift.
    
  4. Success Confirmation:

    ✓ MCP Server Connected
    ✓ Schema data available
    ✓ Proceeding with error fixing workflow
    

Error Detection Strategy

Phase 1: Execute and Capture Errors

  1. Run: make gold_table
  2. Capture full stack trace including:
    • Error type (AttributeError, KeyError, AnalysisException, etc.)
    • Line number and function name
    • Failed DataFrame operation
    • Column names involved
    • Join conditions if applicable

Phase 2: Categorize Error Types

A. Code-Based Errors

Syntax/Import Errors

  • Missing imports from pyspark.sql.functions
  • Incorrect function signatures
  • Type hint violations
  • Decorator usage errors

Runtime Errors

  • AnalysisException: Column not found, table doesn't exist
  • AttributeError: Calling non-existent DataFrame methods
  • KeyError: Dictionary access failures
  • TypeError: Incompatible data types in operations

DataFrame Schema Errors

  • Column name mismatches (case sensitivity)
  • Duplicate column names after joins
  • Missing required columns for downstream operations
  • Incorrect column aliases

B. Logical Errors

Join Issues

  • Incorrect Join Keys: Joining on wrong columns (e.g., offence_report_id vs cms_offence_report_id)
  • Missing Table Aliases: Ambiguous column references after joins
  • Wrong Join Types: Using inner when left is required (or vice versa)
  • Cartesian Products: Missing join conditions causing data explosion
  • Broadcast Misuse: Not using broadcast() for small dimension tables
  • Duplicate Join Keys: Multiple rows with same key causing row multiplication
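
A short sketch for detecting and handling duplicate join keys before a fact-to-dimension join. The DataFrame and column names (fact_df, dim_df, dimension_id) are assumptions for illustration, and logger is the NotebookLogger instance described later:

# Count dimension keys that appear more than once
duplicate_keys = dim_df.groupBy("dimension_id").count().filter("count > 1")
logger.info(f"Duplicate dimension keys: {duplicate_keys.count()}")
# Keep a single (arbitrary) row per key so the join does not multiply fact rows
dim_df_dedup = dim_df.dropDuplicates(["dimension_id"])
joined_df = fact_df.join(dim_df_dedup, on="dimension_id", how="left")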

Aggregation Problems

  • Incorrect groupBy() columns
  • Missing aggregation functions (first(), last(), collect_list())
  • Wrong window specifications
  • Aggregating on nullable columns without coalesce()
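
Where a window specification is needed instead of a plain aggregate (for example, keeping only the latest record per key), a minimal sketch with assumed column names (case_file_id, date_created):

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Partition by the business key and order newest-first, then keep the first row per partition
latest_window = Window.partitionBy("case_file_id").orderBy(df["date_created"].desc())
df_latest = df.withColumn("row_num", row_number().over(latest_window)).filter("row_num = 1").drop("row_num")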

Business Rule Violations

  • Incorrect date/time logic (e.g., using reported_date_time when date_created should be the fallback)
  • Missing null handling for critical fields
  • Status code logic errors
  • Incorrect coalesce order

Data Quality Issues

  • Expected vs actual row counts (use logger.info(f"Expected X rows, got {df.count()}"))
  • Null propagation in critical columns
  • Duplicate records not being handled
  • Missing deduplication logic
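
A quick null-audit sketch for critical columns. The column list below is illustrative (taken from the worked example later in this document); take the real list from the MCP schema:

from pyspark.sql.functions import count, when

# Count nulls per critical column in a single pass and log the result
critical_columns = ["cms_offence_report_id", "reported_date_time", "status_code"]
null_counts = df.select([count(when(df[c].isNull(), 1)).alias(c) for c in critical_columns]).collect()[0].asDict()
logger.info(f"Null counts for critical columns: {null_counts}")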

Systematic Debugging Process

Step 1: Schema Verification

For each source table mentioned in the error:

  1. PRIMARY: Query MCP Server for Schema (MANDATORY FIRST STEP):

    • Use MCP tools to query table schema from MotherDuck
    • Extract column names, data types, nullability, and constraints
    • Verify foreign key relationships for join operations
    • Cross-reference with error column names

    Example MCP Query Pattern:

    Query: "Get schema for table silver_cms.s_cms_offence_report"
    Expected Response: Column list with types and constraints
    

    If MCP Server Fails:

    • STOP and warn user (see Section 4: MCP Server Validation)
    • Do NOT proceed with fixing without schema verification
    • Suggest user check MCP server configuration
  2. SECONDARY: Verify Schema Using Data Dictionary (as supplementary reference):

    • Read .claude/data_dictionary/{source}_{table}.md
    • Compare MCP schema vs data dictionary for consistency
    • Note any schema drift or discrepancies
    • Alert user if schemas don't match
  3. Check Table Existence:

    spark.sql("SHOW TABLES IN silver_cms").show()
    
  4. Inspect Actual Runtime Schema (validate MCP data):

    df = spark.read.table("silver_cms.s_cms_offence_report")
    df.printSchema()
    df.select(df.columns[:10]).show(5, truncate=False)
    

    Compare:

    • MCP schema vs Spark runtime schema
    • Report any mismatches to user
    • Use runtime schema as source of truth if conflicts exist
  5. Use DuckDB Schema (if available, as additional validation):

    • Query schema.db for column definitions
    • Check foreign key relationships
    • Validate join key data types
    • Triangulate: MCP + DuckDB + Data Dictionary should align

Step 2: Join Logic Validation

For each join operation:

  1. Use MCP Server to Validate Join Relationships:

    • Query foreign key constraints from MCP schema server
    • Identify correct join column names and data types
    • Verify parent-child table relationships
    • Confirm join key nullability (affects join results)

    Example MCP Queries:

    Query: "Show foreign keys for table silver_cms.s_cms_offence_report"
    Query: "What columns link s_cms_offence_report to s_cms_case_file?"
    Query: "Get data type for column cms_offence_report_id in silver_cms.s_cms_offence_report"
    

    If MCP Returns No Foreign Keys:

    • Fall back to data dictionary documentation
    • Check .claude/data_dictionary/ for relationship diagrams
    • Manually verify join logic with business analyst
  2. Verify Join Keys Exist (using MCP-confirmed column names):

    left_df.select("join_key_column").show(5)
    right_df.select("join_key_column").show(5)
    
  3. Check Join Key Data Type Compatibility (cross-reference with MCP schema):

    # Verify types match MCP schema expectations
    left_df.select("join_key_column").dtypes
    right_df.select("join_key_column").dtypes
    
  4. Check Join Key Uniqueness:

    left_df.groupBy("join_key_column").count().filter("count > 1").show()
    
  5. Validate Join Type:

    • left: Keep all left records (most common for fact-to-dimension)
    • inner: Only matching records
    • Use broadcast() for small lookup tables (< 10MB) - see the broadcast sketch after this list
    • Confirm join type matches MCP foreign key relationship (nullable FK → left join)
  6. Handle Ambiguous Columns:

    # BEFORE (causes ambiguity if both tables share non-key column names)
    joined_df = left_df.join(right_df, on="common_id", how="left")

    # AFTER (trim the right table to the join key plus the columns actually needed before joining)
    right_df_subset = right_df.select("common_id", "dimension_field")
    joined_df = left_df.join(right_df_subset, on="common_id", how="left")
    
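A minimal broadcast join sketch for item 5 above. The lookup table name and its columns are assumptions; confirm the table is genuinely small before broadcasting:

from pyspark.sql.functions import broadcast

# Hypothetical lookup table - broadcasting ships the small side to every executor instead of shuffling the large side
lookup_df = spark.read.table("silver_cms.s_cms_status_lookup").select("status_code", "status_description")
joined_df = fact_df.join(broadcast(lookup_df), on="status_code", how="left")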

Step 3: Aggregation Verification

  1. Check groupBy Columns:

    • Must include all columns not being aggregated
    • Verify columns exist in DataFrame
  2. Validate Aggregation Functions:

    from pyspark.sql.functions import min, max, first, count, sum, coalesce, lit
    
    aggregated = df.groupBy("key").agg(min("date_column").alias("earliest_date"), max("date_column").alias("latest_date"), first("dimension_column", ignorenulls=True).alias("dimension"), count("*").alias("record_count"), coalesce(sum("amount"), lit(0)).alias("total_amount"))
    
  3. Test Aggregation Logic:

    • Run aggregation on small sample
    • Compare counts before/after
    • Check for unexpected nulls
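
One way to test aggregation logic on a bounded sample before running the full job. This is a sketch only; "key" and "date_column" follow the illustrative names used in item 2 above:

from pyspark.sql.functions import count, min

# Aggregate a bounded sample and compare row counts before and after grouping
sample_df = df.limit(1000)
sample_agg = sample_df.groupBy("key").agg(min("date_column").alias("min_date"), count("*").alias("record_count"))
logger.info(f"Sample rows: {sample_df.count()}, grouped rows: {sample_agg.count()}")
sample_agg.filter(sample_agg["min_date"].isNull()).show(5, truncate=False)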

Step 4: Business Rule Testing

  1. Verify Timestamp Logic:

    from pyspark.sql.functions import when
    
    df.select("reported_date_time", "date_created", when(df["reported_date_time"].isNotNull(), df["reported_date_time"]).otherwise(df["date_created"]).alias("final_timestamp")).show(10)
    
  2. Test Null Handling:

    from pyspark.sql.functions import coalesce, lit
    
    df.select("primary_field", "fallback_field", coalesce(df["primary_field"], df["fallback_field"], lit(0)).alias("result")).show(10)
    
  3. Validate Status/Lookup Logic:

    • Check status code mappings against data dictionary
    • Verify conditional logic matches business requirements
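
A hedged sketch of status/lookup validation using a conditional mapping. The status codes and labels below are placeholders; take the real mapping from the data dictionary before applying it:

from pyspark.sql.functions import lit, when

# Placeholder mapping - confirm the actual codes against the data dictionary
df_checked = df.withColumn("status_description", when(df["status_code"] == lit("OP"), lit("Open")).when(df["status_code"] == lit("CL"), lit("Closed")).otherwise(lit("Unknown")))
df_checked.groupBy("status_code", "status_description").count().show(truncate=False)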

Common Error Patterns and Fixes

Pattern 1: Column Not Found After Join

Error: AnalysisException: Column 'offence_report_id' not found

Root Cause: Incorrect column name - verify column exists using MCP schema

Fix:

# BEFORE - wrong column name
df = left_df.join(right_df, on="offence_report_id", how="left")

# AFTER - MCP-verified correct column name
df = left_df.join(right_df, on="cms_offence_report_id", how="left")

# If joining on different column names between tables:
df = left_df.join(
    right_df,
    left_df["cms_offence_report_id"] == right_df["offence_report_id"],
    how="left"
)

Pattern 2: Duplicate Column Names

Error: Multiple columns with same name causing selection issues

Fix:

# BEFORE - causes duplicate 'id' column
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left")

# AFTER - join on the shared column name so Spark keeps a single 'id' column
joined = left_df.join(right_df, on="id", how="left")

# OR - rename columns to avoid duplicates
right_df_renamed = right_df.withColumnRenamed("id", "related_id")
joined = left_df.join(right_df_renamed, left_df["id"] == right_df_renamed["related_id"], how="left")

Pattern 3: Incorrect Aggregation

Error: Column not in GROUP BY causing aggregation failure

Fix:

from pyspark.sql.functions import min, first

# BEFORE - non-aggregated column not in groupBy
df.groupBy("key1").agg(min("date_field"), "non_aggregated_field")

# AFTER - all non-grouped columns must be aggregated
df = df.groupBy("key1").agg(min("date_field").alias("min_date"), first("non_aggregated_field", ignorenulls=True).alias("non_aggregated_field"))

Pattern 4: Join Key Mismatch

Error: No matching records or unexpected cartesian product

Fix:

left_df.select("join_key").show(20)
right_df.select("join_key").show(20)
left_df.select("join_key").dtypes
right_df.select("join_key").dtypes
left_df.filter(left_df["join_key"].isNull()).count()
right_df.filter(right_df["join_key"].isNull()).count()
result = left_df.join(right_df, left_df["join_key"].cast("int") == right_df["join_key"].cast("int"), how="left")

Pattern 5: Missing Null Handling

Error: Unexpected nulls propagating through transformations

Fix:

from pyspark.sql.functions import coalesce, lit

# BEFORE - NULL if either field is NULL
df = df.withColumn("result", df["field1"] + df["field2"])

# AFTER - handle nulls with coalesce
df = df.withColumn("result", coalesce(df["field1"], lit(0)) + coalesce(df["field2"], lit(0)))

Validation Requirements

After fixing errors, validate:

  1. Row Counts: Log and verify expected vs actual counts at each transformation
  2. Schema: Ensure output schema matches target table requirements
  3. Nulls: Check critical columns for unexpected nulls
  4. Duplicates: Verify uniqueness of ID columns
  5. Data Ranges: Check timestamp ranges and numeric bounds
  6. Join Results: Sample joined records to verify correctness
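
A consolidated validation sketch covering several of the checks above. The column names (offence_report_id, final_timestamp) are illustrative; substitute the gold table's actual keys:

from pyspark.sql.functions import max, min

# 1. Row count vs expectation
logger.info(f"Final row count: {final_df.count()}")
# 2. Output schema
final_df.printSchema()
# 3. Duplicate IDs - should return zero rows for a unique key
final_df.groupBy("offence_report_id").count().filter("count > 1").show(5)
# 4. Timestamp range sanity check
final_df.agg(min("final_timestamp").alias("earliest"), max("final_timestamp").alias("latest")).show(truncate=False)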

Logging Requirements

Use NotebookLogger throughout:

logger = NotebookLogger()

# Start of operation
logger.info(f"Starting extraction from {table_name}")

# After DataFrame creation
logger.info(f"Extracted {df.count()} records from {table_name}")

# After join
logger.info(f"Join completed: {joined_df.count()} records (expected ~X)")

# After transformation
logger.info(f"Transformation complete: {final_df.count()} records")

# On error
logger.error(f"Failed to process {table_name}: {error_message}")

# On success
logger.success(f"Successfully loaded {target_table_name}")

Quality Gates (Must Run After Fixes)

# 1. Syntax validation
python3 -m py_compile python_files/gold/g_x_mg_cms_mo.py

# 2. Code quality check
ruff check python_files/gold/g_x_mg_cms_mo.py

# 3. Format code
ruff format python_files/gold/g_x_mg_cms_mo.py

# 4. Run fixed code
make gold_table

Key Principles for PySpark Engineer Agent

  1. CRITICAL: Agent Workflow Required: ALL error fixing must follow the 3-phase agent workflow (pyspark-engineer → code-reviewer → iterative refinement until 100% satisfied)
  2. CRITICAL: Validate MCP Server First: Before starting, verify MCP server connectivity and schema availability. STOP and warn user if unavailable.
  3. Always Query MCP Schema First: Use MCP server to get authoritative schema data before fixing any errors. Cross-reference with data dictionary.
  4. Use MCP for Join Validation: Query foreign key relationships from MCP to ensure correct join logic and column names.
  5. DataFrame API Without Aliases or col(): Use DataFrame API (NOT Spark SQL). NO DataFrame aliases. NO col() function. Use direct string references (e.g., "column_name") or df["column"] syntax (e.g., df["column_name"]). Import only needed functions (e.g., from pyspark.sql.functions import when, coalesce)
  6. Test Incrementally: Fix one error at a time, validate, then proceed
  7. Log Everything: Add logging at every transformation step
  8. Handle Nulls: Always consider null cases in business logic (check MCP nullability constraints)
  9. Verify Join Logic: Check join keys, types, and uniqueness before implementing (use MCP data types)
  10. Use Utilities: Leverage TableUtilities methods (add_row_hash, save_as_table, clean_date_time_columns) - see the hedged sketch after this list
  11. Follow Patterns: Reference working gold layer files for established patterns
  12. Validate Business Rules: Confirm logic with MCP schema, data dictionary, and user story requirements
  13. Clean Code: Adhere to project standards (240 char line length, no blank lines in functions, type hints, single line per statement)
  14. Triple-Check Schemas: When schema mismatch occurs, verify MCP → Runtime → Data Dictionary alignment and report discrepancies
  15. Code Review Loop: Continue refactoring until code-reviewer explicitly confirms 100% satisfaction with zero remaining issues
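
A hedged sketch of how the TableUtilities helpers named in principle 10 might be wired together. The exact constructor and method signatures live in python_files/utilities/session_optimiser.py and must be checked there; treat the calls below as placeholders, not the project's API:

# Placeholder wiring - verify the real method signatures in session_optimiser.py before use
table_utils = TableUtilities(spark)
final_df = table_utils.clean_date_time_columns(final_df)
final_df = table_utils.add_row_hash(final_df)
table_utils.save_as_table(final_df, "gold_data_model.g_x_mg_cms_mo")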

Example Workflow with MCP Integration

Scenario: Fixing AnalysisException - Column 'offence_report_id' not found

Step 1: MCP Server Validation

Agent: Testing MCP server connectivity...
MCP Query: "List databases available"
MCP Response: ✓ [bronze_cms, silver_cms, silver_fvms, gold_data_model, ...]

Agent: ✓ MCP Server Connected
Agent: ✓ Schema data available
Agent: ✓ Proceeding with error fixing workflow

Step 2: Query Schema via MCP

Agent: Querying schema for silver_cms.s_cms_offence_report via MCP...
MCP Query: "DESCRIBE TABLE silver_cms.s_cms_offence_report"
MCP Response:
  - cms_offence_report_id (BIGINT, PRIMARY KEY, NOT NULL)
  - case_file_id (BIGINT, FOREIGN KEY → s_cms_case_file.cms_case_file_id, NULLABLE)
  - reported_date_time (TIMESTAMP, NULLABLE)
  - date_created (TIMESTAMP, NOT NULL)
  - status_code (VARCHAR, NULLABLE)
  ... (50+ more columns)

Agent: Found correct column name: 'cms_offence_report_id' (not 'offence_report_id')

Step 3: Verify Join Relationship via MCP

Agent: Checking foreign key relationship for join...
MCP Query: "Show foreign keys from s_cms_offence_report to s_cms_case_file"
MCP Response:
  - s_cms_offence_report.case_file_id → s_cms_case_file.cms_case_file_id
  - Relationship: MANY-TO-ONE
  - Nullability: case_file_id is NULLABLE → Use LEFT JOIN

Agent: ✓ Join should be: LEFT JOIN on case_file_id = cms_case_file_id

Step 4: Apply Fix with MCP-Verified Schema

# BEFORE (error)
offence_df = spark.read.table("silver_cms.s_cms_offence_report")
case_df = spark.read.table("silver_cms.s_cms_case_file")
joined = offence_df.join(case_df, on="offence_report_id", how="left")

# AFTER (MCP-verified) - Use DataFrame API with correct column names
# MCP-confirmed FK: case_file_id → cms_case_file_id
# MCP-confirmed nullable FK → LEFT JOIN
# MCP-confirmed PK: cms_offence_report_id
case_df_subset = case_df.select("cms_case_file_id", "case_file_number")
joined = offence_df.join(case_df_subset, offence_df["case_file_id"] == case_df_subset["cms_case_file_id"], how="left")
joined = joined.withColumnRenamed("cms_offence_report_id", "offence_report_id")

Step 5: Validate Fix

Agent: Running fixed code...
Agent: ✓ No AnalysisException
Agent: ✓ Join executed successfully
Agent: Row count: 45,823 (expected ~45,000)
Agent: ✓ Schema matches MCP expectations
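
The validation in Step 5 could be scripted roughly as follows; the expected-count figure is illustrative, taken from the agent transcript above:

# Sanity checks mirroring the agent's Step 5 output
row_count = joined.count()
logger.info(f"Join completed: {row_count} records (expected ~45,000)")
joined.printSchema()
null_keys = joined.filter(joined["offence_report_id"].isNull()).count()
logger.info(f"Null offence_report_id rows: {null_keys}")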

Success Criteria

Phase 1: Initial Error Fixing (pyspark-engineer)

  • MCP Server validated and responding (MANDATORY FIRST CHECK)
  • Schema verified via MCP server for all source tables
  • Foreign key relationships confirmed via MCP queries
  • All syntax errors resolved
  • All runtime errors fixed
  • Join logic validated and correct (using MCP-confirmed column names and types)
  • DataFrame API used (NOT Spark SQL) per python_rules.md line 19
  • NO DataFrame aliases or col() function used - direct string references or df["column"] syntax only (per python_rules.md line 20)
  • Code follows python_rules.md standards: 240 char lines, no blank lines in functions, single line per statement, imports at top only
  • Row counts logged and reasonable
  • Business rules implemented correctly
  • Output schema matches requirements (cross-referenced with MCP schema)
  • Code passes quality gates (py_compile, ruff check, ruff format)
  • make gold_table executes successfully
  • Target table created/updated in gold_data_model database
  • No schema drift reported between MCP, Runtime, and Data Dictionary sources

Phase 2: Code Review (code-reviewer)

  • code-reviewer agent launched with fixed code
  • Comprehensive review completed covering:
    • PySpark best practices adherence
    • Join logic correctness
    • Schema alignment validation
    • Business rule implementation accuracy
    • Code quality and standards compliance
    • Security vulnerabilities (none found)
    • Performance optimization opportunities addressed

Phase 3: Iterative Refinement (MANDATORY UNTIL 100% SATISFIED)

  • All code-reviewer feedback items addressed by pyspark-engineer
  • Re-review completed by code-reviewer
  • Iteration cycle repeated until code-reviewer explicitly confirms:
    • "✓ 100% SATISFIED - No further changes required"
    • Zero remaining issues, warnings, or concerns
    • All quality gates pass
    • All business rules validated
    • Code meets production-ready standards

Final Approval

  • code-reviewer has explicitly confirmed 100% satisfaction
  • No outstanding issues or concerns remain
  • Task is complete and ready for production deployment