# PySpark Error Fixing Command

## Objective
Execute `make gold_table` and systematically fix all errors encountered in the PySpark gold layer file using specialized agents. Errors may be code-based (syntax, type, runtime) or logical (incorrect joins, missing data, business rule violations).

## Agent Workflow (MANDATORY)

### Phase 1: Error Fixing with pyspark-engineer
**CRITICAL**: All PySpark error fixing MUST be performed by the `pyspark-engineer` agent. Do NOT attempt to fix errors directly.

1. Launch the `pyspark-engineer` agent with:
   - Full error stack trace and context
   - Target file path
   - All relevant schema information from MCP server
   - Data dictionary references

2. The pyspark-engineer will:
   - Validate MCP server connectivity
   - Query schemas and foreign key relationships
   - Analyze and fix all errors systematically
   - Apply fixes following project coding standards
   - Run quality gates (py_compile, ruff check, ruff format)

### Phase 2: Code Review with code-reviewer
**CRITICAL**: After pyspark-engineer completes fixes, MUST launch the `code-reviewer` agent.

1. Launch the `code-reviewer` agent with:
   - Path to the fixed file(s)
   - Context: "PySpark gold layer error fixes"
   - Request comprehensive review focusing on:
     - PySpark best practices
     - Join logic correctness
     - Schema alignment
     - Business rule implementation
     - Code quality and standards adherence

2. The code-reviewer will provide:
   - Detailed feedback on all issues found
   - Security vulnerabilities
   - Performance optimization opportunities
   - Code quality improvements needed

### Phase 3: Iterative Refinement (MANDATORY LOOP)
**CRITICAL**: The review-refactor cycle MUST continue until code-reviewer is 100% satisfied.

1. If code-reviewer identifies ANY issues:
   - Launch pyspark-engineer again with code-reviewer's feedback
   - pyspark-engineer implements all recommended changes
   - Launch code-reviewer again to re-validate

2. Repeat Phase 1 → Phase 2 → Phase 3 until:
   - code-reviewer explicitly states: "✓ 100% SATISFIED - No further changes required"
   - Zero issues, warnings, or concerns remain
   - All quality gates pass
   - All business rules validated

3. Only then is the error fixing task complete.

**DO NOT PROCEED TO COMPLETION** until code-reviewer gives explicit 100% satisfaction confirmation.

## Pre-Execution Requirements

### 1. Python Coding Standards (CRITICAL - READ FIRST)
**MANDATORY**: All code MUST follow `.claude/rules/python_rules.md` standards:
- **Line 19**: Use the DataFrame API, not Spark SQL
- **Line 20**: Do NOT use DataFrame aliases (e.g., `.alias("l")`) or the `col()` function - use direct string references or `df["column"]` syntax
- **Line 8**: Limit line length to 240 characters
- **Lines 9-10**: Single line per statement, no carriage returns mid-statement
- **Lines 10 and 12**: No blank lines inside functions
- **Line 11**: Close parentheses on the last line of code
- **Line 5**: Use type hints for all function parameters and return values
- **Line 18**: Import statements only at the start of the file, never inside functions
- **Line 16**: Run `ruff check` and `ruff format` before finalizing
- Import only the necessary PySpark functions, e.g. `from pyspark.sql.functions import when, coalesce, lit` (no `col()` usage - use direct references instead)
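
For orientation, here is a minimal sketch of a function written to these standards; the table and column names are placeholders, not the real gold layer code:

```python
# Hypothetical example - table and column names are illustrative placeholders
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import coalesce, lit


def build_example_output(spark: SparkSession, source_table: str) -> DataFrame:
    df = spark.read.table(source_table)
    df = df.withColumn("amount_filled", coalesce(df["amount"], lit(0)))
    return df.select("example_id", "amount_filled")
```

Note the direct `df["column"]` references, the single statement per line, and the absence of blank lines inside the function.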

### 2. Identify Target File
- Default target: `python_files/gold/<INSERT FILE NAME>.py`
- Override via Makefile: `G_RUN_FILE_NAME` variable (line 63)
- Verify file exists before execution

### 3. Environment Context
- **Runtime Environment**: Local development (not Azure Synapse)
- **Working Directory**: `/workspaces/unify_2_1_dm_synapse_env_d10`
- **Python Version**: 3.11+
- **Spark Mode**: Local cluster (`local[*]`)
- **Data Location**: `/workspaces/data` (parquet files)
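
As a rough illustration of this environment, a local session could be created along the following lines; the builder options and parquet folder layout are assumptions, since the project's real session setup lives in `session_optimiser.py`:

```python
# Illustrative only - the project's actual session configuration is in python_files/utilities/session_optimiser.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_layer_local").master("local[*]").getOrCreate()
# Data is read from local parquet files rather than Azure Synapse; the exact folder layout is assumed here
df = spark.read.parquet("/workspaces/data/silver_cms")
```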

### 4. Available Resources
- **Data Dictionary**: `.claude/data_dictionary/*.md` - schema definitions for all CMS, FVMS, NicheRMS tables
- **Configuration**: `configuration.yaml` - database lists, null replacements, Azure settings
- **MCP Schema Server**: `mcp-server-motherduck` - live schema access via MCP (REQUIRED for schema verification)
- **Utilities Module**: `python_files/utilities/session_optimiser.py` - TableUtilities, NotebookLogger, decorators
- **Example Files**: Other `python_files/gold/g_*.py` files for reference patterns

### 5. MCP Server Validation (CRITICAL)
**BEFORE PROCEEDING**, verify MCP server connectivity:

1. **Test MCP Server Connection**:
   - Attempt to query any known table schema via MCP
   - Example test: Query schema for a common table (e.g., `silver_cms.s_cms_offence_report`)

2. **Validation Criteria**:
   - MCP server must respond with valid schema data
   - Schema must include column names, data types, and nullability
   - Response must be recent (not cached/stale data)

3. **Failure Handling**:
```
⚠️ STOP: MCP Server Not Available

The MCP server (mcp-server-motherduck) is not responding or not providing valid schema data.

This command requires live schema access to:
- Verify column names and data types
- Validate join key compatibility
- Check foreign key relationships
- Ensure accurate schema matching

Actions Required:
1. Check MCP server status and configuration
2. Verify MotherDuck connection credentials
3. Ensure schema database is accessible
4. Restart MCP server if necessary

Cannot proceed with error fixing without verified schema access.
Use data dictionary files as fallback, but warn user of potential schema drift.
```

4. **Success Confirmation**:
```
✓ MCP Server Connected
✓ Schema data available
✓ Proceeding with error fixing workflow
```

## Error Detection Strategy

### Phase 1: Execute and Capture Errors
1. Run: `make gold_table`
2. Capture the full stack trace including:
   - Error type (AttributeError, KeyError, AnalysisException, etc.)
   - Line number and function name
   - Failed DataFrame operation
   - Column names involved
   - Join conditions if applicable
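
If the output needs to be captured programmatically rather than read from the terminal, a sketch along these lines would work; it is illustrative only, since the agent normally runs the Makefile target directly:

```python
# Illustrative capture of the make output so the stack trace can be parsed afterwards
import subprocess

result = subprocess.run(["make", "gold_table"], capture_output=True, text=True)
if result.returncode != 0:
    # The PySpark stack trace usually arrives on stderr; keep stdout as well for context
    print(result.stdout)
    print(result.stderr)
```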

### Phase 2: Categorize Error Types

#### A. Code-Based Errors

**Syntax/Import Errors**
- Missing imports from `pyspark.sql.functions`
- Incorrect function signatures
- Type hint violations
- Decorator usage errors

**Runtime Errors**
- `AnalysisException`: Column not found, table doesn't exist
- `AttributeError`: Calling non-existent DataFrame methods
- `KeyError`: Dictionary access failures
- `TypeError`: Incompatible data types in operations

**DataFrame Schema Errors**
- Column name mismatches (case sensitivity)
- Duplicate column names after joins
- Missing required columns for downstream operations
- Incorrect column aliases

#### B. Logical Errors

**Join Issues**
- **Incorrect Join Keys**: Joining on wrong columns (e.g., `offence_report_id` vs `cms_offence_report_id`)
- **Missing Table Aliases**: Ambiguous column references after joins
- **Wrong Join Types**: Using `inner` when `left` is required (or vice versa)
- **Cartesian Products**: Missing join conditions causing data explosion
- **Broadcast Misuse**: Not using `broadcast()` for small dimension tables
- **Duplicate Join Keys**: Multiple rows with same key causing row multiplication

**Aggregation Problems**
- Incorrect `groupBy()` columns
- Missing aggregation functions (`first()`, `last()`, `collect_list()`)
- Wrong window specifications
- Aggregating on nullable columns without `coalesce()`

**Business Rule Violations**
- Incorrect date/time logic (e.g., using `reported_date_time` when `date_created` should be fallback)
- Missing null handling for critical fields
- Status code logic errors
- Incorrect coalesce order

**Data Quality Issues**
- Expected vs actual row counts (use `logger.info(f"Expected X rows, got {df.count()}")`)
- Null propagation in critical columns
- Duplicate records not being handled
- Missing deduplication logic
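
For the last two items, a deduplication sketch using a window function is shown below; the partition and ordering columns are placeholders for whatever keys the target table actually uses:

```python
# Illustrative deduplication - keeps the most recent record per business key
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window_spec = Window.partitionBy("cms_offence_report_id").orderBy(df["date_created"].desc())
df_with_rank = df.withColumn("row_num", row_number().over(window_spec))
deduplicated_df = df_with_rank.filter(df_with_rank["row_num"] == 1).drop("row_num")
```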

## Systematic Debugging Process

### Step 1: Schema Verification
For each source table mentioned in the error:

1. **PRIMARY: Query MCP Server for Schema** (MANDATORY FIRST STEP):
   - Use MCP tools to query table schema from MotherDuck
   - Extract column names, data types, nullability, and constraints
   - Verify foreign key relationships for join operations
   - Cross-reference with error column names

**Example MCP Query Pattern**:
```
Query: "Get schema for table silver_cms.s_cms_offence_report"
Expected Response: Column list with types and constraints
```

**If MCP Server Fails**:
- STOP and warn user (see Section 5: MCP Server Validation)
- Do NOT proceed with fixing without schema verification
- Suggest user check MCP server configuration

2. **SECONDARY: Verify Schema Using Data Dictionary** (as supplementary reference):
   - Read `.claude/data_dictionary/{source}_{table}.md`
   - Compare MCP schema vs data dictionary for consistency
   - Note any schema drift or discrepancies
   - Alert user if schemas don't match

3. **Check Table Existence**:
```python
spark.sql("SHOW TABLES IN silver_cms").show()
```

4. **Inspect Actual Runtime Schema** (validate MCP data):
```python
df = spark.read.table("silver_cms.s_cms_offence_report")
df.printSchema()
df.select(df.columns[:10]).show(5, truncate=False)
```

**Compare**:
- MCP schema vs Spark runtime schema
- Report any mismatches to user
- Use runtime schema as source of truth if conflicts exist

5. **Use DuckDB Schema** (if available, as additional validation):
   - Query schema.db for column definitions
   - Check foreign key relationships
   - Validate join key data types
   - Triangulate: MCP + DuckDB + Data Dictionary should align
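
As a concrete way to triangulate, one might diff the runtime columns against the column list returned by MCP; the `expected_columns` list below is a stand-in for the real MCP response, and `logger` is assumed to be a `NotebookLogger` instance already in scope:

```python
# Illustrative schema comparison - expected_columns stands in for the column list returned by MCP
df = spark.read.table("silver_cms.s_cms_offence_report")
expected_columns = ["cms_offence_report_id", "case_file_id", "reported_date_time", "date_created"]
missing_columns = [c for c in expected_columns if c not in df.columns]
extra_columns = [c for c in df.columns if c not in expected_columns]
logger.info(f"Columns missing at runtime: {missing_columns}; columns not in MCP schema: {extra_columns}")
```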

### Step 2: Join Logic Validation

For each join operation:

1. **Use MCP Server to Validate Join Relationships**:
   - Query foreign key constraints from the MCP schema server
   - Identify correct join column names and data types
   - Verify parent-child table relationships
   - Confirm join key nullability (affects join results)

**Example MCP Queries**:
```
Query: "Show foreign keys for table silver_cms.s_cms_offence_report"
Query: "What columns link s_cms_offence_report to s_cms_case_file?"
Query: "Get data type for column cms_offence_report_id in silver_cms.s_cms_offence_report"
```

**If MCP Returns No Foreign Keys**:
- Fall back to data dictionary documentation
- Check `.claude/data_dictionary/` for relationship diagrams
- Manually verify join logic with a business analyst

2. **Verify Join Keys Exist** (using MCP-confirmed column names):
```python
left_df.select("join_key_column").show(5)
right_df.select("join_key_column").show(5)
```

3. **Check Join Key Data Type Compatibility** (cross-reference with MCP schema):
```python
# Verify types match MCP schema expectations
print(left_df.select("join_key_column").dtypes)
print(right_df.select("join_key_column").dtypes)
```

4. **Check Join Key Uniqueness**:
```python
left_df.groupBy("join_key_column").count().filter("count > 1").show()
```

5. **Validate Join Type**:
   - `left`: Keep all left records (most common for fact-to-dimension)
   - `inner`: Only matching records
   - Use `broadcast()` for small lookup tables (< 10MB) - see the sketch after this list
   - Confirm join type matches MCP foreign key relationship (nullable FK → left join)
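
A minimal broadcast join sketch, with placeholder DataFrame and column names:

```python
# Illustrative broadcast join - fact_df and small_dim_df are placeholders
from pyspark.sql.functions import broadcast

# Broadcasting the small lookup table avoids shuffling the large fact table
joined_df = fact_df.join(broadcast(small_dim_df), on="status_code", how="left")
```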

6. **Handle Ambiguous Columns**:
```python
# BEFORE (causes ambiguity if both tables have same column names)
joined_df = left_df.join(right_df, on="common_id", how="left")

# AFTER (select specific columns to avoid ambiguity)
left_cols = [c for c in left_df.columns]
right_cols = ["dimension_field"]
joined_df = left_df.join(right_df, on="common_id", how="left").select(left_cols + right_cols)
```

### Step 3: Aggregation Verification

1. **Check groupBy Columns**:
   - Must include all columns not being aggregated
   - Verify columns exist in DataFrame

2. **Validate Aggregation Functions**:
```python
from pyspark.sql.functions import min, max, first, count, sum, coalesce, lit

aggregated = df.groupBy("key").agg(min("date_column").alias("earliest_date"), max("date_column").alias("latest_date"), first("dimension_column", ignorenulls=True).alias("dimension"), count("*").alias("record_count"), coalesce(sum("amount"), lit(0)).alias("total_amount"))
```

3. **Test Aggregation Logic**:
   - Run aggregation on a small sample
   - Compare counts before/after
   - Check for unexpected nulls
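
For the before/after comparison, a quick sanity check like the following can be logged; it assumes the `aggregated` DataFrame from the example above and a `NotebookLogger` instance named `logger`:

```python
# Illustrative count check - expect one output row per distinct key, with no keys lost
distinct_keys = df.select("key").distinct().count()
aggregated_rows = aggregated.count()
logger.info(f"Distinct keys before aggregation: {distinct_keys}, rows after aggregation: {aggregated_rows}")
```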

### Step 4: Business Rule Testing

1. **Verify Timestamp Logic**:
```python
from pyspark.sql.functions import when

df.select("reported_date_time", "date_created", when(df["reported_date_time"].isNotNull(), df["reported_date_time"]).otherwise(df["date_created"]).alias("final_timestamp")).show(10)
```

2. **Test Null Handling**:
```python
from pyspark.sql.functions import coalesce, lit

df.select("primary_field", "fallback_field", coalesce(df["primary_field"], df["fallback_field"], lit(0)).alias("result")).show(10)
```

3. **Validate Status/Lookup Logic**:
   - Check status code mappings against the data dictionary
   - Verify conditional logic matches business requirements
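
A status mapping sketch is shown below; the codes and labels are invented placeholders, so the real values must be confirmed against the data dictionary:

```python
# Hypothetical status mapping - codes and descriptions are placeholders, not real CMS values
from pyspark.sql.functions import when

df = df.withColumn("status_description", when(df["status_code"] == "OP", "Open").when(df["status_code"] == "CL", "Closed").otherwise("Unknown"))
```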

## Common Error Patterns and Fixes

### Pattern 1: Column Not Found After Join
**Error**: `AnalysisException: Column 'offence_report_id' not found`

**Root Cause**: Incorrect column name - verify the column exists using the MCP schema

**Fix**:
```python
# BEFORE - wrong column name
df = left_df.join(right_df, on="offence_report_id", how="left")

# AFTER - MCP-verified correct column name
df = left_df.join(right_df, on="cms_offence_report_id", how="left")

# If joining on different column names between tables:
df = left_df.join(right_df, left_df["cms_offence_report_id"] == right_df["offence_report_id"], how="left")
```

### Pattern 2: Duplicate Column Names
**Error**: Multiple columns with the same name causing selection issues

**Fix**:
```python
# BEFORE - causes duplicate 'id' column
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left")

# AFTER - drop the right-hand duplicate after the join
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left").drop(right_df["id"])

# OR - rename columns to avoid duplicates
right_df_renamed = right_df.withColumnRenamed("id", "related_id")
joined = left_df.join(right_df_renamed, left_df["id"] == right_df_renamed["related_id"], how="left")
```

### Pattern 3: Incorrect Aggregation
**Error**: Column not in GROUP BY causing aggregation failure

**Fix**:
```python
from pyspark.sql.functions import min, first

# BEFORE - non-aggregated column not in groupBy
df.groupBy("key1").agg(min("date_field"), "non_aggregated_field")

# AFTER - all non-grouped columns must be aggregated
df = df.groupBy("key1").agg(min("date_field").alias("min_date"), first("non_aggregated_field", ignorenulls=True).alias("non_aggregated_field"))
```

### Pattern 4: Join Key Mismatch
**Error**: No matching records or unexpected cartesian product

**Fix**:
```python
# Inspect sample key values on both sides of the join
left_df.select("join_key").show(20)
right_df.select("join_key").show(20)
# Compare data types - a string vs integer mismatch silently produces no matches
print(left_df.select("join_key").dtypes)
print(right_df.select("join_key").dtypes)
# Count null join keys - nulls never match in an equality join
print(left_df.filter(left_df["join_key"].isNull()).count())
print(right_df.filter(right_df["join_key"].isNull()).count())
# Cast both sides to a common type before joining
result = left_df.join(right_df, left_df["join_key"].cast("int") == right_df["join_key"].cast("int"), how="left")
```

### Pattern 5: Missing Null Handling
**Error**: Unexpected nulls propagating through transformations

**Fix**:
```python
from pyspark.sql.functions import coalesce, lit

# BEFORE - NULL if either field is NULL
df = df.withColumn("result", df["field1"] + df["field2"])

# AFTER - handle nulls with coalesce
df = df.withColumn("result", coalesce(df["field1"], lit(0)) + coalesce(df["field2"], lit(0)))
```

## Validation Requirements

After fixing errors, validate the following (a combined check sketch follows this list):

1. **Row Counts**: Log and verify expected vs actual counts at each transformation
2. **Schema**: Ensure the output schema matches the target table requirements
3. **Nulls**: Check critical columns for unexpected nulls
4. **Duplicates**: Verify uniqueness of ID columns
5. **Data Ranges**: Check timestamp ranges and numeric bounds
6. **Join Results**: Sample joined records to verify correctness
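
A sketch of a combined validation helper is shown below; the function name and the `logger` instance (`NotebookLogger`) are assumptions for illustration, not part of the project utilities:

```python
# Hypothetical validation helper - assumes a NotebookLogger instance named logger is in scope
from pyspark.sql import DataFrame


def validate_gold_output(df: DataFrame, id_column: str, critical_columns: list[str]) -> None:
    total_rows = df.count()
    distinct_ids = df.select(id_column).distinct().count()
    logger.info(f"Row count: {total_rows}, distinct {id_column} values: {distinct_ids}")
    if total_rows != distinct_ids:
        logger.error(f"Duplicate {id_column} values detected: {total_rows - distinct_ids} extra rows")
    for column_name in critical_columns:
        null_count = df.filter(df[column_name].isNull()).count()
        logger.info(f"Nulls in {column_name}: {null_count}")
```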

## Logging Requirements

Use `NotebookLogger` throughout:

```python
logger = NotebookLogger()

# Start of operation
logger.info(f"Starting extraction from {table_name}")

# After DataFrame creation
logger.info(f"Extracted {df.count()} records from {table_name}")

# After join
logger.info(f"Join completed: {joined_df.count()} records (expected ~X)")

# After transformation
logger.info(f"Transformation complete: {final_df.count()} records")

# On error
logger.error(f"Failed to process {table_name}: {error_message}")

# On success
logger.success(f"Successfully loaded {target_table_name}")
```

## Quality Gates (Must Run After Fixes)

```bash
# 1. Syntax validation
python3 -m py_compile python_files/gold/g_x_mg_cms_mo.py

# 2. Code quality check
ruff check python_files/gold/g_x_mg_cms_mo.py

# 3. Format code
ruff format python_files/gold/g_x_mg_cms_mo.py

# 4. Run fixed code
make gold_table
```

## Key Principles for PySpark Engineer Agent

1. **CRITICAL: Agent Workflow Required**: ALL error fixing must follow the 3-phase agent workflow (pyspark-engineer → code-reviewer → iterative refinement until 100% satisfied)
2. **CRITICAL: Validate MCP Server First**: Before starting, verify MCP server connectivity and schema availability. STOP and warn the user if unavailable.
3. **Always Query MCP Schema First**: Use the MCP server to get authoritative schema data before fixing any errors. Cross-reference with the data dictionary.
4. **Use MCP for Join Validation**: Query foreign key relationships from MCP to ensure correct join logic and column names.
5. **DataFrame API Without Aliases or col()**: Use the DataFrame API (NOT Spark SQL). NO DataFrame aliases. NO `col()` function. Use direct string references (e.g., `"column_name"`) or `df["column"]` syntax (e.g., `df["column_name"]`). Import only needed functions (e.g., `from pyspark.sql.functions import when, coalesce`)
6. **Test Incrementally**: Fix one error at a time, validate, then proceed
7. **Log Everything**: Add logging at every transformation step
8. **Handle Nulls**: Always consider null cases in business logic (check MCP nullability constraints)
9. **Verify Join Logic**: Check join keys, types, and uniqueness before implementing (use MCP data types)
10. **Use Utilities**: Leverage `TableUtilities` methods (add_row_hash, save_as_table, clean_date_time_columns)
11. **Follow Patterns**: Reference working gold layer files for established patterns
12. **Validate Business Rules**: Confirm logic with the MCP schema, data dictionary, and user story requirements
13. **Clean Code**: Adhere to project standards (240 char line length, no blank lines in functions, type hints, single line per statement)
14. **Triple-Check Schemas**: When a schema mismatch occurs, verify MCP → Runtime → Data Dictionary alignment and report discrepancies
15. **Code Review Loop**: Continue refactoring until code-reviewer explicitly confirms 100% satisfaction with zero remaining issues

## Example Workflow with MCP Integration

### Scenario: Fixing AnalysisException - Column 'offence_report_id' not found

**Step 1: MCP Server Validation**
```
Agent: Testing MCP server connectivity...
MCP Query: "List databases available"
MCP Response: ✓ [bronze_cms, silver_cms, silver_fvms, gold_data_model, ...]

Agent: ✓ MCP Server Connected
Agent: ✓ Schema data available
Agent: ✓ Proceeding with error fixing workflow
```

**Step 2: Query Schema via MCP**
```
Agent: Querying schema for silver_cms.s_cms_offence_report via MCP...
MCP Query: "DESCRIBE TABLE silver_cms.s_cms_offence_report"
MCP Response:
- cms_offence_report_id (BIGINT, PRIMARY KEY, NOT NULL)
- case_file_id (BIGINT, FOREIGN KEY → s_cms_case_file.cms_case_file_id, NULLABLE)
- reported_date_time (TIMESTAMP, NULLABLE)
- date_created (TIMESTAMP, NOT NULL)
- status_code (VARCHAR, NULLABLE)
... (50+ more columns)

Agent: Found correct column name: 'cms_offence_report_id' (not 'offence_report_id')
```

**Step 3: Verify Join Relationship via MCP**
```
Agent: Checking foreign key relationship for join...
MCP Query: "Show foreign keys from s_cms_offence_report to s_cms_case_file"
MCP Response:
- s_cms_offence_report.case_file_id → s_cms_case_file.cms_case_file_id
- Relationship: MANY-TO-ONE
- Nullability: case_file_id is NULLABLE → Use LEFT JOIN

Agent: ✓ Join should be: LEFT JOIN on case_file_id = cms_case_file_id
```

**Step 4: Apply Fix with MCP-Verified Schema**
```python
# BEFORE (error)
offence_df = spark.read.table("silver_cms.s_cms_offence_report")
case_df = spark.read.table("silver_cms.s_cms_case_file")
joined = offence_df.join(case_df, on="offence_report_id", how="left")

# AFTER (MCP-verified) - Use DataFrame API with correct column names
# MCP-confirmed FK: case_file_id → cms_case_file_id
# MCP-confirmed nullable FK → LEFT JOIN
# MCP-confirmed PK: cms_offence_report_id
case_df_subset = case_df.select("cms_case_file_id", "case_file_number")
joined = offence_df.join(case_df_subset, offence_df["case_file_id"] == case_df_subset["cms_case_file_id"], how="left")
joined = joined.withColumnRenamed("cms_offence_report_id", "offence_report_id")
```

**Step 5: Validate Fix**
```
Agent: Running fixed code...
Agent: ✓ No AnalysisException
Agent: ✓ Join executed successfully
Agent: Row count: 45,823 (expected ~45,000)
Agent: ✓ Schema matches MCP expectations
```

## Success Criteria

### Phase 1: Initial Error Fixing (pyspark-engineer)
- [ ] **MCP Server validated and responding** (MANDATORY FIRST CHECK)
- [ ] Schema verified via MCP server for all source tables
- [ ] Foreign key relationships confirmed via MCP queries
- [ ] All syntax errors resolved
- [ ] All runtime errors fixed
- [ ] Join logic validated and correct (using MCP-confirmed column names and types)
- [ ] DataFrame API used (NOT Spark SQL) per python_rules.md line 19
- [ ] NO DataFrame aliases or col() function used - direct string references or df["column"] syntax only (per python_rules.md line 20)
- [ ] Code follows python_rules.md standards: 240 char lines, no blank lines in functions, single line per statement, imports at top only
- [ ] Row counts logged and reasonable
- [ ] Business rules implemented correctly
- [ ] Output schema matches requirements (cross-referenced with MCP schema)
- [ ] Code passes quality gates (py_compile, ruff check, ruff format)
- [ ] `make gold_table` executes successfully
- [ ] Target table created/updated in `gold_data_model` database
- [ ] No schema drift reported between MCP, Runtime, and Data Dictionary sources

### Phase 2: Code Review (code-reviewer)
- [ ] code-reviewer agent launched with fixed code
- [ ] Comprehensive review completed covering:
  - [ ] PySpark best practices adherence
  - [ ] Join logic correctness
  - [ ] Schema alignment validation
  - [ ] Business rule implementation accuracy
  - [ ] Code quality and standards compliance
- [ ] Security vulnerabilities (none found)
- [ ] Performance optimization opportunities addressed

### Phase 3: Iterative Refinement (MANDATORY UNTIL 100% SATISFIED)
- [ ] All code-reviewer feedback items addressed by pyspark-engineer
- [ ] Re-review completed by code-reviewer
- [ ] Iteration cycle repeated until code-reviewer explicitly confirms:
  - [ ] **"✓ 100% SATISFIED - No further changes required"**
  - [ ] Zero remaining issues, warnings, or concerns
  - [ ] All quality gates pass
  - [ ] All business rules validated
  - [ ] Code meets production-ready standards

### Final Approval
- [ ] **code-reviewer has explicitly confirmed 100% satisfaction**
- [ ] No outstanding issues or concerns remain
- [ ] Task is complete and ready for production deployment