# PySpark Error Fixing Command

## Objective
Execute `make gold_table` and systematically fix all errors encountered in the PySpark gold layer file using specialized agents. Errors may be code-based (syntax, type, runtime) or logical (incorrect joins, missing data, business rule violations).

## Agent Workflow (MANDATORY)

### Phase 1: Error Fixing with pyspark-engineer
**CRITICAL**: All PySpark error fixing MUST be performed by the `pyspark-engineer` agent. Do NOT attempt to fix errors directly.

1. Launch the `pyspark-engineer` agent with:
   - Full error stack trace and context
   - Target file path
   - All relevant schema information from MCP server
   - Data dictionary references

2. The pyspark-engineer will:
   - Validate MCP server connectivity
   - Query schemas and foreign key relationships
   - Analyze and fix all errors systematically
   - Apply fixes following project coding standards
   - Run quality gates (py_compile, ruff check, ruff format)

### Phase 2: Code Review with code-reviewer
**CRITICAL**: After pyspark-engineer completes fixes, MUST launch the `code-reviewer` agent.

1. Launch the `code-reviewer` agent with:
   - Path to the fixed file(s)
   - Context: "PySpark gold layer error fixes"
   - Request comprehensive review focusing on:
     - PySpark best practices
     - Join logic correctness
     - Schema alignment
     - Business rule implementation
     - Code quality and standards adherence

2. The code-reviewer will provide:
   - Detailed feedback on all issues found
   - Security vulnerabilities
   - Performance optimization opportunities
   - Code quality improvements needed

### Phase 3: Iterative Refinement (MANDATORY LOOP)
**CRITICAL**: The review-refactor cycle MUST continue until code-reviewer is 100% satisfied.

1. If code-reviewer identifies ANY issues:
   - Launch pyspark-engineer again with code-reviewer's feedback
   - pyspark-engineer implements all recommended changes
   - Launch code-reviewer again to re-validate

2. Repeat Phase 1 → Phase 2 → Phase 3 until:
   - code-reviewer explicitly states: "✓ 100% SATISFIED - No further changes required"
   - Zero issues, warnings, or concerns remain
   - All quality gates pass
   - All business rules validated

3. Only then is the error fixing task complete.

**DO NOT PROCEED TO COMPLETION** until code-reviewer gives explicit 100% satisfaction confirmation.

## Pre-Execution Requirements

### 1. Python Coding Standards (CRITICAL - READ FIRST)
**MANDATORY**: All code MUST follow `.claude/rules/python_rules.md` standards:
- **Line 19**: Use the DataFrame API, not Spark SQL
- **Line 20**: Do NOT use DataFrame aliases (e.g., `.alias("l")`) or the `col()` function - use direct string references or `df["column"]` syntax
- **Line 8**: Limit line length to 240 characters
- **Lines 9-10**: Single line per statement, no carriage returns mid-statement
- **Lines 10 and 12**: No blank lines inside functions
- **Line 11**: Close parentheses on the last line of code
- **Line 5**: Use type hints for all function parameters and return values
- **Line 18**: Import statements only at the start of the file, never inside functions
- **Line 16**: Run `ruff check` and `ruff format` before finalizing
- Import only the necessary PySpark functions, e.g. `from pyspark.sql.functions import when, coalesce, lit` (no `col()` usage - use direct references instead)
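
For orientation, here is a minimal sketch of a function written to these standards; the table and column names are placeholders, not the real gold layer code:

```python
# Hypothetical example - table and column names are illustrative placeholders
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import coalesce, lit


def build_example_output(spark: SparkSession, source_table: str) -> DataFrame:
    df = spark.read.table(source_table)
    df = df.withColumn("amount_filled", coalesce(df["amount"], lit(0)))
    return df.select("example_id", "amount_filled")
```

Note the direct `df["column"]` references, the single statement per line, and the absence of blank lines inside the function.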

### 2. Identify Target File
- Default target: `python_files/gold/<INSERT FILE NAME>.py`
- Override via Makefile: `G_RUN_FILE_NAME` variable (line 63)
- Verify file exists before execution

### 3. Environment Context
- **Runtime Environment**: Local development (not Azure Synapse)
- **Working Directory**: `/workspaces/unify_2_1_dm_synapse_env_d10`
- **Python Version**: 3.11+
- **Spark Mode**: Local cluster (`local[*]`)
- **Data Location**: `/workspaces/data` (parquet files)
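
As a rough illustration of this environment, a local session could be created along the following lines; the builder options and parquet folder layout are assumptions, since the project's real session setup lives in `session_optimiser.py`:

```python
# Illustrative only - the project's actual session configuration is in python_files/utilities/session_optimiser.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_layer_local").master("local[*]").getOrCreate()
# Data is read from local parquet files rather than Azure Synapse; the exact folder layout is assumed here
df = spark.read.parquet("/workspaces/data/silver_cms")
```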

### 4. Available Resources
- **Data Dictionary**: `.claude/data_dictionary/*.md` - schema definitions for all CMS, FVMS, NicheRMS tables
- **Configuration**: `configuration.yaml` - database lists, null replacements, Azure settings
- **MCP Schema Server**: `mcp-server-motherduck` - live schema access via MCP (REQUIRED for schema verification)
- **Utilities Module**: `python_files/utilities/session_optimiser.py` - TableUtilities, NotebookLogger, decorators
- **Example Files**: Other `python_files/gold/g_*.py` files for reference patterns

### 5. MCP Server Validation (CRITICAL)
**BEFORE PROCEEDING**, verify MCP server connectivity:

1. **Test MCP Server Connection**:
   - Attempt to query any known table schema via MCP
   - Example test: Query schema for a common table (e.g., `silver_cms.s_cms_offence_report`)

2. **Validation Criteria**:
   - MCP server must respond with valid schema data
   - Schema must include column names, data types, and nullability
   - Response must be recent (not cached/stale data)

3. **Failure Handling**:
```
⚠️ STOP: MCP Server Not Available

The MCP server (mcp-server-motherduck) is not responding or not providing valid schema data.

This command requires live schema access to:
- Verify column names and data types
- Validate join key compatibility
- Check foreign key relationships
- Ensure accurate schema matching

Actions Required:
1. Check MCP server status and configuration
2. Verify MotherDuck connection credentials
3. Ensure schema database is accessible
4. Restart MCP server if necessary

Cannot proceed with error fixing without verified schema access.
Use data dictionary files as fallback, but warn user of potential schema drift.
```

4. **Success Confirmation**:
```
✓ MCP Server Connected
✓ Schema data available
✓ Proceeding with error fixing workflow
```

## Error Detection Strategy

### Phase 1: Execute and Capture Errors
1. Run: `make gold_table`
2. Capture the full stack trace including:
   - Error type (AttributeError, KeyError, AnalysisException, etc.)
   - Line number and function name
   - Failed DataFrame operation
   - Column names involved
   - Join conditions if applicable
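
If the output needs to be captured programmatically rather than read from the terminal, a sketch along these lines would work; it is illustrative only, since the agent normally runs the Makefile target directly:

```python
# Illustrative capture of the make output so the stack trace can be parsed afterwards
import subprocess

result = subprocess.run(["make", "gold_table"], capture_output=True, text=True)
if result.returncode != 0:
    # The PySpark stack trace usually arrives on stderr; keep stdout as well for context
    print(result.stdout)
    print(result.stderr)
```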

### Phase 2: Categorize Error Types

#### A. Code-Based Errors

**Syntax/Import Errors**
- Missing imports from `pyspark.sql.functions`
- Incorrect function signatures
- Type hint violations
- Decorator usage errors

**Runtime Errors**
- `AnalysisException`: Column not found, table doesn't exist
- `AttributeError`: Calling non-existent DataFrame methods
- `KeyError`: Dictionary access failures
- `TypeError`: Incompatible data types in operations

**DataFrame Schema Errors**
- Column name mismatches (case sensitivity)
- Duplicate column names after joins
- Missing required columns for downstream operations
- Incorrect column aliases

#### B. Logical Errors

**Join Issues**
- **Incorrect Join Keys**: Joining on wrong columns (e.g., `offence_report_id` vs `cms_offence_report_id`)
- **Missing Table Aliases**: Ambiguous column references after joins
- **Wrong Join Types**: Using `inner` when `left` is required (or vice versa)
- **Cartesian Products**: Missing join conditions causing data explosion
- **Broadcast Misuse**: Not using `broadcast()` for small dimension tables
- **Duplicate Join Keys**: Multiple rows with same key causing row multiplication

**Aggregation Problems**
- Incorrect `groupBy()` columns
- Missing aggregation functions (`first()`, `last()`, `collect_list()`)
- Wrong window specifications
- Aggregating on nullable columns without `coalesce()`

**Business Rule Violations**
- Incorrect date/time logic (e.g., using `reported_date_time` when `date_created` should be fallback)
- Missing null handling for critical fields
- Status code logic errors
- Incorrect coalesce order

**Data Quality Issues**
- Expected vs actual row counts (use `logger.info(f"Expected X rows, got {df.count()}")`)
- Null propagation in critical columns
- Duplicate records not being handled
- Missing deduplication logic
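
For the last two items, a deduplication sketch using a window function is shown below; the partition and ordering columns are placeholders for whatever keys the target table actually uses:

```python
# Illustrative deduplication - keeps the most recent record per business key
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window_spec = Window.partitionBy("cms_offence_report_id").orderBy(df["date_created"].desc())
df_with_rank = df.withColumn("row_num", row_number().over(window_spec))
deduplicated_df = df_with_rank.filter(df_with_rank["row_num"] == 1).drop("row_num")
```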

## Systematic Debugging Process

### Step 1: Schema Verification
For each source table mentioned in the error:

1. **PRIMARY: Query MCP Server for Schema** (MANDATORY FIRST STEP):
   - Use MCP tools to query table schema from MotherDuck
   - Extract column names, data types, nullability, and constraints
   - Verify foreign key relationships for join operations
   - Cross-reference with error column names

**Example MCP Query Pattern**:
```
Query: "Get schema for table silver_cms.s_cms_offence_report"
Expected Response: Column list with types and constraints
```

**If MCP Server Fails**:
- STOP and warn user (see Section 5: MCP Server Validation)
- Do NOT proceed with fixing without schema verification
- Suggest user check MCP server configuration

2. **SECONDARY: Verify Schema Using Data Dictionary** (as supplementary reference):
   - Read `.claude/data_dictionary/{source}_{table}.md`
   - Compare MCP schema vs data dictionary for consistency
   - Note any schema drift or discrepancies
   - Alert user if schemas don't match

3. **Check Table Existence**:
```python
spark.sql("SHOW TABLES IN silver_cms").show()
```

4. **Inspect Actual Runtime Schema** (validate MCP data):
```python
df = spark.read.table("silver_cms.s_cms_offence_report")
df.printSchema()
df.select(df.columns[:10]).show(5, truncate=False)
```

**Compare**:
- MCP schema vs Spark runtime schema
- Report any mismatches to user
- Use runtime schema as source of truth if conflicts exist

5. **Use DuckDB Schema** (if available, as additional validation):
   - Query schema.db for column definitions
   - Check foreign key relationships
   - Validate join key data types
   - Triangulate: MCP + DuckDB + Data Dictionary should align
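
As a concrete way to triangulate, one might diff the runtime columns against the column list returned by MCP; the `expected_columns` list below is a stand-in for the real MCP response, and `logger` is assumed to be a `NotebookLogger` instance already in scope:

```python
# Illustrative schema comparison - expected_columns stands in for the column list returned by MCP
df = spark.read.table("silver_cms.s_cms_offence_report")
expected_columns = ["cms_offence_report_id", "case_file_id", "reported_date_time", "date_created"]
missing_columns = [c for c in expected_columns if c not in df.columns]
extra_columns = [c for c in df.columns if c not in expected_columns]
logger.info(f"Columns missing at runtime: {missing_columns}; columns not in MCP schema: {extra_columns}")
```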

### Step 2: Join Logic Validation

For each join operation:

1. **Use MCP Server to Validate Join Relationships**:
   - Query foreign key constraints from the MCP schema server
   - Identify correct join column names and data types
   - Verify parent-child table relationships
   - Confirm join key nullability (affects join results)

**Example MCP Queries**:
```
Query: "Show foreign keys for table silver_cms.s_cms_offence_report"
Query: "What columns link s_cms_offence_report to s_cms_case_file?"
Query: "Get data type for column cms_offence_report_id in silver_cms.s_cms_offence_report"
```

**If MCP Returns No Foreign Keys**:
- Fall back to data dictionary documentation
- Check `.claude/data_dictionary/` for relationship diagrams
- Manually verify join logic with a business analyst

2. **Verify Join Keys Exist** (using MCP-confirmed column names):
```python
left_df.select("join_key_column").show(5)
right_df.select("join_key_column").show(5)
```

3. **Check Join Key Data Type Compatibility** (cross-reference with MCP schema):
```python
# Verify types match MCP schema expectations
print(left_df.select("join_key_column").dtypes)
print(right_df.select("join_key_column").dtypes)
```

4. **Check Join Key Uniqueness**:
```python
left_df.groupBy("join_key_column").count().filter("count > 1").show()
```

5. **Validate Join Type**:
   - `left`: Keep all left records (most common for fact-to-dimension)
   - `inner`: Only matching records
   - Use `broadcast()` for small lookup tables (< 10MB) - see the sketch after this list
   - Confirm join type matches MCP foreign key relationship (nullable FK → left join)
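
A minimal broadcast join sketch, with placeholder DataFrame and column names:

```python
# Illustrative broadcast join - fact_df and small_dim_df are placeholders
from pyspark.sql.functions import broadcast

# Broadcasting the small lookup table avoids shuffling the large fact table
joined_df = fact_df.join(broadcast(small_dim_df), on="status_code", how="left")
```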

6. **Handle Ambiguous Columns**:
```python
# BEFORE (causes ambiguity if both tables have same column names)
joined_df = left_df.join(right_df, on="common_id", how="left")

# AFTER (select specific columns to avoid ambiguity)
left_cols = [c for c in left_df.columns]
right_cols = ["dimension_field"]
joined_df = left_df.join(right_df, on="common_id", how="left").select(left_cols + right_cols)
```

### Step 3: Aggregation Verification

1. **Check groupBy Columns**:
   - Must include all columns not being aggregated
   - Verify columns exist in DataFrame

2. **Validate Aggregation Functions**:
```python
from pyspark.sql.functions import min, max, first, count, sum, coalesce, lit

aggregated = df.groupBy("key").agg(min("date_column").alias("earliest_date"), max("date_column").alias("latest_date"), first("dimension_column", ignorenulls=True).alias("dimension"), count("*").alias("record_count"), coalesce(sum("amount"), lit(0)).alias("total_amount"))
```

3. **Test Aggregation Logic**:
   - Run aggregation on a small sample
   - Compare counts before/after
   - Check for unexpected nulls
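
For the before/after comparison, a quick sanity check like the following can be logged; it assumes the `aggregated` DataFrame from the example above and a `NotebookLogger` instance named `logger`:

```python
# Illustrative count check - expect one output row per distinct key, with no keys lost
distinct_keys = df.select("key").distinct().count()
aggregated_rows = aggregated.count()
logger.info(f"Distinct keys before aggregation: {distinct_keys}, rows after aggregation: {aggregated_rows}")
```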

### Step 4: Business Rule Testing

1. **Verify Timestamp Logic**:
```python
from pyspark.sql.functions import when

df.select("reported_date_time", "date_created", when(df["reported_date_time"].isNotNull(), df["reported_date_time"]).otherwise(df["date_created"]).alias("final_timestamp")).show(10)
```

2. **Test Null Handling**:
```python
from pyspark.sql.functions import coalesce, lit

df.select("primary_field", "fallback_field", coalesce(df["primary_field"], df["fallback_field"], lit(0)).alias("result")).show(10)
```

3. **Validate Status/Lookup Logic**:
   - Check status code mappings against the data dictionary
   - Verify conditional logic matches business requirements
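
A status mapping sketch is shown below; the codes and labels are invented placeholders, so the real values must be confirmed against the data dictionary:

```python
# Hypothetical status mapping - codes and descriptions are placeholders, not real CMS values
from pyspark.sql.functions import when

df = df.withColumn("status_description", when(df["status_code"] == "OP", "Open").when(df["status_code"] == "CL", "Closed").otherwise("Unknown"))
```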

## Common Error Patterns and Fixes

### Pattern 1: Column Not Found After Join
**Error**: `AnalysisException: Column 'offence_report_id' not found`

**Root Cause**: Incorrect column name - verify the column exists using the MCP schema

**Fix**:
```python
# BEFORE - wrong column name
df = left_df.join(right_df, on="offence_report_id", how="left")

# AFTER - MCP-verified correct column name
df = left_df.join(right_df, on="cms_offence_report_id", how="left")

# If joining on different column names between tables:
df = left_df.join(right_df, left_df["cms_offence_report_id"] == right_df["offence_report_id"], how="left")
```

### Pattern 2: Duplicate Column Names
**Error**: Multiple columns with the same name causing selection issues

**Fix**:
```python
# BEFORE - causes duplicate 'id' column
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left")

# AFTER - drop the right-hand duplicate after the join
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left").drop(right_df["id"])

# OR - rename columns to avoid duplicates
right_df_renamed = right_df.withColumnRenamed("id", "related_id")
joined = left_df.join(right_df_renamed, left_df["id"] == right_df_renamed["related_id"], how="left")
```

### Pattern 3: Incorrect Aggregation
**Error**: Column not in GROUP BY causing aggregation failure

**Fix**:
```python
from pyspark.sql.functions import min, first

# BEFORE - non-aggregated column not in groupBy
df.groupBy("key1").agg(min("date_field"), "non_aggregated_field")

# AFTER - all non-grouped columns must be aggregated
df = df.groupBy("key1").agg(min("date_field").alias("min_date"), first("non_aggregated_field", ignorenulls=True).alias("non_aggregated_field"))
```

### Pattern 4: Join Key Mismatch
**Error**: No matching records or unexpected cartesian product

**Fix**:
```python
# Inspect sample key values on both sides of the join
left_df.select("join_key").show(20)
right_df.select("join_key").show(20)
# Compare data types - a string vs integer mismatch silently produces no matches
print(left_df.select("join_key").dtypes)
print(right_df.select("join_key").dtypes)
# Count null join keys - nulls never match in an equality join
print(left_df.filter(left_df["join_key"].isNull()).count())
print(right_df.filter(right_df["join_key"].isNull()).count())
# Cast both sides to a common type before joining
result = left_df.join(right_df, left_df["join_key"].cast("int") == right_df["join_key"].cast("int"), how="left")
```

### Pattern 5: Missing Null Handling
**Error**: Unexpected nulls propagating through transformations

**Fix**:
```python
from pyspark.sql.functions import coalesce, lit

# BEFORE - NULL if either field is NULL
df = df.withColumn("result", df["field1"] + df["field2"])

# AFTER - handle nulls with coalesce
df = df.withColumn("result", coalesce(df["field1"], lit(0)) + coalesce(df["field2"], lit(0)))
```

## Validation Requirements

After fixing errors, validate the following (a combined check sketch follows this list):

1. **Row Counts**: Log and verify expected vs actual counts at each transformation
2. **Schema**: Ensure the output schema matches the target table requirements
3. **Nulls**: Check critical columns for unexpected nulls
4. **Duplicates**: Verify uniqueness of ID columns
5. **Data Ranges**: Check timestamp ranges and numeric bounds
6. **Join Results**: Sample joined records to verify correctness
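
A sketch of a combined validation helper is shown below; the function name and the `logger` instance (`NotebookLogger`) are assumptions for illustration, not part of the project utilities:

```python
# Hypothetical validation helper - assumes a NotebookLogger instance named logger is in scope
from pyspark.sql import DataFrame


def validate_gold_output(df: DataFrame, id_column: str, critical_columns: list[str]) -> None:
    total_rows = df.count()
    distinct_ids = df.select(id_column).distinct().count()
    logger.info(f"Row count: {total_rows}, distinct {id_column} values: {distinct_ids}")
    if total_rows != distinct_ids:
        logger.error(f"Duplicate {id_column} values detected: {total_rows - distinct_ids} extra rows")
    for column_name in critical_columns:
        null_count = df.filter(df[column_name].isNull()).count()
        logger.info(f"Nulls in {column_name}: {null_count}")
```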

## Logging Requirements

Use `NotebookLogger` throughout:

```python
logger = NotebookLogger()

# Start of operation
logger.info(f"Starting extraction from {table_name}")

# After DataFrame creation
logger.info(f"Extracted {df.count()} records from {table_name}")

# After join
logger.info(f"Join completed: {joined_df.count()} records (expected ~X)")

# After transformation
logger.info(f"Transformation complete: {final_df.count()} records")

# On error
logger.error(f"Failed to process {table_name}: {error_message}")

# On success
logger.success(f"Successfully loaded {target_table_name}")
```

## Quality Gates (Must Run After Fixes)

```bash
# 1. Syntax validation
python3 -m py_compile python_files/gold/g_x_mg_cms_mo.py

# 2. Code quality check
ruff check python_files/gold/g_x_mg_cms_mo.py

# 3. Format code
ruff format python_files/gold/g_x_mg_cms_mo.py

# 4. Run fixed code
make gold_table
```

## Key Principles for PySpark Engineer Agent

1. **CRITICAL: Agent Workflow Required**: ALL error fixing must follow the 3-phase agent workflow (pyspark-engineer → code-reviewer → iterative refinement until 100% satisfied)
2. **CRITICAL: Validate MCP Server First**: Before starting, verify MCP server connectivity and schema availability. STOP and warn the user if unavailable.
3. **Always Query MCP Schema First**: Use the MCP server to get authoritative schema data before fixing any errors. Cross-reference with the data dictionary.
4. **Use MCP for Join Validation**: Query foreign key relationships from MCP to ensure correct join logic and column names.
5. **DataFrame API Without Aliases or col()**: Use the DataFrame API (NOT Spark SQL). NO DataFrame aliases. NO `col()` function. Use direct string references (e.g., `"column_name"`) or `df["column"]` syntax (e.g., `df["column_name"]`). Import only needed functions (e.g., `from pyspark.sql.functions import when, coalesce`)
6. **Test Incrementally**: Fix one error at a time, validate, then proceed
7. **Log Everything**: Add logging at every transformation step
8. **Handle Nulls**: Always consider null cases in business logic (check MCP nullability constraints)
9. **Verify Join Logic**: Check join keys, types, and uniqueness before implementing (use MCP data types)
10. **Use Utilities**: Leverage `TableUtilities` methods (add_row_hash, save_as_table, clean_date_time_columns)
11. **Follow Patterns**: Reference working gold layer files for established patterns
12. **Validate Business Rules**: Confirm logic with the MCP schema, data dictionary, and user story requirements
13. **Clean Code**: Adhere to project standards (240 char line length, no blank lines in functions, type hints, single line per statement)
14. **Triple-Check Schemas**: When a schema mismatch occurs, verify MCP → Runtime → Data Dictionary alignment and report discrepancies
15. **Code Review Loop**: Continue refactoring until code-reviewer explicitly confirms 100% satisfaction with zero remaining issues

## Example Workflow with MCP Integration

### Scenario: Fixing AnalysisException - Column 'offence_report_id' not found

**Step 1: MCP Server Validation**
```
Agent: Testing MCP server connectivity...
MCP Query: "List databases available"
MCP Response: ✓ [bronze_cms, silver_cms, silver_fvms, gold_data_model, ...]

Agent: ✓ MCP Server Connected
Agent: ✓ Schema data available
Agent: ✓ Proceeding with error fixing workflow
```

**Step 2: Query Schema via MCP**
```
Agent: Querying schema for silver_cms.s_cms_offence_report via MCP...
MCP Query: "DESCRIBE TABLE silver_cms.s_cms_offence_report"
MCP Response:
- cms_offence_report_id (BIGINT, PRIMARY KEY, NOT NULL)
- case_file_id (BIGINT, FOREIGN KEY → s_cms_case_file.cms_case_file_id, NULLABLE)
- reported_date_time (TIMESTAMP, NULLABLE)
- date_created (TIMESTAMP, NOT NULL)
- status_code (VARCHAR, NULLABLE)
... (50+ more columns)

Agent: Found correct column name: 'cms_offence_report_id' (not 'offence_report_id')
```

**Step 3: Verify Join Relationship via MCP**
```
Agent: Checking foreign key relationship for join...
MCP Query: "Show foreign keys from s_cms_offence_report to s_cms_case_file"
MCP Response:
- s_cms_offence_report.case_file_id → s_cms_case_file.cms_case_file_id
- Relationship: MANY-TO-ONE
- Nullability: case_file_id is NULLABLE → Use LEFT JOIN

Agent: ✓ Join should be: LEFT JOIN on case_file_id = cms_case_file_id
```

**Step 4: Apply Fix with MCP-Verified Schema**
```python
# BEFORE (error)
offence_df = spark.read.table("silver_cms.s_cms_offence_report")
case_df = spark.read.table("silver_cms.s_cms_case_file")
joined = offence_df.join(case_df, on="offence_report_id", how="left")

# AFTER (MCP-verified) - Use DataFrame API with correct column names
# MCP-confirmed FK: case_file_id → cms_case_file_id
# MCP-confirmed nullable FK → LEFT JOIN
# MCP-confirmed PK: cms_offence_report_id
case_df_subset = case_df.select("cms_case_file_id", "case_file_number")
joined = offence_df.join(case_df_subset, offence_df["case_file_id"] == case_df_subset["cms_case_file_id"], how="left")
joined = joined.withColumnRenamed("cms_offence_report_id", "offence_report_id")
```

**Step 5: Validate Fix**
```
Agent: Running fixed code...
Agent: ✓ No AnalysisException
Agent: ✓ Join executed successfully
Agent: Row count: 45,823 (expected ~45,000)
Agent: ✓ Schema matches MCP expectations
```

## Success Criteria

### Phase 1: Initial Error Fixing (pyspark-engineer)
- [ ] **MCP Server validated and responding** (MANDATORY FIRST CHECK)
- [ ] Schema verified via MCP server for all source tables
- [ ] Foreign key relationships confirmed via MCP queries
- [ ] All syntax errors resolved
- [ ] All runtime errors fixed
- [ ] Join logic validated and correct (using MCP-confirmed column names and types)
- [ ] DataFrame API used (NOT Spark SQL) per python_rules.md line 19
- [ ] NO DataFrame aliases or col() function used - direct string references or df["column"] syntax only (per python_rules.md line 20)
- [ ] Code follows python_rules.md standards: 240 char lines, no blank lines in functions, single line per statement, imports at top only
- [ ] Row counts logged and reasonable
- [ ] Business rules implemented correctly
- [ ] Output schema matches requirements (cross-referenced with MCP schema)
- [ ] Code passes quality gates (py_compile, ruff check, ruff format)
- [ ] `make gold_table` executes successfully
- [ ] Target table created/updated in `gold_data_model` database
- [ ] No schema drift reported between MCP, Runtime, and Data Dictionary sources

### Phase 2: Code Review (code-reviewer)
- [ ] code-reviewer agent launched with fixed code
- [ ] Comprehensive review completed covering:
  - [ ] PySpark best practices adherence
  - [ ] Join logic correctness
  - [ ] Schema alignment validation
  - [ ] Business rule implementation accuracy
  - [ ] Code quality and standards compliance
- [ ] Security vulnerabilities (none found)
- [ ] Performance optimization opportunities addressed

### Phase 3: Iterative Refinement (MANDATORY UNTIL 100% SATISFIED)
- [ ] All code-reviewer feedback items addressed by pyspark-engineer
- [ ] Re-review completed by code-reviewer
- [ ] Iteration cycle repeated until code-reviewer explicitly confirms:
  - [ ] **"✓ 100% SATISFIED - No further changes required"**
  - [ ] Zero remaining issues, warnings, or concerns
  - [ ] All quality gates pass
  - [ ] All business rules validated
  - [ ] Code meets production-ready standards

### Final Approval
- [ ] **code-reviewer has explicitly confirmed 100% satisfaction**
- [ ] No outstanding issues or concerns remain
- [ ] Task is complete and ready for production deployment