---
name: project-architecture
description: Detailed architecture, data flow, pipeline execution, dependencies, and system design for the Unify data migration project. Use when you need deep understanding of how components interact.
---

# Project Architecture

Comprehensive architecture documentation for the Unify data migration project.

## Medallion Architecture Deep Dive

### Bronze Layer

**Purpose**: Raw data ingestion from parquet files

**Location**: `python_files/pipeline_operations/bronze_layer_deployment.py`

**Process**:

1. Lists parquet files from Azure ADLS Gen2 or local storage
2. Creates bronze databases: `bronze_cms`, `bronze_fvms`, `bronze_nicherms`
3. Reads parquet files and applies basic transformations
4. Adds versioning, row hashes, and data source columns (see the sketch below)
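
A minimal sketch of these bronze steps, assuming PySpark; the file path and the `version` value are illustrative, not the project's actual identifiers:

```python
# Bronze ingestion sketch (illustrative names, not the project's actual API).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS bronze_fvms")

df = spark.read.parquet("/workspaces/data/fvms/incident.parquet")  # hypothetical path
df = (
    df.withColumn("data_source", F.lit("fvms"))  # provenance column
      .withColumn("version", F.lit(1))           # simple versioning
      .withColumn("row_hash", F.sha2(F.concat_ws("|", *df.columns), 256))  # change detection
)
df.write.mode("overwrite").saveAsTable("bronze_fvms.incident")
```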

### Silver Layer

**Purpose**: Validated, standardized data organized by source

**Location**: `python_files/silver/` (cms, fvms, nicherms subdirectories)

**Process**:

1. Drops and recreates silver databases
2. Recursively finds all Python files in `python_files/silver/`
3. Executes each silver transformation file in sorted order (see the sketch below)
4. Uses threading for parallel execution (currently commented out)
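
Steps 2 and 3 amount to a discover-and-execute loop; here is a sketch under the assumption that each silver file is a standalone script (`run_silver_layer` is a hypothetical helper, not the project's API):

```python
# Discover silver transformation scripts and execute them in sorted order.
from pathlib import Path
import runpy

def run_silver_layer(root: str = "python_files/silver") -> None:
    for script in sorted(Path(root).rglob("*.py")):  # recursive, deterministic order
        print(f"executing {script}")
        runpy.run_path(str(script), run_name="__main__")
```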

### Gold Layer

**Purpose**: Business-ready, aggregated analytical datasets

**Location**: `python_files/gold/`

**Process**:

1. Creates business-ready analytical tables in the `gold_data_model` database
2. Executes transformations from `python_files/gold/`
3. Aggregates and joins data across multiple silver tables (see the sketch below)
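
A sketch of step 3, assuming PySpark; the silver table and column names (`person_id`, `incident_count`) are invented for illustration:

```python
# Aggregate across silver tables into a business-ready gold table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

incidents = spark.table("silver_fvms.incident")
persons = spark.table("silver_fvms.person")

summary = (
    incidents.join(persons, "person_id")            # join across silver tables
             .groupBy("person_id")
             .agg(F.count("*").alias("incident_count"))
)
summary.write.mode("overwrite").saveAsTable("gold_data_model.person_incident_summary")
```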

## Data Sources

### FVMS (Family Violence Management System)

- **Tables**: 32 tables
- **Key tables**: incident, person, address, risk_assessment
- **Purpose**: Family violence incident tracking and management

### CMS (Case Management System)

- **Tables**: 19 tables
- **Key tables**: offence_report, case_file, person, victim
- **Purpose**: Criminal offence investigation and case management

### NicheRMS (Records Management System)

- **Tables**: 39 `TBL_*` tables
- **Purpose**: Legacy records management system

## Azure Integration

### Storage (ADLS Gen2)

- **Containers**: `bronze-layer`, `code-layer`, `legacy_ingestion`
- **Authentication**: Managed Identity (`AZURE_MANAGED_IDENTITY_CLIENT_ID`)
- **Path Pattern**: `abfss://container@account.dfs.core.windows.net/path` (see the example below)
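
The path pattern expands as below; the helper function is hypothetical, though the account name also appears in the environment-detection snippet later in this document:

```python
# Build an ADLS Gen2 URI from the pattern above.
def abfss_path(container: str, account: str, path: str) -> str:
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

print(abfss_path("bronze-layer", "auedatamigdevlake", "fvms/incident"))
# -> abfss://bronze-layer@auedatamigdevlake.dfs.core.windows.net/fvms/incident
```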

### Key Services

- **Key Vault**: `AuE-DataMig-Dev-KV` for secret management
- **Synapse Workspace**: `auedatamigdevsynws`
- **Spark Pool**: `dm8c64gb`

## Environment Detection Pattern

All processing scripts auto-detect their runtime environment:

```python
import os

env_vars = os.environ  # assumed setup: the snippet reads HOME from the environment

if env_vars["HOME"] == "/home/trusted-service-user":
    # Azure Synapse Analytics production environment
    import notebookutils.mssparkutils as mssparkutils
    spark = SparkOptimiser.get_optimised_spark_session()
    DATA_PATH_STRING = "abfss://code-layer@auedatamigdevlake.dfs.core.windows.net"
else:
    # Local development environment using the Docker Spark container
    from python_files.utilities.local_spark_connection import sparkConnector
    config = UtilityFunctions.get_settings_from_yaml("configuration.yaml")
    connector = sparkConnector(...)
    DATA_PATH_STRING = config["DATA_PATH_STRING"]
```

## Core Utilities Architecture

### SparkOptimiser

- Provides a configured Spark session with optimized settings (see the sketch below)
- Handles driver memory, encryption, and authentication
- Centralized session management
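
A sketch of the session factory's likely shape; the specific `config` values are assumptions, not the project's actual settings:

```python
# Centralised Spark session factory (settings shown are illustrative).
from pyspark.sql import SparkSession

class SparkOptimiser:
    @staticmethod
    def get_optimised_spark_session() -> SparkSession:
        return (
            SparkSession.builder
            .appName("unify_data_migration")
            .config("spark.driver.memory", "8g")             # driver memory
            .config("spark.io.encryption.enabled", "true")   # encryption
            .config("spark.authenticate", "true")            # authentication
            .getOrCreate()
        )
```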

### NotebookLogger

- Rich console logging with fallback to standard print (see the sketch below)
- Structured logging (info, warning, error, success)
- Graceful degradation when the Rich library is unavailable
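
The fallback pattern, sketched under the assumption that Rich is an optional dependency (the class shape is illustrative):

```python
# Rich console logging with graceful degradation to plain print.
try:
    from rich.console import Console
    _console = Console()
except ImportError:
    _console = None  # Rich unavailable; fall back to print

class NotebookLogger:
    @staticmethod
    def info(message: str) -> None:
        if _console is not None:
            _console.print(f"[bold blue]INFO[/bold blue] {message}")
        else:
            print(f"INFO {message}")
```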

### TableUtilities

- DataFrame operations (deduplication, hashing, timestamp conversion)
- `add_row_hash()`: Change detection (see the sketches below)
- `save_as_table()`: Standard table save with timestamp conversion
- `clean_date_time_columns()`: Intelligent timestamp parsing
- `drop_duplicates_simple/advanced()`: Deduplication strategies
- `filter_and_drop_column()`: Remove duplicate flags
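
Sketches of two of these helpers, assuming PySpark; the implementations are guesses at the general shape, not the project's actual code:

```python
# Assumed shapes for add_row_hash() and drop_duplicates_simple().
from pyspark.sql import DataFrame, functions as F

def add_row_hash(df: DataFrame, hash_col: str = "row_hash") -> DataFrame:
    # Hash every column so any change to a row changes its hash (change detection).
    cols = [F.col(c).cast("string") for c in df.columns]
    return df.withColumn(hash_col, F.sha2(F.concat_ws("|", *cols), 256))

def drop_duplicates_simple(df: DataFrame, key_cols: list[str]) -> DataFrame:
    # Keep one row per key: the simple deduplication strategy.
    return df.dropDuplicates(key_cols)
```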

### DAGMonitor

- Pipeline execution tracking and reporting (see the sketch below)
- Performance metrics and logging
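
The document doesn't show DAGMonitor's API, so this is only a plausible shape for step-level tracking, sketched as a timing context manager:

```python
# Track per-step execution time (hypothetical; not the real DAGMonitor interface).
import time
from contextlib import contextmanager

@contextmanager
def track_step(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{name}: {time.perf_counter() - start:.2f}s")  # performance metric

with track_step("silver_layer"):
    pass  # the layer's transformations would run here
```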

## Configuration Management

### configuration.yaml

Central YAML configuration includes:

- **Data Sources**: FVMS, CMS, NicheRMS table lists (`*_IN_SCOPE` variables)
- **Azure Settings**: Storage accounts, Key Vault, Synapse workspace, subscription IDs
- **Spark Settings**: Driver, encryption, authentication scheme
- **Data Paths**: Local (`/workspaces/data`) vs Azure (`abfss://`)
- **Logging**: LOG_LEVEL, LOG_ROTATION, LOG_RETENTION
- **Nulls Handling**: STRING_NULL_REPLACEMENT, NUMERIC_NULL_REPLACEMENT, TIMESTAMP_NULL_REPLACEMENT
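
A minimal loader in the spirit of the `UtilityFunctions.get_settings_from_yaml("configuration.yaml")` call shown earlier, assuming PyYAML:

```python
# Load the central YAML configuration into a dict.
import yaml

def get_settings_from_yaml(path: str = "configuration.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)

config = get_settings_from_yaml()
print(config["DATA_PATH_STRING"])  # e.g. /workspaces/data in local development
```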

## Error Handling Strategy

- **Decorator-Based**: `@synapse_error_print_handler` for consistent error handling (see the sketch below)
- **Loguru Integration**: Structured logging with proper levels
- **Graceful Degradation**: Handle missing dependencies (Rich library fallback)
- **Context Information**: Include table/database names in all log messages
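
A sketch of what a decorator like `@synapse_error_print_handler` could look like; the real implementation isn't shown in this document:

```python
# Consistent error handling: log with context via Loguru, then re-raise.
import functools
from loguru import logger

def synapse_error_print_handler(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception(f"error in {func.__name__}")  # context + traceback
            raise
    return wrapper

@synapse_error_print_handler
def load_table(table_name: str) -> None:
    logger.info(f"loading table {table_name}")
```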

## Local Data Filtering

`TableUtilities.save_as_table()` automatically filters to the last N years when a `date_created` column exists, controlled by the `NUMBER_OF_YEARS` global variable in `session_optimiser.py`. This prevents full-dataset processing in local development.
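
The filter presumably reduces to something like the following (an assumption about `save_as_table()` internals; the cutoff arithmetic is illustrative):

```python
# Keep only the last NUMBER_OF_YEARS of data when date_created exists.
from pyspark.sql import DataFrame, functions as F

NUMBER_OF_YEARS = 3  # illustrative; the real value lives in session_optimiser.py

def filter_recent(df: DataFrame) -> DataFrame:
    if "date_created" in df.columns:
        cutoff = F.add_months(F.current_date(), -12 * NUMBER_OF_YEARS)
        return df.filter(F.col("date_created") >= cutoff)
    return df
```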

## Testing Architecture

### Test Structure

- `python_files/testing/`: Unit and integration tests
- `medallion_testing.py`: Full pipeline validation
- `bronze_layer_validation.py`: Bronze layer tests
- `ingestion_layer_validation.py`: Ingestion tests

### Testing Strategy

- pytest integration with PySpark environments (see the fixture sketch below)
- Quality gates: syntax validation and linting before completion
- Integration tests for the full medallion flow
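
A typical shape for the pytest/PySpark integration, assuming a session-scoped fixture in `conftest.py` (illustrative, not the project's actual tests):

```python
# Shared local Spark session for tests.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()

def test_row_count(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2
```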

## DuckDB Integration

After running the pipelines, build a local DuckDB database for fast SQL analysis:

- **File**: `/workspaces/data/warehouse.duckdb`
- **Command**: `make build_duckdb`
- **Purpose**: Fast local queries without an Azure connection
- **Contains**: All bronze, silver, and gold layer tables
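
Querying the warehouse locally, assuming the `duckdb` Python package (the table name is illustrative, since the document doesn't specify how layer tables are named inside DuckDB):

```python
# Fast local SQL against the built warehouse, no Azure connection needed.
import duckdb

con = duckdb.connect("/workspaces/data/warehouse.duckdb", read_only=True)
print(con.execute("SHOW TABLES").fetchall())
print(con.execute("SELECT COUNT(*) FROM bronze_fvms_incident").fetchone())  # hypothetical table
con.close()
```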

## Recent Architectural Changes

### Path Migration

- Standardized all paths to use `unify_2_1_dm_synapse_env_d10`
- Improved portability and environment consistency
- 12 files updated across utilities, notebooks, and configurations

### Code Cleanup

- Removed unused utilities: `file_executor.py`, `file_finder.py`
- Reduced codebase complexity
- Regular cleanup pattern for maintainability