---
name: project-architecture
description: Detailed architecture, data flow, pipeline execution, dependencies, and system design for the Unify data migration project. Use when you need deep understanding of how components interact.
---

# Project Architecture

Comprehensive architecture documentation for the Unify data migration project.

## Medallion Architecture Deep Dive

### Bronze Layer

**Purpose**: Raw data ingestion from parquet files

**Location**: `python_files/pipeline_operations/bronze_layer_deployment.py`

**Process**:
1. Lists parquet files from Azure ADLS Gen2 or local storage
2. Creates bronze databases: `bronze_cms`, `bronze_fvms`, `bronze_nicherms`
3. Reads parquet files and applies basic transformations
4. Adds versioning, row hashes, and data source columns

### Silver Layer

**Purpose**: Validated, standardized data organized by source

**Location**: `python_files/silver/` (cms, fvms, nicherms subdirectories)

**Process**:
1. Drops and recreates silver databases
2. Recursively finds all Python files in `python_files/silver/`
3. Executes each silver transformation file in sorted order
4. Uses threading for parallel execution (currently commented out)

### Gold Layer

**Purpose**: Business-ready, aggregated analytical datasets

**Location**: `python_files/gold/`

**Process**:
1. Creates business-ready analytical tables in the `gold_data_model` database
2. Executes transformations from `python_files/gold/`
3. Aggregates and joins data across multiple silver tables

## Data Sources

### FVMS (Family Violence Management System)
- **Tables**: 32 tables
- **Key tables**: incident, person, address, risk_assessment
- **Purpose**: Family violence incident tracking and management

### CMS (Case Management System)
- **Tables**: 19 tables
- **Key tables**: offence_report, case_file, person, victim
- **Purpose**: Criminal offence investigation and case management

### NicheRMS (Records Management System)
- **Tables**: 39 `TBL_*` tables
- **Purpose**: Legacy records management system

## Azure Integration

### Storage (ADLS Gen2)
- **Containers**: `bronze-layer`, `code-layer`, `legacy_ingestion`
- **Authentication**: Managed Identity (`AZURE_MANAGED_IDENTITY_CLIENT_ID`)
- **Path Pattern**: `abfss://container@account.dfs.core.windows.net/path`

### Key Services
- **Key Vault**: `AuE-DataMig-Dev-KV` for secret management
- **Synapse Workspace**: `auedatamigdevsynws`
- **Spark Pool**: `dm8c64gb`

## Environment Detection Pattern

All processing scripts auto-detect their runtime environment:

```python
# env_vars holds the process environment (e.g. os.environ)
if env_vars["HOME"] == "/home/trusted-service-user":
    # Azure Synapse Analytics production environment
    import notebookutils.mssparkutils as mssparkutils

    spark = SparkOptimiser.get_optimised_spark_session()
    DATA_PATH_STRING = "abfss://code-layer@auedatamigdevlake.dfs.core.windows.net"
else:
    # Local development environment using the Docker Spark container
    from python_files.utilities.local_spark_connection import sparkConnector

    config = UtilityFunctions.get_settings_from_yaml("configuration.yaml")
    connector = sparkConnector(...)
    DATA_PATH_STRING = config["DATA_PATH_STRING"]
```

## Core Utilities Architecture

### SparkOptimiser
- Configured Spark session with optimized settings
- Handles driver memory, encryption, authentication
- Centralized session management

### NotebookLogger
- Rich console logging with fallback to standard print
- Structured logging (info, warning, error, success)
- Graceful degradation when the Rich library is unavailable

### TableUtilities
- DataFrame operations (deduplication, hashing, timestamp conversion)
- `add_row_hash()`: Change detection
- `save_as_table()`: Standard table save with timestamp conversion
- `clean_date_time_columns()`: Intelligent timestamp parsing
- `drop_duplicates_simple/advanced()`: Deduplication strategies
- `filter_and_drop_column()`: Remove duplicate flags

### DAGMonitor
- Pipeline execution tracking and reporting
- Performance metrics and logging

## Configuration Management

### configuration.yaml

Central YAML configuration includes:
- **Data Sources**: FVMS, CMS, NicheRMS table lists (`*_IN_SCOPE` variables)
- **Azure Settings**: Storage accounts, Key Vault, Synapse workspace, subscription IDs
- **Spark Settings**: Driver, encryption, authentication scheme
- **Data Paths**: Local (`/workspaces/data`) vs Azure (`abfss://`)
- **Logging**: LOG_LEVEL, LOG_ROTATION, LOG_RETENTION
- **Nulls Handling**: STRING_NULL_REPLACEMENT, NUMERIC_NULL_REPLACEMENT, TIMESTAMP_NULL_REPLACEMENT

## Error Handling Strategy

- **Decorator-Based**: `@synapse_error_print_handler` for consistent error handling (sketched below)
- **Loguru Integration**: Structured logging with proper levels
- **Graceful Degradation**: Handle missing dependencies (Rich library fallback)
- **Context Information**: Include table/database names in all log messages
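The decorator's exact implementation lives in the utilities layer; as a rough illustration of the pattern only, a minimal sketch might look like the following. Only the decorator's name comes from this codebase; the body, the Loguru call, and the `save_table` example are assumptions.

```python
# Hypothetical sketch of a decorator in the style of @synapse_error_print_handler.
# The real implementation in this repo may differ.
import functools

from loguru import logger


def synapse_error_print_handler(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log with context (function name and arguments), then re-raise
            logger.exception(f"Error in {func.__name__} with args={args!r}")
            raise

    return wrapper


@synapse_error_print_handler
def save_table(df, table_name: str):  # illustrative only
    ...
```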
## Local Data Filtering

`TableUtilities.save_as_table()` automatically filters to the last N years when a `date_created` column exists, controlled by the `NUMBER_OF_YEARS` global variable in `session_optimiser.py`. This prevents processing the full dataset during local development.

## Testing Architecture

### Test Structure
- `python_files/testing/`: Unit and integration tests
- `medallion_testing.py`: Full pipeline validation
- `bronze_layer_validation.py`: Bronze layer tests
- `ingestion_layer_validation.py`: Ingestion tests

### Testing Strategy
- pytest integration with PySpark environments
- Quality gates: syntax validation and linting before completion
- Integration tests for the full medallion flow

## DuckDB Integration

After running pipelines, build a local DuckDB database for fast SQL analysis (a usage sketch appears at the end of this document):
- **File**: `/workspaces/data/warehouse.duckdb`
- **Command**: `make build_duckdb`
- **Purpose**: Fast local queries without an Azure connection
- **Contains**: All bronze, silver, and gold layer tables

## Recent Architectural Changes

### Path Migration
- Standardized all paths to use `unify_2_1_dm_synapse_env_d10`
- Improved portability and environment consistency
- 12 files updated across utilities, notebooks, and configurations

### Code Cleanup
- Removed unused utilities: `file_executor.py`, `file_finder.py`
- Reduced codebase complexity
- Regular cleanup pattern for maintainability
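Closing out, a minimal usage sketch for the DuckDB warehouse described under DuckDB Integration above. The file path and `make` target come from this document; the schema-qualified table name is an assumption and should be checked against the actual build output.

```python
# Hypothetical queries against the warehouse built by `make build_duckdb`.
import duckdb

con = duckdb.connect("/workspaces/data/warehouse.duckdb", read_only=True)

# List every table materialised across the bronze/silver/gold layers
print(con.sql("SHOW ALL TABLES"))

# Example count against an assumed bronze table (verify the name locally)
print(con.sql("SELECT COUNT(*) AS row_count FROM bronze_cms.person"))
```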