---
allowed-tools: Read, Write, Edit, Bash
argument-hint: [target-file] | [test-type] | --unit | --integration | --data-validation | --medallion
description: Write comprehensive pytest tests for PySpark data pipelines with live data validation
model: sonnet
---

# Write Tests - pytest + PySpark with Live Data

Write comprehensive pytest tests for PySpark data pipelines using **LIVE DATA** sources: **$ARGUMENTS**

## Current Testing Context
- Test framework: !`[ -f pytest.ini ] && echo "pytest configured" || echo "pytest setup needed"`
- Target: $ARGUMENTS (file/layer to test)
- Test location: !`ls -d tests/ test/ 2>/dev/null | head -1 || echo "tests/ (will create)"`
- Live data available: Bronze/Silver/Gold layers with real FVMS, CMS, NicheRMS tables

## Core Principle: TEST WITH LIVE DATA

**ALWAYS use real data from Bronze/Silver/Gold layers**. No mocked data unless absolutely necessary.
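
In practice that means a test's input is a `spark.table()` read, capped with `.limit()` when speed matters - a minimal sketch (the table name matches the fixtures below):

```python
# Real rows from the live Bronze layer, capped so the test stays fast
live_sample = spark.table("bronze_fvms.b_vehicle_master").limit(100)
```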
## pytest Testing Framework

### 1. Test File Organization

```
tests/
├── conftest.py                     # Shared fixtures (Spark session, live data)
├── test_bronze_ingestion.py        # Bronze layer validation
├── test_silver_transformations.py  # Silver layer ETL
├── test_gold_aggregations.py       # Gold layer analytics
├── test_utilities.py               # TableUtilities, NotebookLogger
└── integration/
    └── test_end_to_end_pipeline.py
```

### 2. Essential pytest Fixtures (conftest.py)
```python
import pytest
from python_files.utilities.session_optimiser import SparkOptimiser


@pytest.fixture(scope="session")
def spark():
    """Shared Spark session for all tests - reuses SparkOptimiser"""
    session = SparkOptimiser.get_optimised_spark_session()
    yield session
    session.stop()


@pytest.fixture(scope="session")
def bronze_data(spark):
    """Live bronze layer data - REAL DATA"""
    return spark.table("bronze_fvms.b_vehicle_master")


@pytest.fixture(scope="session")
def silver_data(spark):
    """Live silver layer data - REAL DATA"""
    return spark.table("silver_fvms.s_vehicle_master")


@pytest.fixture
def sample_live_data(bronze_data):
    """Small sample from live data for fast tests"""
    return bronze_data.limit(100)
```
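When a test must write output, hand it a scratch location through a fixture that cleans up afterwards - a sketch, where the `tmp_test_outputs` schema/table name is illustrative, not an existing one:

```python
@pytest.fixture
def scratch_table(spark):
    """Yield a scratch table name and drop it after the test"""
    name = "tmp_test_outputs.scratch_vehicle_master"  # illustrative schema/table name
    yield name
    spark.sql(f"DROP TABLE IF EXISTS {name}")
```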
### 3. pytest Test Patterns

#### Pattern 1: Unit Tests (Individual Functions)

```python
# tests/test_utilities.py
import pytest
from python_files.utilities.session_optimiser import TableUtilities


class TestTableUtilities:
    def test_add_row_hash_creates_hash_column(self, spark, sample_live_data):
        """Verify add_row_hash() creates hash_key column"""
        result = TableUtilities.add_row_hash(sample_live_data, ["vehicle_id"])
        assert "hash_key" in result.columns
        assert result.count() == sample_live_data.count()

    def test_drop_duplicates_simple_removes_exact_duplicates(self, spark):
        """Test deduplication on live data"""
        # Use LIVE data with known duplicates
        raw_data = spark.table("bronze_fvms.b_vehicle_events")
        result = TableUtilities.drop_duplicates_simple(raw_data)
        assert result.count() <= raw_data.count()

    @pytest.mark.parametrize("date_col", ["created_date", "updated_date", "event_date"])
    def test_clean_date_time_columns_handles_all_formats(self, spark, bronze_data, date_col):
        """Parameterized test for date cleaning"""
        # Skip explicitly rather than silently passing when the column is absent
        if date_col not in bronze_data.columns:
            pytest.skip(f"{date_col} not present in live bronze table")
        result = TableUtilities.clean_date_time_columns(bronze_data, [date_col])
        assert date_col in result.columns
```
#### Pattern 2: Integration Tests (End-to-End)

```python
# tests/integration/test_end_to_end_pipeline.py
import pytest
from python_files.silver.fvms.s_vehicle_master import VehicleMaster


@pytest.mark.integration
class TestSilverVehicleMasterPipeline:
    def test_full_etl_with_live_bronze_data(self, spark):
        """Test complete Bronze → Silver transformation with LIVE data"""
        # Extract: read LIVE bronze data
        bronze_table = "bronze_fvms.b_vehicle_master"
        bronze_df = spark.table(bronze_table)
        initial_count = bronze_df.count()

        # Transform & Load: run the actual ETL class
        # (assumes instantiating VehicleMaster triggers the transform-and-load run)
        etl = VehicleMaster(bronze_table_name=bronze_table)

        # Validate: check LIVE silver output
        silver_df = spark.table("silver_fvms.s_vehicle_master")
        assert silver_df.count() > 0
        assert silver_df.count() <= initial_count  # deduplication never adds rows
        assert "hash_key" in silver_df.columns
        assert "load_timestamp" in silver_df.columns

        # Data quality: no nulls in critical fields
        assert silver_df.filter("vehicle_id IS NULL").count() == 0
```
#### Pattern 3: Data Validation (Live Data Checks)

```python
# tests/test_data_validation.py
from datetime import datetime

import pytest
from pyspark.sql import functions as F


class TestBronzeLayerDataQuality:
    """Validate live data quality in Bronze layer"""

    def test_bronze_vehicle_master_has_recent_data(self, spark):
        """Verify bronze layer contains recent records"""
        df = spark.table("bronze_fvms.b_vehicle_master")
        max_date = df.select(F.max("load_timestamp")).collect()[0][0]

        # Data should be less than 30 days old; collect() returns a Python
        # datetime, so compare with datetime.now(), not the current_date() column
        assert (datetime.now() - max_date).days <= 30

    def test_bronze_to_silver_row_counts_match_expectations(self, spark):
        """Validate row count transformation logic"""
        bronze = spark.table("bronze_fvms.b_vehicle_master")
        silver = spark.table("silver_fvms.s_vehicle_master")

        # After deduplication, silver <= bronze
        assert silver.count() <= bronze.count()

    @pytest.mark.slow
    def test_hash_key_uniqueness_on_live_data(self, spark):
        """Verify hash_key uniqueness in Silver layer (full scan)"""
        df = spark.table("silver_fvms.s_vehicle_master")
        total = df.count()
        unique = df.select("hash_key").distinct().count()

        assert total == unique, f"Duplicate hash_keys found: {total - unique}"
```
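Since these checks need warehouse access, it helps to tag the whole module with the `live_data` marker defined in the pytest.ini configuration below (section 4) - a sketch using pytest's module-level `pytestmark`:

```python
# tests/test_data_validation.py (top of module)
import pytest

# Applies to every test in this module; deselect with: pytest -m "not live_data"
pytestmark = pytest.mark.live_data
```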
#### Pattern 4: Schema Validation

```python
# tests/test_schema_validation.py
import pytest
from pyspark.sql.types import StringType, TimestampType


class TestSchemaConformance:
    def test_silver_vehicle_schema_matches_expected(self, spark):
        """Validate Silver layer schema against business requirements"""
        df = spark.table("silver_fvms.s_vehicle_master")
        schema_dict = {field.name: field.dataType for field in df.schema.fields}

        # Critical fields must exist
        assert "vehicle_id" in schema_dict
        assert "hash_key" in schema_dict
        assert "load_timestamp" in schema_dict

        # Type validation
        assert isinstance(schema_dict["vehicle_id"], StringType)
        assert isinstance(schema_dict["load_timestamp"], TimestampType)
```
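To pin types beyond spot checks, compare against an explicit `StructType` contract - a sketch; the pinned fields below are illustrative, and only the pinned subset is compared so incidental columns don't fail the test:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative contract - replace with the real business-agreed fields
PINNED_SCHEMA = StructType([
    StructField("vehicle_id", StringType(), False),
    StructField("hash_key", StringType(), False),
    StructField("load_timestamp", TimestampType(), True),
])


def test_pinned_fields_match_contract(spark):
    """Pinned columns must keep their agreed data types"""
    df = spark.table("silver_fvms.s_vehicle_master")
    actual = {f.name: f.dataType for f in df.schema.fields}
    for field in PINNED_SCHEMA.fields:
        assert actual.get(field.name) == field.dataType
```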
### 4. pytest Markers & Configuration

**pytest.ini**:

```ini
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
markers =
    slow: marks tests as slow (deselect with '-m "not slow"')
    integration: marks tests as integration tests
    unit: marks tests as unit tests
    live_data: tests that require live data access
addopts =
    -v
    --tb=short
    --strict-markers
    --disable-warnings
```
**Run specific test types**:

```bash
pytest tests/test_utilities.py -v    # Single file
pytest -m unit                       # Only unit tests
pytest -m "not slow"                 # Skip slow tests
pytest -k "vehicle"                  # Tests matching "vehicle"
pytest --maxfail=1                   # Stop on first failure
pytest -n auto                       # Parallel execution (requires pytest-xdist)
```
### 5. Advanced pytest Features

#### Parametrized Tests

```python
@pytest.mark.parametrize("table_name,expected_min_count", [
    ("bronze_fvms.b_vehicle_master", 1000),
    ("bronze_cms.b_customer_master", 500),
    ("bronze_nicherms.b_booking_master", 2000),
])
def test_bronze_tables_have_minimum_rows(spark, table_name, expected_min_count):
    """Validate minimum row counts across multiple live tables"""
    df = spark.table(table_name)
    assert df.count() >= expected_min_count
```
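When one table in the list fails, readable test IDs make the report easier to scan; `parametrize` accepts an `ids` argument for this (the ID strings are arbitrary labels):

```python
@pytest.mark.parametrize(
    "table_name,expected_min_count",
    [
        ("bronze_fvms.b_vehicle_master", 1000),
        ("bronze_cms.b_customer_master", 500),
        ("bronze_nicherms.b_booking_master", 2000),
    ],
    ids=["fvms_vehicles", "cms_customers", "nicherms_bookings"],
)
def test_bronze_tables_have_minimum_rows(spark, table_name, expected_min_count):
    """Validate minimum row counts across multiple live tables"""
    assert spark.table(table_name).count() >= expected_min_count
```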
#### Fixtures with Live Data Sampling

```python
@pytest.fixture
def stratified_sample(bronze_data):
    """Stratified sample from live data for statistical tests"""
    # NOTE: fraction keys must match actual vehicle_type values in the live data;
    # a fixed seed keeps the sample reproducible across runs
    return bronze_data.sampleBy("vehicle_type", fractions={"Car": 0.1, "Truck": 0.1}, seed=42)
```
### 6. Testing Best Practices

**DO**:
- ✅ Use `spark.table()` to read LIVE Bronze/Silver/Gold data
- ✅ Test with `.limit(100)` for speed, the full dataset for validation
- ✅ Use `@pytest.fixture(scope="session")` for the Spark session (reuse)
- ✅ Test actual ETL classes (e.g., `VehicleMaster()`)
- ✅ Validate data quality (nulls, duplicates, date ranges)
- ✅ Use `pytest.mark.parametrize` for testing multiple tables
- ✅ Clean up test outputs in teardown fixtures (see the scratch-table sketch in section 2)

**DON'T**:
- ❌ Create mock/fake data (use real data samples)
- ❌ Skip testing because "data is too large" (use `.limit()`)
- ❌ Write tests that modify production tables
- ❌ Ignore schema validation
- ❌ Forget to test error handling with real edge cases (see the sketch after this list)
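
For that last point, drive error paths with oddities that actually occur in the live tables - a sketch, assuming `b_vehicle_events` contains rows with NULL `created_date` (as the parametrized date columns above suggest) and that cleaning preserves row count:

```python
from pyspark.sql.functions import col
from python_files.utilities.session_optimiser import TableUtilities


def test_clean_dates_handles_null_values_from_live_data(spark):
    """Edge case: live rows with NULL created_date must not break cleaning"""
    nulls = (
        spark.table("bronze_fvms.b_vehicle_events")
        .filter(col("created_date").isNull())
        .limit(50)
    )
    result = TableUtilities.clean_date_time_columns(nulls, ["created_date"])
    assert result.count() == nulls.count()  # no rows silently dropped
```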
### 7. Example: Complete Test File

```python
# tests/test_silver_vehicle_master.py
import pytest
from pyspark.sql.functions import col


class TestSilverVehicleMaster:
    """Test Silver layer VehicleMaster ETL with LIVE data"""

    @pytest.fixture(scope="class")
    def silver_df(self, spark):
        """Live Silver data - computed once per test class"""
        return spark.table("silver_fvms.s_vehicle_master")

    def test_all_required_columns_exist(self, silver_df):
        """Validate schema completeness"""
        required = ["vehicle_id", "hash_key", "load_timestamp", "registration_number"]
        missing = [c for c in required if c not in silver_df.columns]
        assert not missing, f"Missing columns: {missing}"

    def test_no_nulls_in_primary_key(self, silver_df):
        """Primary key cannot be null"""
        null_count = silver_df.filter(col("vehicle_id").isNull()).count()
        assert null_count == 0

    def test_hash_key_generated_for_all_rows(self, silver_df):
        """Every row must have hash_key"""
        total = silver_df.count()
        with_hash = silver_df.filter(col("hash_key").isNotNull()).count()
        assert total == with_hash

    @pytest.mark.slow
    def test_deduplication_effectiveness(self, spark):
        """Compare Bronze vs Silver row counts"""
        bronze = spark.table("bronze_fvms.b_vehicle_master")
        silver = spark.table("silver_fvms.s_vehicle_master")

        bronze_count = bronze.count()
        silver_count = silver.count()
        dedup_rate = (bronze_count - silver_count) / bronze_count * 100

        print(f"Deduplication removed {dedup_rate:.2f}% of rows")
        assert silver_count <= bronze_count
```
## Execution Workflow

1. **Read target file** ($ARGUMENTS) - Understand transformation logic
2. **Identify live data sources** - Find Bronze/Silver tables used
3. **Create test file** - `tests/test_<target>.py`
4. **Write fixtures** - Set up Spark session, load live data samples
5. **Write unit tests** - Test individual utility functions
6. **Write integration tests** - Test full ETL with live data
7. **Write validation tests** - Check data quality on live tables
8. **Run tests**: `pytest tests/test_<target>.py -v`
9. **Verify coverage**: Ensure >80% coverage of transformation logic (e.g. `pytest --cov=python_files --cov-fail-under=80` with pytest-cov)
## Output Deliverables

- ✅ pytest test file with 10+ test cases
- ✅ conftest.py with reusable fixtures
- ✅ pytest.ini configuration
- ✅ Tests use LIVE data from Bronze/Silver/Gold
- ✅ All tests pass: `pytest -v`
- ✅ Documentation comments showing live data usage