Initial commit

Author: Zhongwei Li
Date:   2025-11-30 08:37:55 +08:00
Commit: 506a828b22

59 changed files with 18515 additions and 0 deletions

commands/background.md (new executable file, 237 lines)

---
description: Fires off an agent in the background to complete tasks autonomously
argument-hint: [user-prompt] | [task-file-name]
allowed-tools: Read, Task, TodoWrite
---
# Background PySpark Data Engineer Agent
Launch a PySpark data engineer agent to work autonomously in the background on ETL tasks, data pipeline fixes, or code reviews.
## Usage
**Option 1: Direct prompt**
```
/background "Fix the validation issues in g_xa_mg_statsclasscount.py"
```
**Option 2: Task file from .claude/tasks/**
```
/background code_review_fixes_task_list.md
```
## Variables
- `TASK_INPUT`: Either a direct prompt string or a task file name from `.claude/tasks/`
- `TASK_FILE_PATH`: Full path to task file if using a task file
- `PROMPT_CONTENT`: The actual prompt to send to the agent
## Instructions
### 1. Determine Task Source
Check if `$ARGUMENTS` looks like a file name (ends with `.md` or contains no spaces):
- If YES: It's a task file name from `.claude/tasks/`
- If NO: It's a direct user prompt
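A minimal bash sketch of this heuristic, assuming the raw arguments are available in a shell variable (the variable name and the listing fallback are illustrative, not part of the command contract):
```bash
#!/usr/bin/env bash
ARGS="$*"   # raw command arguments (illustrative)
if [[ -z "$ARGS" || "$ARGS" == "list" ]]; then
  echo "No prompt given: list task files in .claude/tasks/"
elif [[ "$ARGS" == *.md || "$ARGS" != *" "* ]]; then
  echo "Treat as task file: .claude/tasks/$ARGS"
else
  echo "Treat as direct prompt: $ARGS"
fi
```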
### 2. Load Task Content
**If using task file:**
1. List all available task files in `.claude/tasks/` directory
2. Find the task file matching the provided name (exact match or partial match)
3. Read the task file content
4. Use the full task file content as the prompt
**If using direct prompt:**
1. Use the `$ARGUMENTS` directly as the prompt
### 3. Launch PySpark Data Engineer Agent
Launch the specialized `pyspark-data-engineer` agent using the Task tool:
**Important Configuration:**
- **subagent_type**: `pyspark-data-engineer`
- **model**: `sonnet` (default) or `opus` for complex tasks
- **description**: Short 3-5 word description based on task type
- **prompt**: Complete, detailed instructions including:
- The task content (from file or direct prompt)
- Explicit instruction to follow `.claude/CLAUDE.md` best practices
- Instruction to run quality gates (syntax check, linting, formatting)
- Instruction to create a comprehensive final report
**Prompt Template:**
```
You are a PySpark data engineer working on the Unify 2.1 Data Migration project using Azure Synapse Analytics.
CRITICAL INSTRUCTIONS:
- Read and follow ALL guidelines in .claude/CLAUDE.md
- Use .claude/rules/python_rules.md for coding standards
- Maximum line length: 240 characters
- No blank lines inside functions
- Use @synapse_error_print_handler decorator on all methods
- Use NotebookLogger for all logging (not print statements)
- Use TableUtilities methods for DataFrame operations
TASK TO COMPLETE:
{TASK_CONTENT}
QUALITY GATES (MUST RUN BEFORE COMPLETION):
1. Syntax validation: python3 -m py_compile <file_path>
2. Linting: ruff check python_files/
3. Formatting: ruff format python_files/
FINAL REPORT REQUIREMENTS:
Provide a comprehensive report including:
1. Summary of changes made
2. Files modified with line numbers
3. Quality gate results (syntax, linting, formatting)
4. Testing recommendations
5. Any issues encountered and resolutions
6. Next steps or follow-up tasks
Work autonomously and complete all tasks in the list. Use your available tools to read files, make edits, run tests, and validate your work.
```
### 4. Inform User
After launching the agent, inform the user:
- Agent has been launched in the background
- Task being worked on (summary)
- Estimated completion time (if known from task file)
- The agent will work autonomously and provide a final report
## Task File Structure
Expected task file format in `.claude/tasks/`:
```markdown
# Task Title
**Date Created**: YYYY-MM-DD
**Priority**: HIGH/MEDIUM/LOW
**Estimated Total Time**: X minutes
**Files Affected**: N
## Task 1: Description
**File**: path/to/file.py
**Line**: 123
**Estimated Time**: X minutes
**Severity**: CRITICAL/HIGH/MEDIUM/LOW
**Current Code**:
```python
# code
```
**Required Fix**:
```python
# fixed code
```
**Reason**: Explanation
**Testing**: How to verify
---
(Repeat for each task)
```
## Examples
### Example 1: Using Task File
```
User: /background code_review_fixes_task_list.md
Agent Response:
1. Lists available task files
2. Finds and reads code_review_fixes_task_list.md
3. Launches pyspark-data-engineer agent with task content
4. Informs user: "PySpark data engineer agent launched to complete 9 code review fixes (est. 27 minutes)"
```
### Example 2: Using Direct Prompt
```
User: /background "Add data validation methods to the statsclasscount gold table and ensure they are called in the transform method"
Agent Response:
1. Uses the prompt directly
2. Launches pyspark-data-engineer agent with the prompt
3. Informs user: "PySpark data engineer agent launched to add data validation methods"
```
### Example 3: Partial Task File Name Match
```
User: /background code_review
Agent Response:
1. Lists task files and finds "code_review_fixes_task_list.md"
2. Confirms match with user or proceeds if unambiguous
3. Launches agent with task content
```
## Available Task Files
List the available task files from the `.claude/tasks/` directory when the user runs the command without arguments or with the `list` argument:
```
/background
/background list
```
Output:
```
Available task files in .claude/tasks/:
1. code_review_fixes_task_list.md (9 tasks, 27 min, HIGH priority)
Usage:
/background <task-file-name> - Run agent with task file
/background "your prompt" - Run agent with direct prompt
/background list - Show available task files
```
## Agent Workflow
The pyspark-data-engineer agent will:
1. **Read Context**: Load .claude/CLAUDE.md, .claude/rules/python_rules.md
2. **Analyze Tasks**: Break down task list into actionable items
3. **Execute Changes**: Read files, make edits, apply fixes
4. **Validate Work**: Run syntax checks, linting, formatting
5. **Test Changes**: Execute relevant tests if available
6. **Generate Report**: Comprehensive summary of all work completed
## Best Practices
### For Task Files
- Keep tasks atomic and well-defined
- Include file paths and line numbers
- Provide current code and required fix
- Specify testing requirements
- Estimate time for each task
- Prioritize tasks (CRITICAL, HIGH, MEDIUM, LOW)
### For Direct Prompts
- Be specific about files and functionality
- Reference table/database names
- Specify layer (bronze, silver, gold)
- Include any business requirements
- Mention quality requirements
## Success Criteria
Agent task completion requires:
- ✅ All code changes implemented
- ✅ Syntax validation passes (python3 -m py_compile)
- ✅ Linting passes (ruff check)
- ✅ Code formatted (ruff format)
- ✅ No new issues introduced
- ✅ Comprehensive final report provided
## Notes
- The agent has access to all project files and tools
- It follows medallion architecture patterns (bronze/silver/gold)
- It uses established utilities (SparkOptimiser, TableUtilities, NotebookLogger)
- It respects project coding standards (240 char lines, no blanks in functions)
- It works autonomously without requiring additional user input
- Results are reported back when complete

commands/branch-cleanup.md (new executable file, 181 lines)

---
allowed-tools: Bash(git branch:*), Bash(git checkout:*), Bash(git push:*), Bash(git merge:*), Bash(gh:*), Read, Grep
argument-hint: [--dry-run] | [--force] | [--remote-only] | [--local-only]
description: Use PROACTIVELY to clean up merged branches and stale remotes, and to organize branch structure
---
# Git Branch Cleanup & Organization
Clean up merged branches and organize repository structure: $ARGUMENTS
## Current Repository State
- All branches: !`git branch -a`
- Recent branches: !`git for-each-ref --count=10 --sort=-committerdate refs/heads/ --format='%(refname:short) - %(committerdate:relative)'`
- Remote branches: !`git branch -r`
- Merged branches: !`git branch --merged main 2>/dev/null || git branch --merged master 2>/dev/null || echo "No main/master branch found"`
- Current branch: !`git branch --show-current`
## Task
Perform comprehensive branch cleanup and organization based on the repository state and provided arguments.
## Cleanup Operations
### 1. Identify Branches for Cleanup
- **Merged branches**: Find local branches already merged into main/master
- **Stale remote-tracking branches**: Identify remote-tracking branches whose upstream branches no longer exist on the remote
- **Old branches**: Detect branches with no recent activity (>30 days)
- **Feature branches**: Organize `feature/*`, `hotfix/*`, and `release/*` branches
### 2. Safety Checks Before Deletion
- Verify branches are actually merged using `git merge-base`
- Check if branches have unpushed commits
- Confirm branches aren't the current working branch
- Validate against protected branch patterns
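A hedged sketch of these checks for a single candidate branch, assuming `main` is the integration branch and `$BRANCH` names the branch under review:
```bash
#!/usr/bin/env bash
BRANCH="feature/example"   # candidate branch (illustrative)
# Merged check: the branch tip must be an ancestor of main.
if git merge-base --is-ancestor "$BRANCH" main; then
  echo "$BRANCH is fully merged into main"
fi
# Unpushed-commit check: anything on the branch but not on its upstream?
if [ -n "$(git log --oneline "${BRANCH}@{upstream}..${BRANCH}" 2>/dev/null)" ]; then
  echo "$BRANCH has unpushed commits; do not delete"
fi
# Current-branch check: never delete the branch that is checked out.
if [ "$(git branch --show-current)" = "$BRANCH" ]; then
  echo "Refusing to touch the current branch: $BRANCH"
fi
```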
### 3. Branch Categories to Handle
- **Safe to delete**: Merged feature branches, old hotfix branches
- **Needs review**: Unmerged branches with old commits
- **Keep**: Main branches (main, master, develop), active feature branches
- **Archive**: Long-running branches that might need preservation
### 4. Remote Branch Synchronization
- Remove remote-tracking branches for branches deleted on the remote
- Prune remote references with `git remote prune origin`
- Update branch tracking relationships
- Clean up remote branch references
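One way these steps can look in practice (the `origin` remote name is an assumption):
```bash
# Remove remote-tracking refs for branches that were deleted on origin.
git fetch --prune origin
# Equivalent standalone prune of stale remote-tracking references.
git remote prune origin
# List local branches whose upstream is gone (review before deleting).
git for-each-ref --format='%(refname:short) %(upstream:track)' refs/heads \
  | awk '$2 == "[gone]" {print $1}'
```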
## Command Modes
### Default Mode (Interactive)
1. Show branch analysis with recommendations
2. Ask for confirmation before each deletion
3. Provide summary of actions taken
4. Offer to push deletions to remote
### Dry Run Mode (`--dry-run`)
1. Show what would be deleted without making changes
2. Display branch analysis and recommendations
3. Provide cleanup statistics
4. Exit without modifying repository
### Force Mode (`--force`)
1. Delete merged branches without confirmation
2. Clean up stale remotes automatically
3. Provide summary of all actions taken
4. Use with caution - no undo capability
### Remote Only (`--remote-only`)
1. Only clean up remote-tracking branches
2. Synchronize with actual remote state
3. Remove stale remote references
4. Keep all local branches intact
### Local Only (`--local-only`)
1. Only clean up local branches
2. Don't affect remote-tracking branches
3. Keep remote synchronization intact
4. Focus on local workspace organization
## Safety Features
### Pre-cleanup Validation
- Ensure working directory is clean
- Check for uncommitted changes
- Verify current branch is safe (not target for deletion)
- Create backup references if requested
### Protected Branches
Never delete branches matching these patterns:
- `main`, `master`, `develop`, `staging`, `production`
- `release/*` (unless explicitly confirmed)
- Current working branch
- Branches with unpushed commits (unless forced)
### Recovery Information
- Display git reflog references for deleted branches
- Provide commands to recover accidentally deleted branches
- Show SHA hashes for branch tips before deletion
- Create recovery script if multiple branches deleted
## Branch Organization Features
### Naming Convention Enforcement
- Suggest renaming branches to follow team conventions
- Organize branches by type (feature/, bugfix/, hotfix/)
- Identify branches that don't follow naming patterns
- Provide batch renaming suggestions
### Branch Tracking Setup
- Set up proper upstream tracking for feature branches
- Configure push/pull behavior for new branches
- Identify branches missing upstream configuration
- Fix broken tracking relationships
## Output and Reporting
### Cleanup Summary
```
Branch Cleanup Summary:
✅ Deleted 3 merged feature branches
✅ Removed 5 stale remote references
✅ Cleaned up 2 old hotfix branches
⚠️ Found 1 unmerged branch requiring attention
📊 Repository now has 8 active branches (was 18)
```
### Recovery Instructions
```
Branch Recovery Commands:
git checkout -b feature/user-auth 1a2b3c4d # Recover feature/user-auth
git push origin feature/user-auth # Restore to remote
```
## Best Practices
### Regular Maintenance Schedule
- Run cleanup weekly for active repositories
- Use `--dry-run` first to review changes
- Coordinate with team before major cleanups
- Document any non-standard branches to preserve
### Team Coordination
- Communicate branch deletion plans with team
- Check if anyone has work-in-progress on old branches
- Use GitHub/GitLab branch protection rules
- Maintain shared documentation of branch policies
### Branch Lifecycle Management
- Delete feature branches immediately after merge
- Keep release branches until next major release
- Archive long-term experimental branches
- Use tags to mark important branch states before deletion
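A short sketch of the tag-then-delete step mentioned above, using an `archive/` tag prefix as an illustrative convention:
```bash
BRANCH="feature/legacy-export"                   # long-running branch to archive (illustrative)
git tag "archive/${BRANCH#feature/}" "$BRANCH"   # preserve the branch tip as a tag
git push origin "archive/${BRANCH#feature/}"     # keep the archive tag on the remote
git branch -D "$BRANCH"                          # delete the local branch
git push origin --delete "$BRANCH"               # delete the remote branch
```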
## Example Usage
```bash
# Safe interactive cleanup
/branch-cleanup
# See what would be cleaned without changes
/branch-cleanup --dry-run
# Clean only remote tracking branches
/branch-cleanup --remote-only
# Force cleanup of merged branches
/branch-cleanup --force
# Clean only local branches
/branch-cleanup --local-only
```
## Integration with GitHub/GitLab
If GitHub CLI or GitLab CLI is available:
- Check PR status before deleting branches
- Verify branches are actually merged in web interface
- Clean up both local and remote branches consistently
- Update branch protection rules if needed

commands/code-review.md (new executable file, 70 lines)

---
allowed-tools: Read, Bash, Grep, Glob
argument-hint: [file-path] | [commit-hash] | --full
description: Comprehensive code quality review with security, performance, and architecture analysis
---
# Code Quality Review
Perform comprehensive code quality review: $ARGUMENTS
## Current State
- Git status: !`git status --porcelain`
- Recent changes: !`git diff --stat HEAD~5`
- Repository info: !`git log --oneline -5`
- Build status: !`npm run build --dry-run 2>/dev/null || echo "No build script"`
## Task
Follow these steps to conduct a thorough code review:
1. **Repository Analysis**
- Examine the repository structure and identify the primary language/framework
- Check for configuration files (package.json, requirements.txt, Cargo.toml, etc.)
- Review README and documentation for context
2. **Code Quality Assessment**
- Scan for code smells, anti-patterns, and potential bugs
- Check for consistent coding style and naming conventions
- Identify unused imports, variables, or dead code
- Review error handling and logging practices
3. **Security Review**
- Look for common security vulnerabilities (SQL injection, XSS, etc.)
- Check for hardcoded secrets, API keys, or passwords (a quick grep sketch is included at the end of this command file)
- Review authentication and authorization logic
- Examine input validation and sanitization
4. **Performance Analysis**
- Identify potential performance bottlenecks
- Check for inefficient algorithms or database queries
- Review memory usage patterns and potential leaks
- Analyze bundle size and optimization opportunities
5. **Architecture & Design**
- Evaluate code organization and separation of concerns
- Check for proper abstraction and modularity
- Review dependency management and coupling
- Assess scalability and maintainability
6. **Testing Coverage**
- Check existing test coverage and quality
- Identify areas lacking proper testing
- Review test structure and organization
- Suggest additional test scenarios
7. **Documentation Review**
- Evaluate code comments and inline documentation
- Check API documentation completeness
- Review README and setup instructions
- Identify areas needing better documentation
8. **Recommendations**
- Prioritize issues by severity (critical, high, medium, low)
- Provide specific, actionable recommendations
- Suggest tools and practices for improvement
- Create a summary report with next steps
Remember to be constructive and provide specific examples with file paths and line numbers where applicable.
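As flagged in the security step above, a quick, non-exhaustive grep pass for hardcoded secrets might look like this (the patterns and file globs are illustrative only):
```bash
# Flag likely hardcoded credentials; review matches manually and expect false positives.
grep -rniE "(api[_-]?key|secret|password|token)[[:space:]]*[:=]" \
  --include="*.py" --include="*.js" --include="*.ts" \
  --exclude-dir=node_modules --exclude-dir=.git .
```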

commands/create-feature.md (new executable file, 130 lines)

---
allowed-tools: Read, Write, Edit, Bash
argument-hint: [feature-name] | [feature-type] [name]
description: Scaffold new feature with boilerplate code, tests, and documentation
---
# Create Feature
Scaffold new feature: $ARGUMENTS
## Current Project Context
- Project structure: !`find . -maxdepth 2 -type d \( -name src -o -name components -o -name features \) | head -5`
- Current branch: !`git branch --show-current`
- Package info: @package.json or @Cargo.toml or @requirements.txt (if exists)
- Architecture docs: @docs/architecture.md or @README.md (if exists)
## Task
Follow this systematic approach to create a new feature: $ARGUMENTS
1. **Feature Planning**
- Define the feature requirements and acceptance criteria
- Break down the feature into smaller, manageable tasks
- Identify affected components and potential impact areas
- Plan the API/interface design before implementation
2. **Research and Analysis**
- Study existing codebase patterns and conventions
- Identify similar features for consistency
- Research external dependencies or libraries needed
- Review any relevant documentation or specifications
3. **Architecture Design**
- Design the feature architecture and data flow
- Plan database schema changes if needed
- Define API endpoints and contracts
- Consider scalability and performance implications
4. **Environment Setup**
- Create a new feature branch: `git checkout -b feature/$ARGUMENTS`
- Ensure development environment is up to date
- Install any new dependencies required
- Set up feature flags if applicable
5. **Implementation Strategy**
- Start with core functionality and build incrementally
- Follow the project's coding standards and patterns
- Implement proper error handling and validation
- Use dependency injection and maintain loose coupling
6. **Database Changes (if applicable)**
- Create migration scripts for schema changes
- Ensure backward compatibility
- Plan for rollback scenarios
- Test migrations on sample data
7. **API Development**
- Implement API endpoints with proper HTTP status codes
- Add request/response validation
- Implement proper authentication and authorization
- Document API contracts and examples
8. **Frontend Implementation (if applicable)**
- Create reusable components following project patterns
- Implement responsive design and accessibility
- Add proper state management
- Handle loading and error states
9. **Testing Implementation**
- Write unit tests for core business logic
- Create integration tests for API endpoints
- Add end-to-end tests for user workflows
- Test error scenarios and edge cases
10. **Security Considerations**
- Implement proper input validation and sanitization
- Add authorization checks for sensitive operations
- Review for common security vulnerabilities
- Ensure data protection and privacy compliance
11. **Performance Optimization**
- Optimize database queries and indexes
- Implement caching where appropriate
- Monitor memory usage and optimize algorithms
- Consider lazy loading and pagination
12. **Documentation**
- Add inline code documentation and comments
- Update API documentation
- Create user documentation if needed
- Update project README if applicable
13. **Code Review Preparation**
- Run all tests and ensure they pass
- Run linting and formatting tools
- Check for code coverage and quality metrics
- Perform self-review of the changes
14. **Integration Testing**
- Test feature integration with existing functionality
- Verify feature flags work correctly
- Test deployment and rollback procedures
- Validate monitoring and logging
15. **Commit and Push**
- Create atomic commits with descriptive messages
- Follow conventional commit format if project uses it
- Push feature branch: `git push origin feature/$ARGUMENTS`
16. **Pull Request Creation**
- Create PR with comprehensive description
- Include screenshots or demos if applicable
- Add appropriate labels and reviewers
- Link to any related issues or specifications
17. **Quality Assurance**
- Coordinate with QA team for testing
- Address any bugs or issues found
- Verify accessibility and usability requirements
- Test on different environments and browsers
18. **Deployment Planning**
- Plan feature rollout strategy
- Set up monitoring and alerting
- Prepare rollback procedures
- Schedule deployment and communication
Remember to maintain code quality, follow project conventions, and prioritize user experience throughout the development process.

commands/create-pr.md (new executable file, 19 lines)

# Create Pull Request Command
Create a new branch, commit changes, and submit a pull request.
## Behavior
- Creates a new branch based on current changes
- Formats modified files using Biome
- Analyzes changes and automatically splits into logical commits when appropriate
- Each commit focuses on a single logical change or feature
- Creates descriptive commit messages for each logical unit
- Pushes branch to remote
- Creates pull request with proper summary and test plan
## Guidelines for Automatic Commit Splitting
- Split commits by feature, component, or concern
- Keep related file changes together in the same commit
- Separate refactoring from feature additions
- Ensure each commit can be understood independently
- Multiple unrelated changes should be split into separate commits
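A minimal sketch of what the resulting split can look like once the changes have been grouped (the file paths and messages are illustrative):
```bash
# Commit 1: refactoring only, kept separate from new behaviour.
git add src/utils/formatDate.ts
git commit -m "refactor(utils): extract date formatting helper"
# Commit 2: the feature that builds on the refactor, with its tests.
git add src/reports/summary.ts src/reports/summary.test.ts
git commit -m "feat(reports): add weekly summary export"
# Push the branch and open the pull request.
git push -u origin "$(git branch --show-current)"
```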

commands/create-prd.md (new executable file, 36 lines)

---
allowed-tools: Read, Write, Edit, Grep, Glob
argument-hint: [feature-name] | --template | --interactive
description: Create Product Requirements Document (PRD) for new features
model: sonnet
---
# Create Product Requirements Document
You are an experienced Product Manager. Create a Product Requirements Document (PRD) for a feature we are adding to the product: **$ARGUMENTS**
**IMPORTANT:**
- Focus on the feature and user needs, not technical implementation
- Do not include any time estimates
## Product Context
1. **Product Documentation**: @product-development/resources/product.md (to understand the product)
2. **Feature Documentation**: @product-development/current-feature/feature.md (to understand the feature idea)
3. **JTBD Documentation**: @product-development/current-feature/JTBD.md (to understand the Jobs to be Done)
## Task
Create a comprehensive PRD document that captures the what, why, and how of the product:
1. Use the PRD template from `@product-development/resources/PRD-template.md`
2. Based on the feature documentation, create a PRD that defines:
- Problem statement and user needs
- Feature specifications and scope
- Success metrics and acceptance criteria
- User experience requirements
- Technical considerations (high-level only)
3. Output the completed PRD to `product-development/current-feature/PRD.md`
Focus on creating a comprehensive PRD that clearly defines the feature requirements while maintaining alignment with user needs and business objectives.

commands/create-pull-request.md (new executable file, 126 lines)

# How to Create a Pull Request Using GitHub CLI
This guide explains how to create pull requests using GitHub CLI in our project.
## Prerequisites
1. Install GitHub CLI if you haven't already:
```bash
# macOS
brew install gh
# Windows
winget install --id GitHub.cli
# Linux
# Follow instructions at https://github.com/cli/cli/blob/trunk/docs/install_linux.md
```
2. Authenticate with GitHub:
```bash
gh auth login
```
## Creating a New Pull Request
1. First, prepare your PR description following the template in `.github/pull_request_template.md`
2. Use the `gh pr create` command to create a new pull request:
```bash
# Basic command structure
gh pr create --title "✨(scope): Your descriptive title" --body "Your PR description" --base main --draft
```
For more complex PR descriptions with proper formatting, use the `--body-file` option with the exact PR template structure:
```bash
# Create PR with proper template structure
gh pr create --title "✨(scope): Your descriptive title" --body-file <(echo -e "## Issue\n\n- resolve:\n\n## Why is this change needed?\nYour description here.\n\n## What would you like reviewers to focus on?\n- Point 1\n- Point 2\n\n## Testing Verification\nHow you tested these changes.\n\n## What was done\npr_agent:summary\n\n## Detailed Changes\npr_agent:walkthrough\n\n## Additional Notes\nAny additional notes.") --base main --draft
```
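If the process-substitution form above gets hard to read, one equivalent alternative is to write the body to a temporary file with a heredoc first and pass that to `--body-file`:
```bash
cat > /tmp/pr-body.md <<'EOF'
## Issue

- resolve:

## Why is this change needed?
Your description here.

## What would you like reviewers to focus on?
- Point 1
- Point 2

## Testing Verification
How you tested these changes.

## What was done
pr_agent:summary

## Detailed Changes
pr_agent:walkthrough

## Additional Notes
Any additional notes.
EOF
gh pr create --title "✨(scope): Your descriptive title" --body-file /tmp/pr-body.md --base main --draft
```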
## Best Practices
1. **PR Title Format**: Use conventional commit format with emojis
- Always include an appropriate emoji at the beginning of the title
- Use the actual emoji character (not the code representation like `:sparkles:`)
- Examples:
- `✨(supabase): Add staging remote configuration`
- `🐛(auth): Fix login redirect issue`
- `📝(readme): Update installation instructions`
2. **Description Template**: Always use our PR template structure from `.github/pull_request_template.md`:
- Issue reference
- Why the change is needed
- Review focus points
- Testing verification
- PR-Agent sections (keep `pr_agent:summary` and `pr_agent:walkthrough` tags intact)
- Additional notes
3. **Template Accuracy**: Ensure your PR description precisely follows the template structure:
- Don't modify or rename the PR-Agent sections (`pr_agent:summary` and `pr_agent:walkthrough`)
- Keep all section headers exactly as they appear in the template
- Don't add custom sections that aren't in the template
4. **Draft PRs**: Start as draft when the work is in progress
- Use `--draft` flag in the command
- Convert to ready for review when complete using `gh pr ready`
### Common Mistakes to Avoid
1. **Incorrect Section Headers**: Always use the exact section headers from the template
2. **Modifying PR-Agent Sections**: Don't remove or modify the `pr_agent:summary` and `pr_agent:walkthrough` placeholders
3. **Adding Custom Sections**: Stick to the sections defined in the template
4. **Using Outdated Templates**: Always refer to the current `.github/pull_request_template.md` file
### Missing Sections
Always include all template sections, even if some are marked as "N/A" or "None".
## Additional GitHub CLI PR Commands
Here are some additional useful GitHub CLI commands for managing PRs:
```bash
# List your open pull requests
gh pr list --author "@me"
# Check PR status
gh pr status
# View a specific PR
gh pr view <PR-NUMBER>
# Check out a PR branch locally
gh pr checkout <PR-NUMBER>
# Convert a draft PR to ready for review
gh pr ready <PR-NUMBER>
# Add reviewers to a PR
gh pr edit <PR-NUMBER> --add-reviewer username1,username2
# Merge a PR
gh pr merge <PR-NUMBER> --squash
```
## Using Templates for PR Creation
To simplify PR creation with consistent descriptions, you can create a template file:
1. Create a file named `pr-template.md` with your PR template
2. Use it when creating PRs:
```bash
gh pr create --title "feat(scope): Your title" --body-file pr-template.md --base main --draft
```
## Related Documentation
- [PR Template](.github/pull_request_template.md)
- [Conventional Commits](https://www.conventionalcommits.org/)
- [GitHub CLI documentation](https://cli.github.com/manual/)

commands/describe.md (new executable file, 197 lines)

---
allowed-tools: Read, mcp__mcp-server-motherduck__query, Grep, Glob, Bash
argument-hint: [file-path] (optional - defaults to currently open file)
description: Add comprehensive descriptive comments to code files, focusing on data flow, joining logic, and business context
---
# Add Descriptive Comments to Code
Add detailed, descriptive comments to the selected file: $ARGUMENTS
## Current Context
- Currently open file: !`echo $CLAUDE_OPEN_FILE`
- File layer detection: !`basename $(dirname $CLAUDE_OPEN_FILE) 2>/dev/null || echo "unknown"`
- Git status: !`git status --porcelain $CLAUDE_OPEN_FILE 2>/dev/null || echo "Not in git"`
## Task
You will add comprehensive descriptive comments to the **currently open file** (or the file specified in $ARGUMENTS if provided).
### Instructions
1. **Determine Target File**
- If $ARGUMENTS contains a file path, use that file
- Otherwise, use the currently open file from the IDE
- Verify the file exists and is readable
2. **Analyze File Context**
- Identify the file type (silver/gold layer transformation, utility, pipeline operation)
- Read and understand the complete file structure
- Identify the ETL pattern (extract, transform, load methods)
- Map out all DataFrame operations and transformations
3. **Analyze Data Sources and Schemas**
- Use DuckDB MCP to query relevant source tables if available:
```sql
-- Example: Check schema of source table
DESCRIBE table_name;
SELECT * FROM table_name LIMIT 5;
```
- Reference `.claude/memory/data_dictionary/` for column definitions and business context
- Identify all source tables being read (bronze/silver layer)
- Document the schema of input and output DataFrames
4. **Document Joining Logic (Priority Focus)**
- For each join operation, add comments explaining:
- **WHY** the join is happening (business reason)
- **WHAT** tables are being joined
- **JOIN TYPE** (left, inner, outer) and why that type was chosen
- **JOIN KEYS** and their meaning
- **EXPECTED CARDINALITY** (1:1, 1:many, many:many)
- **NULL HANDLING** strategy for unmatched records
Example format:
```python
# JOIN: Link incidents to persons involved
# Type: LEFT JOIN (preserve all incidents even if person data missing)
# Keys: incident_id (unique identifier from FVMS system)
# Expected: 1:many (one incident can have multiple persons)
# Nulls: Person details will be NULL for incidents with no associated persons
joined_df = incident_df.join(person_df, on="incident_id", how="left")
```
5. **Document Transformations Step-by-Step**
- Add inline comments explaining each transformation
- Describe column derivations and calculations
- Explain business rules being applied
- Document any data quality fixes or cleansing
- Note any deduplication logic
6. **Document Data Quality Patterns**
- Explain null handling strategies
- Document default values and their business meaning
- Describe validation rules
- Note any data type conversions
7. **Add Function/Method Documentation**
- Add docstring-style comments at the start of each method explaining:
- Purpose of the method
- Input: Source tables and their schemas
- Output: Resulting table and schema
- Business logic summary
Example format:
```python
def transform(self) -> DataFrame:
"""
Transform incident data with person and location enrichment.
Input: bronze_fvms.b_fvms_incident (raw incident records)
Output: silver_fvms.s_fvms_incident (validated, enriched incidents)
Transformations:
1. Join with person table to add demographic details
2. Join with address table to add location coordinates
3. Apply business rules for incident classification
4. Deduplicate based on incident_id and date_created
5. Add row hash for change detection
Business Context:
- Incidents represent family violence events recorded in FVMS
- Each incident may involve multiple persons (victims, offenders)
- Location data enables geographic analysis and reporting
"""
```
8. **Add Header Comments**
- Add a comprehensive header at the top of the file explaining:
- File purpose and business context
- Source systems and tables
- Target table and database
- Key transformations and business rules
- Dependencies on other tables or processes
9. **Variable Naming Context**
- When variable names are abbreviated or unclear, add comments explaining:
- What the variable represents
- The business meaning of the data
- Expected data types and formats
- Reference data dictionary entries if available
10. **Use Data Dictionary References**
- Check `.claude/memory/data_dictionary/` for column definitions
- Reference these definitions in comments to explain field meanings
- Link business terminology to technical column names
- Example: `# offence_code: Maps to ANZSOC classification system (see data_dict/cms_offence_codes.md)`
11. **Query DuckDB for Context (When Available)**
- Use MCP DuckDB tool to inspect actual data patterns:
- Check distinct values: `SELECT DISTINCT column_name FROM table LIMIT 20;`
- Verify join relationships: `SELECT COUNT(*) FROM table1 JOIN table2 ...`
- Understand data distributions: `SELECT column, COUNT(*) FROM table GROUP BY column;`
- Use insights from queries to write more accurate comments
12. **Preserve Code Formatting Standards**
- Do NOT add blank lines inside functions (project standard)
- Maximum line length: 240 characters
- Maintain existing indentation
- Keep comments concise but informative
- Use inline comments for single-line explanations
- Use block comments for multi-step processes
13. **Focus Areas by File Type**
**Silver Layer Files (`python_files/silver/`):**
- Document source bronze tables
- Explain validation rules
- Describe enumeration mappings
- Note data cleansing operations
**Gold Layer Files (`python_files/gold/`):**
- Document all source silver tables
- Explain aggregation logic
- Describe business metrics calculations
- Note analytical transformations
**Utility Files (`python_files/utilities/`):**
- Explain helper function purposes
- Document parameter meanings
- Describe return values
- Note edge cases handled
14. **Comment Quality Guidelines**
- Comments should explain **WHY**, not just **WHAT**
- Avoid obvious comments (e.g., don't say "create dataframe" for `df = spark.createDataFrame()`)
- Focus on business context and data relationships
- Use proper grammar and complete sentences
- Be concise but thorough
- Think like a new developer reading the code for the first time
15. **Final Validation**
- Run syntax check: `python3 -m py_compile <file>`
- Run linting: `ruff check <file>`
- Format code: `ruff format <file>`
- Ensure all comments are accurate and helpful
## Example Output Structure
After adding comments, the file should have:
- ✅ Comprehensive header explaining file purpose
- ✅ Method-level documentation for extract/transform/load
- ✅ Detailed join operation comments (business reason, type, keys, cardinality)
- ✅ Step-by-step transformation explanations
- ✅ Data quality and validation logic documented
- ✅ Variable context for unclear names
- ✅ References to data dictionary where applicable
- ✅ Business context linking technical operations to real-world meaning
## Important Notes
- **ALWAYS** use Australian English spelling conventions throughout the comments and documentation
- **DO NOT** remove or modify existing functionality
- **DO NOT** change code structure or logic
- **ONLY** add descriptive comments
- **PRESERVE** all existing comments
- **MAINTAIN** project coding standards (no blank lines in functions, 240 char max)
- **USE** the data dictionary and DuckDB queries to provide accurate context
- **THINK** about the user who will read this code - walk them through the logic clearly

commands/dev-agent.md (new executable file, 88 lines)

# PySpark Azure Synapse Expert Agent
## Overview
Expert data engineer specializing in PySpark development within Azure Synapse Analytics environment. Focuses on scalable data processing, optimization, and enterprise-grade solutions.
## Core Competencies
### PySpark Expertise
- Advanced DataFrame/Dataset operations
- Performance optimization and tuning
- Custom UDFs and aggregations
- Spark SQL query optimization
- Memory management and partitioning strategies
### Azure Synapse Mastery
- Synapse Spark pools configuration
- Integration with Azure Data Lake Storage
- Synapse Pipelines orchestration
- Serverless SQL pools interaction
### Data Engineering Skills
- ETL/ELT pipeline design
- Data quality and validation frameworks
## Technical Stack
### Languages & Frameworks
- **Primary**: Python, PySpark
- **Secondary**: SQL, PowerShell
- **Libraries**: pandas, numpy, pytest
### Azure Services
- Azure Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure Key Vault
### Tools & Platforms
- Git/Azure DevOps
- Jupyter/Synapse Notebooks
## Responsibilities
### Development
- Design optimized PySpark jobs for large-scale data processing
- Implement data transformation logic with performance considerations
- Create reusable libraries and frameworks
- Build automated testing suites for data pipelines
### Optimization
- Analyze and tune Spark job performance
- Optimize cluster configurations and resource allocation
- Implement caching strategies and data skew handling
- Monitor and troubleshoot production workloads
### Architecture
- Design scalable data lake architectures
- Establish data partitioning and storage strategies
- Define data governance and security protocols
- Create disaster recovery and backup procedures
## Best Practices
**CRITICAL**: Read `.claude/CLAUDE.md` for best practices
### Performance
- Leverage broadcast joins and bucketing
- Optimize shuffle operations and partition sizes
- Use appropriate file formats (Parquet, Delta)
- Implement incremental processing patterns
### Security
- Implement row-level and column-level security
- Use managed identities and service principals
- Encrypt data at rest and in transit
- Follow least privilege access principles
## Communication Style
- Provides technical solutions with clear performance implications
- Focuses on scalable, production-ready implementations
- Emphasizes best practices and enterprise patterns
- Delivers concise explanations with practical examples
## Key Metrics
- Pipeline execution time and resource utilization
- Data quality scores and SLA compliance
- Cost optimization and resource efficiency
- System reliability and uptime statistics

commands/explain-code.md (new executable file, 194 lines)

# Analyze and Explain Code Functionality
## Instructions
Follow this systematic approach to explain code: **$ARGUMENTS**
1. **Code Context Analysis**
- Identify the programming language and framework
- Understand the broader context and purpose of the code
- Identify the file location and its role in the project
- Review related imports, dependencies, and configurations
2. **High-Level Overview**
- Provide a summary of what the code does
- Explain the main purpose and functionality
- Identify the problem the code is solving
- Describe how it fits into the larger system
3. **Code Structure Breakdown**
- Break down the code into logical sections
- Identify classes, functions, and methods
- Explain the overall architecture and design patterns
- Map out data flow and control flow
4. **Line-by-Line Analysis**
- Explain complex or non-obvious lines of code
- Describe variable declarations and their purposes
- Explain function calls and their parameters
- Clarify conditional logic and loops
5. **Algorithm and Logic Explanation**
- Describe the algorithm or approach being used
- Explain the logic behind complex calculations
- Break down nested conditions and loops
- Clarify recursive or asynchronous operations
6. **Data Structures and Types**
- Explain data types and structures being used
- Describe how data is transformed or processed
- Explain object relationships and hierarchies
- Clarify input and output formats
7. **Framework and Library Usage**
- Explain framework-specific patterns and conventions
- Describe library functions and their purposes
- Explain API calls and their expected responses
- Clarify configuration and setup code
8. **Error Handling and Edge Cases**
- Explain error handling mechanisms
- Describe exception handling and recovery
- Identify edge cases being handled
- Explain validation and defensive programming
9. **Performance Considerations**
- Identify performance-critical sections
- Explain optimization techniques being used
- Describe complexity and scalability implications
- Point out potential bottlenecks or inefficiencies
10. **Security Implications**
- Identify security-related code sections
- Explain authentication and authorization logic
- Describe input validation and sanitization
- Point out potential security vulnerabilities
11. **Testing and Debugging**
- Explain how the code can be tested
- Identify debugging points and logging
- Describe mock data or test scenarios
- Explain test helpers and utilities
12. **Dependencies and Integrations**
- Explain external service integrations
- Describe database operations and queries
- Explain API interactions and protocols
- Clarify third-party library usage
**Explanation Format Examples:**
**For Complex Algorithms:**
```
This function implements a depth-first search algorithm:
1. Line 1-3: Initialize a stack with the starting node and a visited set
2. Line 4-8: Main loop - continue until stack is empty
3. Line 9-11: Pop a node and check if it's the target
4. Line 12-15: Add unvisited neighbors to the stack
5. Line 16: Return null if target not found
Time Complexity: O(V + E) where V is vertices and E is edges
Space Complexity: O(V) for the visited set and stack
```
**For API Integration Code:**
```
This code handles user authentication with a third-party service:
1. Extract credentials from request headers
2. Validate credential format and required fields
3. Make API call to authentication service
4. Handle response and extract user data
5. Create session token and set cookies
6. Return user profile or error response
Error Handling: Catches network errors, invalid credentials, and service unavailability
Security: Uses HTTPS, validates inputs, and sanitizes responses
```
**For Database Operations:**
```
This function performs a complex database query with joins:
1. Build base query with primary table
2. Add LEFT JOIN for related user data
3. Apply WHERE conditions for filtering
4. Add ORDER BY for consistent sorting
5. Implement pagination with LIMIT/OFFSET
6. Execute query and handle potential errors
7. Transform raw results into domain objects
Performance Notes: Uses indexes on filtered columns, implements connection pooling
```
13. **Common Patterns and Idioms**
- Identify language-specific patterns and idioms
- Explain design patterns being implemented
- Describe architectural patterns in use
- Clarify naming conventions and code style
14. **Potential Improvements**
- Suggest code improvements and optimizations
- Identify possible refactoring opportunities
- Point out maintainability concerns
- Recommend best practices and standards
15. **Related Code and Context**
- Reference related functions and classes
- Explain how this code interacts with other components
- Describe the calling context and usage patterns
- Point to relevant documentation and resources
16. **Debugging and Troubleshooting**
- Explain how to debug issues in this code
- Identify common failure points
- Describe logging and monitoring approaches
- Suggest testing strategies
**Language-Specific Considerations:**
**JavaScript/TypeScript:**
- Explain async/await and Promise handling
- Describe closure and scope behavior
- Clarify this binding and arrow functions
- Explain event handling and callbacks
**Python:**
- Explain list comprehensions and generators
- Describe decorator usage and purpose
- Clarify context managers and with statements
- Explain class inheritance and method resolution
**Java:**
- Explain generics and type parameters
- Describe annotation usage and processing
- Clarify stream operations and lambda expressions
- Explain exception hierarchy and handling
**C#:**
- Explain LINQ queries and expressions
- Describe async/await and Task handling
- Clarify delegate and event usage
- Explain nullable reference types
**Go:**
- Explain goroutines and channel usage
- Describe interface implementation
- Clarify error handling patterns
- Explain package structure and imports
**Rust:**
- Explain ownership and borrowing
- Describe lifetime annotations
- Clarify pattern matching and Option/Result types
- Explain trait implementations
Remember to:
- Use clear, non-technical language when possible
- Provide examples and analogies for complex concepts
- Structure explanations logically from high-level to detailed
- Include visual diagrams or flowcharts when helpful
- Tailor the explanation level to the intended audience

commands/local-commit.md (new executable file, 361 lines)

---
allowed-tools: Bash(git add:*), Bash(git status:*), Bash(git commit:*), Bash(git diff:*), Bash(git log:*), Bash(git push:*), Bash(git pull:*), Bash(git branch:*), mcp__ado__repo_list_branches_by_repo, mcp__ado__repo_search_commits, mcp__ado__repo_create_pull_request, mcp__ado__repo_get_pull_request_by_id, mcp__ado__repo_get_repo_by_name_or_id, mcp__ado__wit_add_work_item_comment, mcp__ado__wit_get_work_item
argument-hint: [message] | --no-verify | --amend | --pr-s | --pr-d | --pr-m
description: Create well-formatted commits with conventional commit format and emoji, integrated with Azure DevOps
---
# Smart Git Commit with Azure DevOps Integration
Create well-formatted commit: $ARGUMENTS
## Repository Configuration
- **Project**: Program Unify
- **Repository ID**: e030ea00-2f85-4b19-88c3-05a864d7298d
- **Repository Name**: unify_2_1_dm_synapse_env_d10
- **Branch Structure**: `feature/* → staging → develop → main`
- **Main Branch**: main
## Implementation Logic for Claude
When processing this command, Claude should:
1. **Detect Repository**: Check if current repo is `unify_2_1_dm_synapse_env_d10`
- Use `git remote -v` or check current directory path
- Can also use `mcp__ado__repo_get_repo_by_name_or_id` to verify
2. **Parse Arguments**: Extract flags from `$ARGUMENTS`
- **PR Flags**:
- `--pr-s`: Set target = `staging`
- `--pr-d`: Set target = `develop`
- `--pr-m`: Set target = `main`
- `--pr` (no suffix): ERROR if unify_2_1_dm_synapse_env_d10, else target = `develop`
3. **Validate Current Branch** (if PR flag provided):
- Get current branch: `git branch --show-current`
- For `--pr-s`: Require `feature/*` branch (reject `staging`, `develop`, `main`)
- For `--pr-d`: Require `staging` branch exactly
- For `--pr-m`: Require `develop` branch exactly
- If validation fails: Show a clear error and exit (see the bash sketch at the end of this section)
4. **Execute Commit Workflow**:
- Stage changes (`git add .` )
- Create commit with emoji conventional format
- Run pre-commit hooks (unless `--no-verify`)
- Push to current branch
5. **Create Pull Request** (if PR flag):
- Call `mcp__ado__repo_create_pull_request` with:
- `repository_id`: e030ea00-2f85-4b19-88c3-05a864d7298d
- `source_branch`: Current branch from step 3
- `target_branch`: Target from step 2
- `title`: Extract from commit message
- `description`: Generate with summary and test plan
- Return PR URL to user
6. **Add Work Item Comments Automatically** (if PR was created in step 5):
- **Condition Check**: Only execute if:
- A PR was created in step 5 (any `--pr-*` flag was used)
- PR creation was successful and returned a PR ID
- **Get Work Items from PR**:
- Use `mcp__ado__repo_get_pull_request_by_id` with:
- `repositoryId`: e030ea00-2f85-4b19-88c3-05a864d7298d
- `pullRequestId`: PR ID from step 5
- `includeWorkItemRefs`: true
- Extract work item IDs from the PR response
- If no work items found, log info message and skip to next step
- **Add Comments to Each Work Item**:
- For each work item ID extracted from PR:
- Use `mcp__ado__wit_get_work_item` to verify work item exists
- Generate comment with:
- PR title and number
- Commit message and SHA
- File changes summary from `git diff --stat`
- Link to PR in Azure DevOps
- Link to commit in Azure DevOps
- **IMPORTANT**: Do NOT include any footer text like "Automatically added by /local-commit command" or similar attribution
- Call `mcp__ado__wit_add_work_item_comment` with:
- `project`: "Program Unify"
- `workItemId`: Current work item ID
- `comment`: Generated comment with HTML formatting
- `format`: "html"
- Log success/failure for each work item
- If ANY work item fails, warn but don't fail the commit
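A bash sketch of the branch-validation rules from step 3 (argument parsing is simplified and illustrative):
```bash
FLAG="$1"                                  # e.g. --pr-s, --pr-d, --pr-m
CURRENT="$(git branch --show-current)"
case "$FLAG" in
  --pr-s)
    [[ "$CURRENT" == feature/* ]] || { echo "ERROR: --pr-s requires a feature/* branch"; exit 1; }
    TARGET="staging" ;;
  --pr-d)
    [[ "$CURRENT" == "staging" ]] || { echo "ERROR: --pr-d must be run from staging"; exit 1; }
    TARGET="develop" ;;
  --pr-m)
    [[ "$CURRENT" == "develop" ]] || { echo "ERROR: --pr-m must be run from develop"; exit 1; }
    TARGET="main" ;;
  --pr)
    echo "ERROR: use --pr-s, --pr-d, or --pr-m for this repository"; exit 1 ;;
esac
echo "PR will target: $CURRENT -> $TARGET"
```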
## Current Repository State
- Git status: !`git status --short`
- Current branch: !`git branch --show-current`
- Staged changes: !`git diff --cached --stat`
- Unstaged changes: !`git diff --stat`
- Recent commits: !`git log --oneline -5`
## What This Command Does
1. Analyzes current git status and changes
2. If no files staged, stages all modified files with `git add`
3. Reviews changes with `git diff`
4. Analyzes for multiple logical changes
5. For complex changes, suggests split commits
6. Creates commit with emoji conventional format
7. Automatically runs pre-commit hooks (ruff lint/format, trailing whitespace, etc.)
- Pre-commit may modify files (auto-fixes)
- If files are modified, they'll be re-staged automatically
- Use `--no-verify` to skip hooks in emergencies only
8. **NEW**: With PR flags, creates Azure DevOps pull request after push
- Uses `mcp__ado__repo_create_pull_request` to create PR
- Automatically links work items if commit message contains work item IDs
- **IMPORTANT Branch Flow Rules** (unify_2_1_dm_synapse_env_d10 ONLY):
- `--pr-s`: Feature branch → `staging` (standard feature PR)
- `--pr-d`: `staging` → `develop` (promote staging to develop)
- `--pr-m`: `develop` → `main` (promote develop to production)
- `--pr`: **NOT ALLOWED** - must specify `-s`, `-d`, or `-m` for this repository
- **For OTHER repositories**: `--pr` creates PR to `develop` branch (legacy behavior)
9. **NEW**: Automatically adds comments to linked work items after PR creation
- Retrieves work items linked to the PR using `mcp__ado__repo_get_pull_request_by_id`
- Automatically adds comment to each linked work item with:
- PR title and number
- Commit message and SHA
- Summary of file changes
- Direct link to PR in Azure DevOps
- Direct link to commit in Azure DevOps
- **IMPORTANT**: No footer attribution text (e.g., "Automatically added by /local-commit command")
- Validates work items exist before commenting
- Continues even if some work items fail (warns only)
## Commit Message Format
### Type + Emoji Mapping
- ✨ `feat`: New feature
- 🐛 `fix`: Bug fix
- 📝 `docs`: Documentation
- 💄 `style`: Formatting/style
- ♻️ `refactor`: Code refactoring
- ⚡️ `perf`: Performance improvements
- ✅ `test`: Tests
- 🔧 `chore`: Tooling, configuration
- 🚀 `ci`: CI/CD improvements
- ⏪️ `revert`: Reverting changes
- 🚨 `fix`: Compiler/linter warnings
- 🔒️ `fix`: Security issues
- 🩹 `fix`: Simple non-critical fix
- 🚑️ `fix`: Critical hotfix
- 🎨 `style`: Code structure/format
- 🔥 `fix`: Remove code/files
- 📦️ `chore`: Dependencies
- 🌱 `chore`: Seed files
- 🧑‍💻 `chore`: Developer experience
- 🏷️ `feat`: Types
- 💬 `feat`: Text/literals
- 🌐 `feat`: i18n/l10n
- 💡 `feat`: Business logic
- 📱 `feat`: Responsive design
- 🚸 `feat`: UX improvements
- ♿️ `feat`: Accessibility
- 🗃️ `db`: Database changes
- 🚩 `feat`: Feature flags
- ⚰️ `refactor`: Remove dead code
- 🦺 `feat`: Validation
## Commit Strategy
### Single Commit (Default)
```bash
git add .
git commit -m "✨ feat: implement user auth"
```
### Multiple Commits (Complex Changes)
```bash
# Stage and commit separately
git add src/auth.py
git commit -m "✨ feat: add authentication module"
git add tests/test_auth.py
git commit -m "✅ test: add auth unit tests"
git add docs/auth.md
git commit -m "📝 docs: document auth API"
# Push all commits
git push
```
## Pre-Commit Hooks
Your project uses pre-commit with:
- **Ruff**: Linting with auto-fix + formatting
- **Standard hooks**: Trailing whitespace, AST check, YAML/JSON/TOML validation
- **Security**: Private key detection
- **Quality**: Debug statement detection, merge conflict check
**Important**: Pre-commit hooks will auto-fix issues and may modify your files. The commit process will:
1. Run pre-commit hooks
2. If hooks modify files, automatically re-stage them
3. Complete the commit with all fixes applied
## Command Options
- `--no-verify`: Skip pre-commit checks (emergency use only)
- `--amend`: Amend previous commit
- **`--pr-s`**: Create PR to `staging` branch (feature → staging)
- **`--pr-d`**: Create PR to `develop` branch (staging → develop)
- **`--pr-m`**: Create PR to `main` branch (develop → main)
- `--pr`: Legacy flag for other repositories (creates PR to `develop`)
- **NOT ALLOWED** in unify_2_1_dm_synapse_env_d10 - must use `-s`, `-d`, or `-m`
- Default: Run all pre-commit hooks and create new commit
- **Automatic Work Item Comments**: When using any PR flag, work items linked to the PR will automatically receive comments with commit details (no footer attribution)
## Azure DevOps Integration Features
### Pull Request Workflow (PR Flags)
When using PR flags, the command will:
1. Commit changes locally
2. Push to remote branch
3. Validate repository and branch configuration:
- **THIS repo (unify_2_1_dm_synapse_env_d10)**: Requires explicit flag (`--pr-s`, `--pr-d`, or `--pr-m`)
- `--pr-s`: Current feature branch → `staging`
- `--pr-d`: Must be on `staging` branch → `develop`
- `--pr-m`: Must be on `develop` branch → `main`
- `--pr` alone: **ERROR** - must specify target
- **OTHER repos**: `--pr` creates PR to `develop` (all other flags ignored)
4. Use `mcp__ado__repo_create_pull_request` to create PR with:
- **Title**: Extracted from commit message
- **Description**: Full commit details with summary and test plan
- **Source Branch**: Current branch
- **Target Branch**: Determined by flag and repository
- **Work Items**: Auto-linked from commit message (e.g., "fixes #12345")
### Viewing Commit History
You can view commit history using:
- `mcp__ado__repo_search_commits` - Search commits by branch, author, date range
- Traditional `git log` - For local history
### Branch Management
- `mcp__ado__repo_list_branches_by_repo` - View all Azure DevOps branches
- `git branch` - View local branches
## Branch Validation Rules (unify_2_1_dm_synapse_env_d10)
Before creating a PR, the command validates:
### --pr-s (Feature → Staging)
- ✅ **ALLOWED**: Any `feature/*` branch
- ❌ **BLOCKED**: `staging`, `develop`, `main` branches
- **Target**: `staging`
### --pr-d (Staging → Develop)
- ✅ **ALLOWED**: Only `staging` branch
- ❌ **BLOCKED**: All other branches (including `feature/*`)
- **Target**: `develop`
### --pr-m (Develop → Main)
- ✅ **ALLOWED**: Only `develop` branch
- ❌ **BLOCKED**: All other branches (including `staging`, `feature/*`)
- **Target**: `main`
### --pr (Legacy - NOT ALLOWED)
- ❌ **BLOCKED**: All branches in unify_2_1_dm_synapse_env_d10
- 💡 **Error Message**: "Must use --pr-s, --pr-d, or --pr-m for this repository"
- ✅ **ALLOWED**: All other repositories (targets `develop`)
## Best Practices
1. **Let pre-commit work** - Don't use `--no-verify` unless absolutely necessary
2. **Atomic commits** - One logical change per commit
3. **Descriptive messages** - Emoji + type + clear description
4. **Review before commit** - Always check `git diff`
5. **Clean history** - Split complex changes into multiple commits
6. **Trust the hooks** - They maintain code quality automatically
7. **Use correct PR flag** - `--pr-s` for features, `--pr-d` for staging promotion, `--pr-m` for production
8. **Link work items** - Reference Azure DevOps work items in commit messages (e.g., "#43815") to enable automatic PR linking
9. **Validate branch** - Ensure you're on the correct branch before using `--pr-d` or `--pr-m`
10. **Work item linking** - Work items linked to PRs will automatically receive comments with commit details
11. **Keep stakeholders informed** - Use PR flags to ensure work items are automatically updated with progress
## Example Workflows
### Simple Commit
```bash
/commit "fix: resolve enum import error"
```
### Commit with Work Item
```bash
/commit "feat: add enum imports for Synapse environment"
```
### Commit and Create PR (Feature to Staging)
```bash
/commit --pr-s "feat: refactor commit command with ADO MCP integration"
```
This will:
1. Create commit locally
2. Push to current branch
3. Create PR: `feature/xyz → staging`
4. Link work items automatically if mentioned in commit message
### Promote Staging to Develop
```bash
# First checkout staging branch
git checkout staging
git pull origin staging
# Then commit and create PR
/commit --pr-d "release: promote staging changes to develop"
```
This will:
1. Create commit on `staging` branch
2. Push to `staging`
3. Create PR: `staging → develop`
### Promote Develop to Main (Production)
```bash
# First checkout develop branch
git checkout develop
git pull origin develop
# Then commit and create PR
/commit --pr-m "release: promote develop to production"
```
This will:
1. Create commit on `develop` branch
2. Push to `develop`
3. Create PR: `develop → main`
### Error: Using --pr without suffix
```bash
/commit --pr "feat: some feature"
```
**Result**: ERROR - unify_2_1_dm_synapse_env_d10 requires explicit PR target (`--pr-s`, `--pr-d`, or `--pr-m`)
### Feature PR with Automatic Work Item Comments
```bash
# On feature/xyz branch
/commit --pr-s "feat(user-auth): implement OAuth2 authentication #12345"
```
This will:
1. Create commit on feature branch
2. Push to feature branch
3. Create PR: `feature/xyz → staging`
4. Link work item #12345 to the PR
5. Automatically add comment to work item #12345 with:
- PR title and number
- Commit message and SHA
- File changes summary
- Link to PR in Azure DevOps
- Link to commit in Azure DevOps
- (No footer attribution text)
### Staging to Develop PR with Multiple Work Items
```bash
# On staging branch
/commit --pr-d "release: promote staging to develop - fixes #12345, #67890"
```
This will:
1. Create commit on `staging` branch
2. Push to `staging`
3. Create PR: `staging → develop`
4. Link work items #12345 and #67890 to the PR
5. Automatically add comments to both work items with PR and commit details (without footer attribution)
**Note**: Work items are automatically detected from commit message and linked to PR. Comments are added automatically to all linked work items without any footer text.
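For reference, the work item detection described above amounts to a simple pattern match on the commit message. A minimal sketch, assuming the `#<number>` convention shown in the examples:

```python
import re

def extract_work_items(commit_message: str) -> list[str]:
    # Finds every "#<digits>" reference, e.g. "#12345" and "#67890".
    return re.findall(r"#(\d+)", commit_message)

extract_work_items("release: promote staging to develop - fixes #12345, #67890")
# -> ['12345', '67890']
```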

125
commands/multi-agent.md Executable file
View File

@@ -0,0 +1,125 @@
---
description: Discuss multi-agent workflow strategy for a specific task
argument-hint: [task-description]
allowed-tools: Read, Task, TodoWrite
---
# Multi-Agent Workflow Discussion
Prepare to discuss how you will use a multi-agent workflow to ${ARGUMENTS}.
## Instructions
1. **Analyze the Task**: ${ARGUMENTS}
- Break down the complexity
- Identify parallelizable components
- Determine if multi-agent approach is optimal
2. **Evaluate Approach**:
- Should this use `/background` (single agent) or `/orchestrate` (multiple agents)?
- How many agents would be optimal?
- What are the dependencies between subtasks?
3. **Design Strategy**:
- Outline the orchestration plan
- Define agent responsibilities
- Specify communication format (JSON responses)
- Identify quality gates
4. **Provide Recommendations**:
- Best command to use (`/background` vs `/orchestrate`)
- Number of agents (if using orchestrate: 2-8)
- Subtask breakdown
- Estimated completion time
- Resource requirements
## Output Format
Provide a structured discussion covering:
1. **Task Analysis**: Understanding of "${ARGUMENTS}"
2. **Complexity Assessment**: Simple/Moderate/High
3. **Recommended Approach**: `/background` or `/orchestrate`
4. **Agent Breakdown** (if orchestrate):
- Agent 1: Responsibility
- Agent 2: Responsibility
- ...
5. **Dependencies**: None / Sequential / Hybrid
6. **Estimated Time**: X minutes
7. **Next Steps**: Commands to run
## Examples
### Example 1: Simple Task
```
User: /aa_command "optimize the g_x_mg_statsclasscount table"
Output:
Task Analysis: Single gold table optimization - focused work
Complexity: Moderate
Recommended: /background
Reasoning: Single file, sequential optimization steps
Estimated Time: 15 minutes
Next Step: /background "Optimize g_x_mg_statsclasscount.py for performance"
```
### Example 2: Complex Task
```
User: /aa_command "implement data quality framework across all layers"
Output:
Task Analysis: Cross-layer feature implementation
Complexity: High
Recommended: /orchestrate
Agent Breakdown:
- Agent 1: Design quality schema and base classes
- Agent 2: Implement bronze layer validators
- Agent 3: Implement silver layer validators
- Agent 4: Implement gold layer validators
- Agent 5: Create validation tests
- Agent 6: Update documentation
Dependencies: Agent 1 must complete before 2-4 (hybrid approach)
Estimated Time: 45-60 minutes
Next Step: Create task file and run /orchestrate data_quality_framework.md
```
### Example 3: Medium Complexity
```
User: /aa_command "fix all linting errors in the silver layer"
Output:
Task Analysis: Code quality fixes across multiple files
Complexity: Moderate-High
Recommended: /orchestrate (if >15 files) or /background (if <15 files)
Agent Breakdown:
- Agent 1: Fix linting in silver_cms files
- Agent 2: Fix linting in silver_fvms files
- Agent 3: Fix linting in silver_nicherms files
Dependencies: None (fully parallel)
Estimated Time: 20-30 minutes
Next Step: /orchestrate "Fix linting errors: silver_cms, silver_fvms, silver_nicherms in parallel"
```
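The examples above follow a rough rule of thumb that could be captured in a few lines. A sketch, assuming the 15-file threshold from Example 3 (the threshold itself is only a guideline, not a fixed rule of this command):

```python
def recommend_command(file_count: int, parallelisable: bool) -> str:
    # Many independent files favour /orchestrate; otherwise a single
    # /background agent is usually simpler and has less coordination overhead.
    if parallelisable and file_count >= 15:
        return "/orchestrate"
    return "/background"
```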
## Usage
```bash
# Discuss strategy for any task
/aa_command "optimize all gold tables for performance"
# Get recommendations for feature implementation
/aa_command "add monitoring and alerting to the pipeline"
# Plan refactoring work
/aa_command "refactor all ETL classes to use new base class pattern"
# Evaluate testing strategy
/aa_command "write comprehensive tests for the medallion architecture"
```
## Notes
- This command helps you plan before executing
- Use this to determine optimal agent strategy
- Creates a blueprint for `/background` or `/orchestrate` commands
- Considers parallelism, dependencies, and complexity
- Provides concrete next steps and command examples

54
commands/my-devops-tasks.md Executable file
View File

@@ -0,0 +1,54 @@
# ADO MCP Task Retrieval Prompt
Use the Azure DevOps MCP tools to retrieve all user stories, tasks, and bugs assigned to me that are currently in "New", "Active", "Committed", or "Backlog" states. Create a comprehensive markdown document with the following structure:
## Query Parameters
- **Assigned To**: @Me
- **Work Item Types**: User Story, Task, Bug
- **States**: New, Active, Committed, Backlog
- **Include**: All active iterations and backlog
## Required Output Format
```markdown
# My Active Work Items
## Summary
- **Total Items**: {count}
- **By Type**: {breakdown by work item type}
- **By State**: {breakdown by state}
- **Last Updated**: {current date}
## Work Items
### {Work Item Type} - {ID}: {Title}
**URL**: {URL to work item}
**Status**: {State} | **Priority**: {Priority} | **Effort**: {Story Points/Original Estimate}
**Iteration**: S{Iteration Path} | **Area**: {Area Path}
**Description Summary**:
{Provide a 2-3 sentence summary of the description/acceptance criteria}
**Key Details**:
- **Created**: {Created Date}
- **Tags**: {Tags if any}
- **Parent**: {Parent work item if applicable}
**[View in ADO]({URL to work item})**
---
```
## Specific Requirements
1. **Summarize Descriptions**: For each work item, provide a concise 2-3 sentence summary of the description and acceptance criteria, focusing on the core objective and deliverables.
2. **Clickable URLs**: Ensure all Azure DevOps URLs are properly formatted as clickable markdown links, including the link to the actual work item.
3. **Sort Order**: Sort by Priority (High to Low), then by State (Active, Committed, New, Backlog), then by Story Points/Effort (High to Low).
4. **Data Validation**: If any work items have missing key fields (Priority, Effort, etc.), note this in the output.
5. **Additional Context**: Include any relevant comments from the last 7 days if present.
Execute this query and generate the markdown document with all my currently assigned work items.
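For context, the query parameters above roughly correspond to the WIQL shown below. This is a sketch only; the exact query issued by the ADO MCP tools may differ, and the field names follow standard Azure DevOps conventions.

```python
# Approximate WIQL equivalent of the query parameters above (illustrative only).
WIQL_QUERY = """
SELECT [System.Id], [System.Title], [System.State], [System.WorkItemType]
FROM WorkItems
WHERE [System.AssignedTo] = @Me
  AND [System.WorkItemType] IN ('User Story', 'Task', 'Bug')
  AND [System.State] IN ('New', 'Active', 'Committed', 'Backlog')
ORDER BY [Microsoft.VSTS.Common.Priority] ASC, [System.State] ASC
"""
```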

510
commands/orchestrate.md Executable file
View File

@@ -0,0 +1,510 @@
---
description: Orchestrate multiple generic agents working in parallel on complex tasks
argument-hint: [user-prompt] | [task-file-name]
allowed-tools: Read, Task, TodoWrite
---
# Multi-Agent Orchestrator
Launch an orchestrator agent that coordinates multiple generic agents working in parallel on complex, decomposable tasks. All agents communicate via JSON format for structured coordination.
## Usage
**Option 1: Direct prompt**
```
/orchestrate "Analyze all gold tables, identify optimization opportunities, and implement improvements across the codebase"
```
**Option 2: Task file from .claude/tasks/**
```
/orchestrate multi_agent_pipeline_optimization.md
```
**Option 3: List available orchestration tasks**
```
/orchestrate list
```
## Variables
- `TASK_INPUT`: Either a direct prompt string or a task file name from `.claude/tasks/`
- `TASK_FILE_PATH`: Full path to task file if using a task file
- `PROMPT_CONTENT`: The actual prompt to send to the orchestrator agent
## Instructions
### 1. Determine Task Source
Check if `$ARGUMENTS` looks like a file name (ends with `.md` or contains no spaces):
- If YES: It's a task file name from `.claude/tasks/`
- If NO: It's a direct user prompt
- If "list": Show available orchestration task files
### 2. Load Task Content
**If using task file:**
1. List all available task files in `.claude/tasks/` directory
2. Find the task file matching the provided name (exact match or partial match)
3. Read the task file content
4. Use the full task file content as the prompt
**If using direct prompt:**
1. Use the `$ARGUMENTS` directly as the prompt
**If "list" command:**
1. Show all available orchestration task files with metadata
2. Exit without launching agents
### 3. Launch Orchestrator Agent
Launch the orchestrator agent using the Task tool with the following configuration:
**Important Configuration:**
- **subagent_type**: `general-purpose`
- **model**: `sonnet` (default) or `opus` for highly complex orchestrations
- **description**: Short 3-5 word description (e.g., "Orchestrate pipeline optimization")
- **prompt**: Complete orchestrator instructions (see template below)
**Orchestrator Prompt Template:**
```
You are an ORCHESTRATOR AGENT coordinating multiple generic worker agents on a complex project task.
PROJECT CONTEXT:
- Project: Unify 2.1 Data Migration using Azure Synapse Analytics
- Architecture: Medallion pattern (Bronze/Silver/Gold layers)
- Primary Language: PySpark Python
- Follow: .claude/CLAUDE.md and .claude/rules/python_rules.md
YOUR ORCHESTRATOR RESPONSIBILITIES:
1. Analyze the main task and decompose it into 2-8 independent subtasks
2. Launch multiple generic worker agents (use Task tool with subagent_type="general-purpose")
3. Provide each worker agent with:
- Clear, self-contained instructions
- Required context (file paths, requirements)
- Expected JSON response format
4. Collect and aggregate all worker responses
5. Validate completeness and consistency
6. Produce final consolidated report
MAIN TASK TO ORCHESTRATE:
{TASK_CONTENT}
WORKER AGENT COMMUNICATION PROTOCOL:
Each worker agent MUST return results in this JSON format:
```json
{
"agent_id": "unique_identifier",
"task_assigned": "brief description",
"status": "completed|failed|partial",
"results": {
"files_modified": ["path/to/file1.py", "path/to/file2.py"],
"changes_summary": "description of changes",
"metrics": {
"lines_added": 0,
"lines_removed": 0,
"functions_added": 0,
"issues_fixed": 0
}
},
"quality_checks": {
"syntax_check": "passed|failed",
"linting": "passed|failed",
"formatting": "passed|failed"
},
"issues_encountered": ["issue1", "issue2"],
"recommendations": ["recommendation1", "recommendation2"],
"execution_time_seconds": 0
}
```
WORKER AGENT PROMPT TEMPLATE:
When launching each worker agent, use this prompt structure:
```
You are a WORKER AGENT (ID: {agent_id}) reporting to an orchestrator.
CRITICAL: You MUST return your results in JSON format as specified below.
PROJECT CONTEXT:
- Read and follow: .claude/CLAUDE.md and .claude/rules/python_rules.md
- Coding Standards: 240 char lines, no blanks in functions, type hints required
- Use: @synapse_error_print_handler decorator, NotebookLogger, TableUtilities
YOUR ASSIGNED SUBTASK:
{subtask_description}
FILES TO WORK ON:
{file_list}
REQUIREMENTS:
{specific_requirements}
QUALITY GATES (MUST RUN):
1. python3 -m py_compile <modified_files>
2. ruff check python_files/
3. ruff format python_files/
REQUIRED JSON RESPONSE FORMAT:
```json
{
"agent_id": "{agent_id}",
"task_assigned": "{subtask_description}",
"status": "completed",
"results": {
"files_modified": [],
"changes_summary": "",
"metrics": {
"lines_added": 0,
"lines_removed": 0,
"functions_added": 0,
"issues_fixed": 0
}
},
"quality_checks": {
"syntax_check": "passed|failed",
"linting": "passed|failed",
"formatting": "passed|failed"
},
"issues_encountered": [],
"recommendations": [],
"execution_time_seconds": 0
}
```
Work autonomously, complete your task, run quality gates, and return the JSON response.
```
ORCHESTRATION WORKFLOW:
1. **Task Decomposition**: Break main task into 2-8 independent subtasks
2. **Agent Assignment**: Create unique agent IDs (agent_1, agent_2, etc.)
3. **Parallel Launch**: Launch all worker agents simultaneously using Task tool
4. **Monitor Progress**: Track each agent's completion
5. **Collect Results**: Parse JSON responses from each worker agent
6. **Validate Output**: Ensure all quality checks passed
7. **Aggregate Results**: Combine all worker outputs
8. **Generate Report**: Create comprehensive orchestration summary
FINAL ORCHESTRATOR REPORT FORMAT:
```json
{
"orchestration_summary": {
"main_task": "{original task description}",
"total_agents_launched": 0,
"successful_agents": 0,
"failed_agents": 0,
"total_execution_time_seconds": 0
},
"agent_results": [
{worker_agent_json_response_1},
{worker_agent_json_response_2},
...
],
"consolidated_metrics": {
"total_files_modified": 0,
"total_lines_added": 0,
"total_lines_removed": 0,
"total_functions_added": 0,
"total_issues_fixed": 0
},
"quality_validation": {
"all_syntax_checks_passed": true,
"all_linting_passed": true,
"all_formatting_passed": true
},
"consolidated_issues": [],
"consolidated_recommendations": [],
"next_steps": []
}
```
BEST PRACTICES:
- Keep subtasks independent (no dependencies between worker agents)
- Provide complete context to each worker agent
- Launch all agents in parallel for maximum efficiency
- Validate JSON responses from each worker
- Aggregate metrics and results systematically
- Flag any worker failures or incomplete results
- Provide actionable next steps
Work autonomously and orchestrate the complete task execution.
```
### 4. Inform User
After launching the orchestrator, inform the user:
- Orchestrator agent has been launched
- Main task being orchestrated (summary)
- Expected number of worker agents to be spawned
- Estimated completion time (if known)
- The orchestrator will coordinate all work and provide a consolidated JSON report
## Task File Structure
Expected orchestration task file format in `.claude/tasks/`:
```markdown
# Orchestration Task Title
**Date Created**: YYYY-MM-DD
**Priority**: HIGH/MEDIUM/LOW
**Estimated Total Time**: X minutes
**Complexity**: High/Medium/Low
**Recommended Worker Agents**: N
## Main Objective
Clear description of the overall goal
## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
## Suggested Subtask Decomposition
### Subtask 1: Title
**Scope**: Files/components affected
**Estimated Time**: X minutes
**Dependencies**: None or list other subtasks
**Description**: What needs to be done
**Expected Outputs**:
- Output 1
- Output 2
---
### Subtask 2: Title
**Scope**: Files/components affected
**Estimated Time**: X minutes
**Dependencies**: None or list other subtasks
**Description**: What needs to be done
**Expected Outputs**:
- Output 1
- Output 2
---
(Repeat for each suggested subtask)
## Quality Requirements
- All code must pass syntax validation
- All code must pass linting
- All code must be formatted
- All agents must return valid JSON
## Aggregation Requirements
- How to combine results from worker agents
- Validation steps for consolidated output
- Reporting requirements
```
## Examples
### Example 1: Pipeline Optimization
```
User: /orchestrate "Analyze and optimize all gold layer tables for performance"
Orchestrator launches 5 worker agents:
- agent_1: Analyze g_x_mg_* tables
- agent_2: Analyze g_xa_* tables
- agent_3: Review joins and aggregations
- agent_4: Check indexing strategies
- agent_5: Validate query plans
Each agent reports back with JSON results
Orchestrator aggregates findings and produces consolidated report
```
### Example 2: Code Quality Sweep
```
User: /orchestrate code_quality_improvement.md
Orchestrator reads task file with 8 categories
Launches 8 worker agents in parallel:
- agent_1: Fix linting issues in bronze layer
- agent_2: Fix linting issues in silver layer
- agent_3: Fix linting issues in gold layer
- agent_4: Add missing type hints
- agent_5: Update error handling
- agent_6: Improve logging
- agent_7: Optimize imports
- agent_8: Update documentation
Collects JSON from all 8 agents
Validates quality checks
Produces aggregated metrics report
```
### Example 3: Feature Implementation
```
User: /orchestrate "Implement data validation framework across all layers"
Orchestrator decomposes into:
- agent_1: Design validation schema
- agent_2: Implement bronze validators
- agent_3: Implement silver validators
- agent_4: Implement gold validators
- agent_5: Create validation tests
- agent_6: Update documentation
Coordinates execution
Collects results in JSON format
Validates completeness
Generates implementation report
```
## JSON Response Validation
The orchestrator MUST validate each worker agent response contains:
**Required Fields:**
- `agent_id`: String, unique identifier
- `task_assigned`: String, description of assigned work
- `status`: String, one of ["completed", "failed", "partial"]
- `results`: Object with:
- `files_modified`: Array of strings
- `changes_summary`: String
- `metrics`: Object with numeric values
- `quality_checks`: Object with pass/fail values
- `issues_encountered`: Array of strings
- `recommendations`: Array of strings
- `execution_time_seconds`: Number
**Validation Checks:**
- All required fields present
- Status is valid enum value
- Arrays are properly formatted
- Metrics are numeric
- Quality checks are pass/fail
- JSON is well-formed and parseable
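As a rough illustration, the checks above could be implemented along the following lines; the helper name and exact error messages are illustrative only.

```python
import json

REQUIRED_FIELDS = {"agent_id", "task_assigned", "status", "results", "quality_checks",
                   "issues_encountered", "recommendations", "execution_time_seconds"}
VALID_STATUSES = {"completed", "failed", "partial"}

def validate_worker_response(raw: str) -> tuple[bool, list[str]]:
    # Returns (is_valid, problems) for a single worker agent JSON reply.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc}"]
    if not isinstance(payload, dict):
        return False, ["response is not a JSON object"]
    problems = [f"missing field: {field}" for field in sorted(REQUIRED_FIELDS - payload.keys())]
    if payload.get("status") not in VALID_STATUSES:
        problems.append(f"invalid status: {payload.get('status')!r}")
    results = payload.get("results", {})
    if not isinstance(results.get("files_modified", []), list):
        problems.append("results.files_modified must be an array")
    if not isinstance(payload.get("execution_time_seconds", 0), (int, float)):
        problems.append("execution_time_seconds must be numeric")
    return not problems, problems
```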
## Agent Coordination Patterns
### Pattern 1: Parallel Independent Tasks
```
Orchestrator launches all agents simultaneously
No dependencies between agents
Each agent works on separate files/components
Results aggregated at end
```
### Pattern 2: Sequential with Handoff (Not Recommended)
```
Orchestrator launches agent_1
Waits for agent_1 JSON response
Uses agent_1 results to inform agent_2 prompt
Launches agent_2 with context from agent_1
Continues chain
```
### Pattern 3: Hybrid (Parallel Groups)
```
Orchestrator identifies 2-3 independent groups
Launches all agents in group 1 in parallel
Waits for group 1 completion
Launches all agents in group 2 with context from group 1
Aggregates results from all groups
```
## Success Criteria
Orchestration task completion requires:
- ✅ All worker agents launched successfully
- ✅ All worker agents returned valid JSON responses
- ✅ All quality checks passed across all agents
- ✅ No unresolved issues or failures
- ✅ Consolidated metrics calculated correctly
- ✅ Comprehensive orchestration report provided
- ✅ All files syntax validated
- ✅ All files linted and formatted
## Best Practices
### For Orchestrator Design
- Keep worker tasks independent when possible
- Provide complete context to each worker
- Assign unique, meaningful agent IDs
- Specify clear JSON response requirements
- Validate all JSON responses
- Handle worker failures gracefully
- Aggregate results systematically
- Provide actionable consolidated report
### For Worker Agent Design
- Make each subtask self-contained
- Include all necessary context in prompt
- Specify exact file paths and requirements
- Define clear success criteria
- Require JSON response format
- Include quality gate validation
- Request execution metrics
### For Task Decomposition
- Break into 2-8 independent subtasks
- Avoid inter-agent dependencies
- Balance workload across agents
- Group related work logically
- Consider file/component boundaries
- Respect layer separation (bronze/silver/gold)
## Error Handling
### Worker Agent Failures
If a worker agent fails:
1. Orchestrator captures failure details
2. Marks agent status as "failed" in JSON
3. Continues with other agents
4. Reports failure in final summary
5. Suggests recovery steps
### JSON Parse Errors
If worker returns invalid JSON:
1. Orchestrator logs parse error
2. Attempts to extract partial results
3. Marks agent response as invalid
4. Flags for manual review
5. Continues with valid responses
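One possible recovery strategy for step 2 is sketched below: pull the outermost JSON object out of a noisy reply and attempt to parse it. This is an assumption about how partial extraction might work, not a prescribed implementation.

```python
import json
import re

def extract_partial_json(reply: str) -> dict | None:
    # Best-effort recovery: grab the outermost {...} block from a noisy worker
    # reply and try to parse it; return None if nothing salvageable is found.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if not match:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```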
### Quality Check Failures
If worker's quality checks fail:
1. Orchestrator flags the failure
2. Includes failure details in report
3. Prevents final approval
4. Suggests corrective actions
5. May relaunch worker with corrections
## Performance Optimization
### Parallel Execution
- Launch all independent agents simultaneously
- Use Task tool with multiple concurrent calls
- Maximize parallelism for faster completion
- Monitor resource utilization
### Agent Sizing
- 2-8 agents: Optimal for most tasks
- <2 agents: Consider using single agent instead
- >8 agents: May have coordination overhead
- Balance granularity vs overhead
### Context Management
- Provide minimal necessary context
- Avoid duplicating shared information
- Use references to shared documentation
- Keep prompts focused and concise
## Notes
- Orchestrator coordinates but doesn't do actual code changes
- Worker agents are general-purpose and autonomous
- All communication uses structured JSON format
- Quality validation is mandatory across all agents
- Failed agents don't block other agents
- Orchestrator produces human-readable summary
- JSON enables programmatic result processing
- Pattern scales from 2 to 8 parallel agents
- Best for complex, decomposable tasks
- Overkill for simple, atomic tasks

View File

@@ -0,0 +1,84 @@
---
allowed-tools: Read, Bash, Grep, Glob
argument-hint: [monitoring-type] | --apm | --rum | --custom
description: Setup comprehensive application performance monitoring with metrics, alerting, and observability
---
# Add Performance Monitoring
Setup application performance monitoring: **$ARGUMENTS**
## Instructions
1. **Performance Monitoring Strategy**
- Define key performance indicators (KPIs) and service level objectives (SLOs)
- Identify critical user journeys and performance bottlenecks
- Plan monitoring architecture and data collection strategy
- Assess existing monitoring infrastructure and integration points
- Define alerting thresholds and escalation procedures
2. **Application Performance Monitoring (APM)**
- Set up comprehensive APM solution (New Relic, Datadog, AppDynamics)
- Configure distributed tracing for request lifecycle visibility
- Implement custom metrics and performance tracking
- Set up transaction monitoring and error tracking
- Configure performance profiling and diagnostics
3. **Real User Monitoring (RUM)**
- Implement client-side performance tracking and web vitals monitoring
- Set up user experience metrics collection (LCP, FID, CLS, TTFB)
- Configure custom performance metrics for user interactions
- Monitor page load performance and resource loading
- Track user journey performance across different devices
4. **Server Performance Monitoring**
- Monitor system metrics (CPU, memory, disk, network)
- Set up process and application-level monitoring
- Configure event loop lag and garbage collection monitoring
- Implement custom server performance metrics
- Monitor resource utilization and capacity planning
5. **Database Performance Monitoring**
- Track database query performance and slow query identification
- Monitor database connection pool utilization
- Set up database performance metrics and alerting
- Implement query execution plan analysis
- Monitor database resource usage and optimization opportunities
6. **Error Tracking and Monitoring**
- Implement comprehensive error tracking (Sentry, Bugsnag, Rollbar)
- Configure error categorization and impact analysis
- Set up error alerting and notification systems
- Track error trends and resolution metrics
- Implement error context and debugging information
7. **Custom Metrics and Dashboards**
   - Implement business metrics tracking (Prometheus, StatsD; see the sketch after this list)
- Create performance dashboards and visualizations
- Configure custom alerting rules and thresholds
- Set up performance trend analysis and reporting
- Implement performance regression detection
8. **Alerting and Notification System**
- Configure intelligent alerting based on performance thresholds
- Set up multi-channel notifications (email, Slack, PagerDuty)
- Implement alert escalation and on-call procedures
- Configure alert fatigue prevention and noise reduction
- Set up performance incident management workflows
9. **Performance Testing Integration**
- Integrate monitoring with load testing and performance testing
- Set up continuous performance testing and monitoring
- Configure performance baseline tracking and comparison
- Implement performance test result analysis and reporting
- Monitor performance under different load scenarios
10. **Performance Optimization Recommendations**
- Generate actionable performance insights and recommendations
- Implement automated performance analysis and reporting
- Set up performance optimization tracking and measurement
- Configure performance improvement validation
- Create performance optimization prioritization frameworks
Focus on monitoring strategies that provide actionable insights for performance optimization. Ensure monitoring overhead is minimal and doesn't impact application performance.
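As one concrete example of the custom metrics in step 7, a minimal Python sketch using the `prometheus_client` library is shown below. The metric names, labels, and port are placeholders; real KPIs, labels, and alerting thresholds come from the monitoring strategy defined in step 1.

```python
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Placeholder metric names and labels; replace with the KPIs defined in step 1.
REQUESTS_TOTAL = Counter("app_requests_total", "Total requests processed", ["endpoint", "status"])
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds", ["endpoint"])

def handle_request(endpoint: str) -> None:
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real request handling
    REQUESTS_TOTAL.labels(endpoint=endpoint, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("/api/orders")
```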

268
commands/pr-deploy-workflow.md Executable file
View File

@@ -0,0 +1,268 @@
---
model: claude-haiku-4-5-20251001
allowed-tools: SlashCommand, Bash(git:*), mcp__ado__repo_get_pull_request_by_id, mcp__ado__repo_list_pull_requests_by_repo_or_project
argument-hint: [commit-message]
description: Complete deployment workflow - commit, PR to staging, review, then staging to develop
---
# Complete Deployment Workflow
Automates the full deployment workflow with integrated PR review:
1. Commit feature changes and create PR to staging
2. Automatically review the PR for quality and standards
3. Fix any issues identified in review (with iteration loop)
4. After PR is approved and merged, create PR from staging to develop
## What This Does
1. Calls `/pr-feature-to-staging` to commit and create feature → staging PR
2. Calls `/pr-review` to automatically review the PR
3. If review identifies issues → calls `/pr-fix-pr-review` and loops back to review
4. If review passes → waits for user to merge staging PR
5. Calls `/pr-staging-to-develop` to create staging → develop PR
## Implementation Logic
### Step 1: Create Feature PR to Staging
Use `SlashCommand` tool to execute:
```
/pr-feature-to-staging $ARGUMENTS
```
**Expected Output:**
- PR URL and PR ID
- Work item comments added
- Source and target branches confirmed
**Extract from output:**
- PR ID (needed for review step)
- PR number (for user reference)
### Step 2: Automated PR Review
Use `SlashCommand` tool to execute:
```
/pr-review [PR_ID]
```
**The review will evaluate:**
- Code quality and maintainability
- PySpark best practices
- ETL pattern compliance
- Standards compliance from `.claude/rules/python_rules.md`
- DevOps considerations
- Merge conflicts
**Review Outcomes:**
#### Outcome A: Review Passes (PR Approved)
Review output will indicate:
- "PR approved and set to auto-complete"
- No active review comments requiring changes
- All quality gates passed
**Action:** Proceed to Step 4
#### Outcome B: Review Requires Changes
Review output will indicate:
- Active review comments with specific issues
- Quality standards not met
- Files requiring modifications
**Action:** Proceed to Step 3
### Step 3: Fix Review Issues (if needed)
**Only execute if Step 2 identified issues**
Use `SlashCommand` tool to execute:
```
/pr-fix-pr-review [PR_ID]
```
**This will:**
1. Retrieve all active review comments
2. Make code changes to address feedback
3. Run quality gates (syntax, lint, format)
4. Commit fixes and push to feature branch
5. Reply to review threads
6. Update the PR automatically
**After fixes are applied:**
- Loop back to Step 2 to re-review
- Continue iterating until review passes
**Iteration Logic:**
```
LOOP while review has active issues:
1. /pr-fix-pr-review [PR_ID]
2. /pr-review [PR_ID]
3. Check review outcome
4. If approved → exit loop
5. If still has issues → continue loop
END LOOP
```
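The same loop, expressed as a minimal Python sketch. The `review` and `fix` callables are hypothetical stand-ins for the `/pr-review` and `/pr-fix-pr-review` slash commands, and the iteration cap is a suggested safeguard rather than part of the documented workflow.

```python
from typing import Callable

def review_until_approved(pr_id: int, review: Callable[[int], list[str]], fix: Callable[[int], None], max_iterations: int = 5) -> bool:
    # Iterate review -> fix until no active issues remain, with a cap so a
    # stubborn PR is escalated for manual attention instead of looping forever.
    for _ in range(max_iterations):
        issues = review(pr_id)   # stand-in for /pr-review
        if not issues:
            return True          # approved: exit the loop
        fix(pr_id)               # stand-in for /pr-fix-pr-review
    return False                 # escalate for manual review
```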
### Step 4: Wait for Staging PR Merge
After PR review passes and is approved, inform user:
```
✅ PR Review Passed - PR Approved and Ready
PR #[PR_ID] has been reviewed and approved with auto-complete enabled.
Review Summary:
- Code quality: ✓ Passed
- PySpark best practices: ✓ Passed
- ETL patterns: ✓ Passed
- Standards compliance: ✓ Passed
- No merge conflicts
Next Steps:
1. The PR will auto-merge when all policies are satisfied
2. Once merged to staging, I'll create the staging → develop PR
Would you like me to:
a) Create the staging → develop PR now (if staging merge is complete)
b) Wait for you to confirm the staging merge
c) Check the PR status
Enter choice (a/b/c):
```
**User Responses:**
- **a**: Immediately proceed to Step 5
- **b**: Wait for user confirmation, then proceed to Step 5
- **c**: Use `mcp__ado__repo_get_pull_request_by_id` to check if PR is merged, then guide user
### Step 5: Create Staging to Develop PR
Use `SlashCommand` tool to execute:
```
/pr-staging-to-develop
```
**This will:**
1. Create PR: staging → develop
2. Handle any merge conflicts
3. Return PR URL for tracking
**Final Output:**
```
🚀 Deployment Workflow Complete
Feature → Staging:
- PR #[PR_ID] - Reviewed and Merged ✓
Staging → Develop:
- PR #[NEW_PR_ID] - Created and Ready for Review
- URL: [PR_URL]
Summary:
1. Feature PR created and reviewed
2. All quality gates passed
3. PR approved and merged to staging
4. Staging PR created for develop
The workflow is complete. The staging → develop PR is now ready for final review and deployment.
```
## Example Usage
### Full Workflow with Work Item
```bash
/deploy-workflow "feat(gold): add X_MG_Offender linkage table #45497"
```
**This will:**
1. Create commit on feature branch
2. Create PR: feature → staging
3. Comment on work item #45497
4. Automatically review PR for quality
5. Fix any issues identified (with iteration)
6. Wait for staging PR merge
7. Create PR: staging → develop
### Full Workflow Without Work Item
```bash
/deploy-workflow "refactor: optimise session management"
```
**This will:**
1. Create commit on feature branch
2. Create PR: feature → staging
3. Automatically review PR
4. Fix any issues (iterative)
5. Wait for merge confirmation
6. Create staging → develop PR
## Review Iteration Example
**Scenario:** Review finds 3 issues in the initial PR
```
Step 1: /pr-feature-to-staging "feat: add new table"
→ PR #5678 created
Step 2: /pr-review 5678
→ Found 3 issues:
- Missing type hints in function
- Line exceeds 240 characters
- Missing @synapse_error_print_handler decorator
Step 3: /pr-fix-pr-review 5678
→ Fixed all 3 issues
→ Committed and pushed
→ PR updated
Step 2 (again): /pr-review 5678
→ All issues resolved
→ PR approved ✓
Step 4: Wait for merge confirmation
Step 5: /pr-staging-to-develop
→ PR #5679 created (staging → develop)
Complete!
```
## Error Handling
### PR Creation Fails
- Display error from `/pr-feature-to-staging`
- Guide user to resolve (branch validation, git issues)
- Do not proceed to review step
### Review Cannot Complete
- Display specific blocker (merge conflicts, missing files)
- Guide user to manual resolution
- Offer to retry review after fix
### Fix PR Review Fails
- Display specific errors (quality gates, git issues)
- Offer manual intervention option
- Allow user to fix locally and skip to next step
### Staging PR Already Exists
- Use `mcp__ado__repo_list_pull_requests_by_repo_or_project` to check existing PRs
- Inform user of existing PR
- Ask if they want to create anyway or use existing
## Notes
- **Automated Review**: Quality gates are enforced automatically
- **Iterative Fixes**: Will loop through fix → review until approved
- **Semi-Automated Merge**: User must confirm staging merge before final PR
- **Work Item Tracking**: Automatic comments on linked work items
- **Quality First**: Won't proceed if review fails and can't auto-fix
- **Graceful Degradation**: Offers manual intervention at each step if automation fails
## Quality Gates Enforced
The integrated `/pr-review` checks:
1. Code quality (type hints, line length, formatting)
2. PySpark best practices (DataFrame ops, logging, session mgmt)
3. ETL pattern compliance (class structure, decorators)
4. Standards from `.claude/rules/python_rules.md`
5. No merge conflicts
6. Proper error handling
All must pass before proceeding to staging → develop PR.

233
commands/pr-feature-to-staging.md Executable file
View File

@@ -0,0 +1,233 @@
---
model: claude-haiku-4-5-20251001
allowed-tools: Bash(git add:*), Bash(git status:*), Bash(git commit:*), Bash(git diff:*), Bash(git log:*), Bash(git push:*), Bash(git pull:*), Bash(git branch:*), mcp__*, mcp__ado__repo_list_branches_by_repo, mcp__ado__repo_search_commits, mcp__ado__repo_create_pull_request, mcp__ado__repo_get_pull_request_by_id, mcp__ado__repo_get_repo_by_name_or_id, mcp__ado__wit_add_work_item_comment, mcp__ado__wit_get_work_item, Read, Glob
argument-hint:
description: Automatically analyze changes and create PR from current feature branch to staging
---
# Create Feature PR to Staging
Automatically analyzes repository changes, generates appropriate commit message, and creates pull request to `staging`.
## Repository Configuration
- **Project**: Program Unify
- **Repository ID**: e030ea00-2f85-4b19-88c3-05a864d7298d
- **Repository Name**: unify_2_1_dm_synapse_env_d10
- **Target Branch**: `staging` (fixed)
- **Source Branch**: Current feature branch
## Current Repository State
- Git status: !`git status --short`
- Current branch: !`git branch --show-current`
- Staged changes: !`git diff --cached --stat`
- Unstaged changes: !`git diff --stat`
- Recent commits: !`git log --oneline -5`
## Implementation Logic
### 1. Validate Current Branch
- Get current branch: `git branch --show-current`
- **REQUIRE**: Branch must start with `feature/`
- **BLOCK**: `staging`, `develop`, `main` branches
- If validation fails: Show clear error and exit
### 2. Analyze Changes and Generate Commit Message
- Run `git status --short` to see modified files
- Run `git diff --stat` to see change statistics
- Run `git diff` to analyze actual code changes
- **Automatically determine**:
- **Type**: Based on file changes (feat, fix, refactor, docs, test, chore, etc.)
- **Scope**: From file paths (bronze, silver, gold, utilities, pipeline, etc.)
- **Description**: Concise summary of what changed (e.g., "add person address table", "fix deduplication logic")
- **Work Items**: Extract from branch name pattern (e.g., feature/46225-description → #46225)
- **Analysis Rules**:
- New files in gold/silver/bronze → `feat`
- Modified transformation logic → `refactor` or `fix`
- Test files → `test`
- Documentation → `docs`
- Utilities/session_optimiser → `refactor` or `feat`
- Multiple file types → prioritize feat > fix > refactor
- Gold layer → scope: `(gold)`
- Silver layer → scope: `(silver)` or `(silver_<database>)`
- Bronze layer → scope: `(bronze)`
- Generate commit message in format: `emoji type(scope): description #workitem`
### 3. Execute Commit Workflow
- Stage all changes: `git add .`
- Create commit with auto-generated emoji conventional format
- Run pre-commit hooks (ruff lint/format, YAML validation, etc.)
- Push to current feature branch
### 4. Create Pull Request
- Use `mcp__ado__repo_create_pull_request` with:
- `repositoryId`: e030ea00-2f85-4b19-88c3-05a864d7298d
- `sourceRefName`: Current feature branch (refs/heads/feature/*)
- `targetRefName`: refs/heads/staging
- `title`: Extract from auto-generated commit message
- `description`: Brief summary with bullet points based on analyzed changes
- Return PR URL to user
### 5. Add Work Item Comments (Automatic)
If PR creation was successful:
- Get work items linked to PR using `mcp__ado__repo_get_pull_request_by_id`
- For each linked work item:
- Verify work item exists with `mcp__ado__wit_get_work_item`
- Generate comment with:
- PR title and number
- Commit message and SHA
- File changes summary from `git diff --stat`
- Link to PR in Azure DevOps
- Link to commit in Azure DevOps
- Add comment using `mcp__ado__wit_add_work_item_comment`
- Use HTML format for rich formatting
- **IMPORTANT**: Do NOT include footer attribution text
- **IMPORTANT**: Always use Australian English in all messages and descriptions
- **IMPORTANT**: Do not mention that you are using Australian English in any messages or descriptions
## Commit Message Format
### Type + Emoji Mapping
- ✨ `feat`: New feature
- 🐛 `fix`: Bug fix
- 📝 `docs`: Documentation
- 💄 `style`: Formatting/style
- ♻️ `refactor`: Code refactoring
- ⚡️ `perf`: Performance improvements
- ✅ `test`: Tests
- 🔧 `chore`: Tooling, configuration
- 🚀 `ci`: CI/CD improvements
- 🗃️ `db`: Database changes
- 🔥 `fix`: Remove code/files
- 📦️ `chore`: Dependencies
- 🚸 `feat`: UX improvements
- 🦺 `feat`: Validation
### Example Format
```
✨ feat(gold): add X_MG_Offender linkage table #45497
```
### Auto-Generation Logic
**File Path Analysis**:
- `python_files/gold/*.py` → scope: `(gold)`
- `python_files/silver/s_fvms_*.py` → scope: `(silver_fvms)` or `(silver)`
- `python_files/silver/s_cms_*.py` → scope: `(silver_cms)` or `(silver)`
- `python_files/bronze/*.py` → scope: `(bronze)`
- `python_files/utilities/*.py` → scope: `(utilities)`
- `python_files/pipeline_operations/*.py` → scope: `(pipeline)`
- `python_files/testing/*.py` → scope: `(test)`
- `.claude/**`, `*.md` → scope: `(docs)`
**Change Type Detection**:
- New files (`A` in git status) → `feat` ✨
- Modified transformation/ETL files → `refactor` ♻️
- Bug fixes (keywords: fix, bug, error, issue) → `fix` 🐛
- Test files → `test` ✅
- Documentation files → `docs` 📝
- Configuration files → `chore` 🔧
**Description Generation**:
- Extract meaningful operation from file names and diffs
- New table: "add <table_name> table"
- Modified logic: "improve/update <functionality>"
- Bug fix: "fix <issue_description>"
- Refactor: "refactor <component> for <reason>"
**Work Item Extraction**:
- Branch name pattern: `feature/<number>-description` → `#<number>`
- Multiple numbers: Extract first occurrence
- No number in branch: No work item reference added
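A minimal Python sketch of the scope and work item rules above. The mapping and helper names are illustrative; the command itself derives this information from the git diff rather than from a lookup table.

```python
import re

# Illustrative prefix-to-scope mapping based on the file path rules above.
SCOPE_BY_PATH = {
    "python_files/gold/": "gold",
    "python_files/silver/": "silver",
    "python_files/bronze/": "bronze",
    "python_files/utilities/": "utilities",
    "python_files/pipeline_operations/": "pipeline",
    "python_files/testing/": "test",
}

def infer_scope(changed_file: str) -> str:
    for prefix, scope in SCOPE_BY_PATH.items():
        if changed_file.startswith(prefix):
            return scope
    return "docs"  # .claude/** and *.md fall through to the docs scope

def work_item_from_branch(branch: str) -> str | None:
    # feature/46225-add-person-address-table -> "#46225"
    match = re.match(r"feature/(\d+)", branch)
    return f"#{match.group(1)}" if match else None

infer_scope("python_files/gold/g_occ_person_address.py")          # -> "gold"
work_item_from_branch("feature/46225-add-person-address-table")   # -> "#46225"
```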
## What This Command Does
1. Validates you're on a feature branch (feature/*)
2. Analyzes git changes to determine type, scope, and description
3. Extracts work item numbers from branch name
4. Auto-generates commit message with conventional emoji format
5. Stages all modified files
6. Creates commit with auto-generated message
7. Runs pre-commit hooks (auto-fixes code quality issues)
8. Pushes to current feature branch
9. Creates PR from feature branch → staging
10. Automatically adds comments to linked work items with PR details
## Pre-Commit Hooks
Your project uses pre-commit with:
- **Ruff**: Linting with auto-fix + formatting
- **Standard hooks**: Trailing whitespace, YAML/JSON validation
- **Security**: Private key detection
Pre-commit hooks will auto-fix issues and may modify files. The commit process will:
1. Run hooks
2. Auto-stage modified files
3. Complete commit with fixes applied
## Example Usage
### Automatic Feature PR
```bash
/pr-feature-to-staging
```
**On branch**: `feature/46225-add-person-address-table`
**Changed files**: `python_files/gold/g_occ_person_address.py` (new file)
**Auto-generated commit**: `✨ feat(gold): add person address table #46225`
This will:
1. Analyze changes (new gold layer file)
2. Extract work item #46225 from branch name
3. Auto-generate commit message
4. Commit and push to feature branch
5. Create PR: `feature/46225-add-person-address-table → staging`
6. Link work item #46225
7. Add automatic comment to work item #46225 with PR details
### Multiple File Changes
**On branch**: `feature/46789-refactor-deduplication`
**Changed files**:
- `python_files/silver/s_fvms_incident.py` (modified)
- `python_files/silver/s_cms_offence_report.py` (modified)
- `python_files/utilities/session_optimiser.py` (modified)
**Auto-generated commit**: `♻️ refactor(silver): improve deduplication logic #46789`
### Fix Bug
**On branch**: `feature/47123-fix-timestamp-parsing`
**Changed files**: `python_files/utilities/session_optimiser.py` (modified, TableUtilities.clean_date_time_columns)
**Auto-generated commit**: `🐛 fix(utilities): correct timestamp parsing for null values #47123`
## Error Handling
### Not on Feature Branch
```bash
# Error: On staging branch
/pr-feature-to-staging
```
**Result**: ERROR - Must be on feature/* branch. Current: staging
### Invalid Branch
```bash
# Error: On develop or main branch
/pr-feature-to-staging
```
**Result**: ERROR - Cannot create feature PR from develop/main branch
### No Changes to Commit
```bash
# Error: Working directory clean
/pr-feature-to-staging
```
**Result**: ERROR - No changes to commit. Working directory is clean.
## Best Practices
1. **Work on feature branches** - Always create PRs from `feature/*` branches
2. **Include work item in branch name** - Use pattern `feature/<work-item>-description` (e.g., `feature/46225-add-person-address`)
3. **Make focused changes** - Keep changes related to a single feature/fix for accurate commit message generation
4. **Let pre-commit work** - Hooks maintain code quality automatically
5. **Review changes** - Check `git status` before running command to ensure only intended files are modified
6. **Trust the automation** - The command analyzes your changes and generates appropriate conventional commit messages

294
commands/pr-fix-pr-review.md Executable file
View File

@@ -0,0 +1,294 @@
---
model: claude-haiku-4-5-20251001
allowed-tools: Bash(git:*), Read, Edit, Write, Task, mcp__*, mcp__ado__repo_get_pull_request_by_id, mcp__ado__repo_list_pull_request_threads, mcp__ado__repo_list_pull_request_thread_comments, mcp__ado__repo_reply_to_comment, mcp__ado__repo_resolve_comment, mcp__ado__repo_get_repo_by_name_or_id, mcp__ado__wit_add_work_item_comment
argument-hint: [PR_ID]
description: Address PR review feedback and update pull request
---
# Fix PR Review Issues
Address feedback from PR review comments, make necessary code changes, and update the pull request.
## Repository Configuration
- **Project**: Program Unify
- **Repository ID**: d3fa6f02-bfdf-428d-825c-7e7bd4e7f338
- **Repository Name**: unify_2_1_dm_synapse_env_d10
## What This Does
1. Retrieves PR details and all active review comments
2. Analyzes review feedback and identifies required changes
3. Makes code changes to address each review comment
4. Commits changes with descriptive message
5. Pushes to feature branch (automatically updates PR)
6. Replies to review threads confirming fixes
7. Resolves review threads when appropriate
## Implementation Logic
### 1. Get PR Information
- Use `mcp__ado__repo_get_pull_request_by_id` with PR_ID from `$ARGUMENTS`
- Extract source branch, target branch, and PR title
- Validate PR is still active
### 2. Retrieve Review Comments
- Use `mcp__ado__repo_list_pull_request_threads` to get all threads
- Filter for active threads (status = "Active")
- For each thread, use `mcp__ado__repo_list_pull_request_thread_comments` to get details
- Display all review comments with:
- File path and line number
- Reviewer name
- Comment content
- Thread ID (for later replies)
### 3. Checkout Feature Branch
```bash
git fetch origin
git checkout <source-branch-name>
git pull origin <source-branch-name>
```
### 4. Address Each Review Comment
**Categorise review comments first:**
#### Standard Code Quality Issues
Handle directly with Edit tool for:
- Type hints
- Line length violations
- Formatting issues
- Missing decorators
- Import organization
- Variable naming
**Implementation:**
1. Read affected file using Read tool
2. Analyze the feedback and determine required changes
3. Make code changes using Edit tool
4. Validate changes meet project standards
#### Complex PySpark Issues
**Use pyspark-engineer agent for:**
- Performance optimisation requests
- Partitioning strategy changes
- Shuffle optimisation
- Broadcast join refactoring
- Memory management improvements
- Medallion architecture violations
- Complex transformation logic
**Trigger criteria:**
- Review comment mentions: "performance", "optimisation", "partitioning", "shuffle", "memory", "medallion", "bronze/silver/gold layer"
- Files affected in: `python_files/pipeline_operations/`, `python_files/silver/`, `python_files/gold/`, `python_files/utilities/session_optimiser.py`
**Use Task tool to launch pyspark-engineer agent:**
```
Task tool parameters:
- subagent_type: "pyspark-engineer"
- description: "Implement PySpark fixes for PR #[PR_ID]"
- prompt: "
Address PySpark review feedback for PR #[PR_ID]:
Review Comment Details:
[For each PySpark-related comment, include:]
- File: [FILE_PATH]
- Line: [LINE_NUMBER]
- Reviewer Feedback: [COMMENT_TEXT]
- Thread ID: [THREAD_ID]
Implementation Requirements:
1. Read all affected files
2. Implement fixes following these standards:
- Maximum line length: 240 characters
- No blank lines inside functions
- Proper type hints for all functions
- Use @synapse_error_print_handler decorator
- PySpark DataFrame operations (not SQL)
- Suffix _sdf for all DataFrames
- Follow medallion architecture patterns
3. Optimize for:
- Performance and cost-efficiency
- Data skew handling
- Memory management
- Proper partitioning strategies
4. Ensure production readiness:
- Error handling
- Logging with NotebookLogger
- Idempotent operations
5. Run quality gates:
- Syntax validation: python3 -m py_compile
- Linting: ruff check python_files/
- Formatting: ruff format python_files/
Return:
1. List of files modified
2. Summary of changes made
3. Explanation of how each review comment was addressed
4. Any additional optimisations implemented
"
```
**Integration:**
- pyspark-engineer will read, modify, and validate files
- Agent will run quality gates automatically
- You will receive summary of changes
- Use summary for commit message and review replies
#### Validation for All Changes
Regardless of method (direct Edit or pyspark-engineer agent):
- Maximum line length: 240 characters
- No blank lines inside functions
- Proper type hints
- Use of `@synapse_error_print_handler` decorator
- PySpark best practices from `.claude/rules/python_rules.md`
- Document all fixes for commit message
### 5. Validate Changes
Run quality gates:
```bash
# Syntax check
python3 -m py_compile <changed-file>
# Linting
ruff check python_files/
# Format
ruff format python_files/
```
### 6. Commit and Push
```bash
git add .
git commit -m "♻️ refactor: address PR review feedback - <brief-summary>"
git push origin <source-branch>
```
**Commit Message Format:**
```
♻️ refactor: address PR review feedback
Fixes applied:
- <file1>: <description of fix>
- <file2>: <description of fix>
- ...
Review comments addressed in PR #<PR_ID>
```
### 7. Reply to Review Threads
For each addressed comment:
- Use `mcp__ado__repo_reply_to_comment` to add reply:
```
✅ Fixed in commit <SHA>
Changes made:
- <specific change description>
```
- Use `mcp__ado__repo_resolve_comment` to mark thread as resolved (if appropriate)
### 8. Report Results
Provide summary:
```
PR Review Fixes Completed
PR: #<PR_ID> - <PR_Title>
Branch: <source-branch> → <target-branch>
Review Comments Addressed: <count>
Files Modified: <file-list>
Commit SHA: <sha>
Quality Gates:
✓ Syntax validation passed
✓ Linting passed
✓ Code formatting applied
The PR has been updated and is ready for re-review.
```
## Error Handling
### No PR ID Provided
If `$ARGUMENTS` is empty:
- Use `mcp__ado__repo_list_pull_requests_by_repo_or_project` to list open PRs
- Display all PRs created by current user
- Prompt user to specify PR ID
### No Active Review Comments
If no active review threads found:
```
No active review comments found for PR #<PR_ID>.
The PR may already be approved or have no feedback requiring changes.
Would you like me to re-run /pr-review to check current status?
```
### Merge Conflicts
If `git pull` results in merge conflicts:
1. Display conflict files
2. Guide user through resolution:
- Show conflicting sections
- Suggest resolution based on context
- Use Edit tool to resolve
3. Complete merge commit
4. Continue with review fixes
### Quality Gate Failures
If syntax check or linting fails:
1. Display specific errors
2. Fix automatically if possible
3. Re-run quality gates
4. Only proceed to commit when all gates pass
## Example Usage
### Fix Review for Specific PR
```bash
/pr-fix-pr-review 5642
```
### Fix Review for Latest PR
```bash
/pr-fix-pr-review
```
(Will list your open PRs if ID not provided)
## Best Practices
1. **Read all comments first** - Understand full scope before making changes
2. **Make targeted fixes** - Address each comment specifically
3. **Run quality gates** - Ensure changes meet project standards
4. **Descriptive replies** - Explain what was changed and why
5. **Resolve appropriately** - Only resolve threads when fix is complete
6. **Test locally** - Consider running relevant tests if available
## Integration with /deploy-workflow
This command is automatically called by `/deploy-workflow` when:
- `/pr-review` identifies issues requiring changes
- The workflow needs to iterate on PR quality before merging
The workflow will loop:
1. `/pr-review` → identifies issues (may include pyspark-engineer deep analysis)
2. `/pr-fix-pr-review` → addresses issues
   - Standard fixes: Direct Edit tool usage
   - Complex PySpark fixes: pyspark-engineer agent handles implementation
3. `/pr-review` → re-validates
4. Repeat until PR is approved
**PySpark-Engineer Integration:**
- Automatically triggered for performance and architecture issues
- Ensures optimised, production-ready PySpark code
- Maintains consistency with medallion architecture patterns
- Validates test coverage and quality gates
## Notes
- **Automatic PR Update**: Pushing to source branch automatically updates the PR
- **No New PR Created**: This updates the existing PR, doesn't create a new one
- **Preserves History**: All review iterations are preserved in commit history
- **Thread Management**: Replies and resolutions are tracked in Azure DevOps
- **Quality First**: Will not commit changes that fail quality gates
- **Intelligent Delegation**: Routes simple fixes to Edit tool, complex PySpark issues to specialist agent
- **Expert Optimisation**: pyspark-engineer ensures performance and architecture best practices

206
commands/pr-review.md Executable file
View File

@@ -0,0 +1,206 @@
---
model: claude-haiku-4-5-20251001
allowed-tools: Bash(git branch:*), Bash(git status:*), Bash(git log:*), Bash(git diff:*), mcp__*, mcp__ado__repo_get_repo_by_name_or_id, mcp__ado__repo_list_pull_requests_by_repo_or_project, mcp__ado__repo_get_pull_request_by_id, mcp__ado__repo_list_pull_request_threads, mcp__ado__repo_list_pull_request_thread_comments, mcp__ado__repo_create_pull_request_thread, mcp__ado__repo_reply_to_comment, mcp__ado__repo_update_pull_request, mcp__ado__repo_search_commits, mcp__ado__pipelines_get_builds, Read, Task
argument-hint: [PR_ID] (optional - if not provided, will list all open PRs)
---
# PR Review and Approval
## Task
Review open pull requests in the current repository and approve/complete them if they meet quality standards.
## Instructions
### 1. Get Repository Information
- Use `mcp__ado__repo_get_repo_by_name_or_id` with:
- Project: `Program Unify`
- Repository: `unify_2_1_dm_synapse_env_d10`
- Extract repository ID: `d3fa6f02-bfdf-428d-825c-7e7bd4e7f338`
### 2. List Open Pull Requests
- Use `mcp__ado__repo_list_pull_requests_by_repo_or_project` with:
- Repository ID: `d3fa6f02-bfdf-428d-825c-7e7bd4e7f338`
- Status: `Active`
- If `$ARGUMENTS` provided, filter to that specific PR ID
- Display all open PRs with key details (ID, title, source/target branches, author)
### 3. Review Each Pull Request
For each PR (or the specified PR):
#### 3.1 Get PR Details
- Use `mcp__ado__repo_get_pull_request_by_id` to get full PR details
- Check merge status - if conflicts exist, stop and report
#### 3.2 Get PR Changes
- Use `mcp__ado__repo_search_commits` to get commits in the PR
- Identify files changed and scope of changes
#### 3.3 Review Code Quality
Read changed files and evaluate:
1. **Code Quality & Maintainability**
- Proper use of type hints and descriptive variable names
- Maximum line length (240 chars) compliance
- No blank lines inside functions
- Proper import organization
- Use of `@synapse_error_print_handler` decorator
- Proper error handling with meaningful messages
2. **PySpark Best Practices**
- DataFrame operations over raw SQL
- Proper use of `TableUtilities` methods
- Correct logging with `NotebookLogger`
- Proper session management
3. **ETL Pattern Compliance**
- Follows ETL class pattern for Silver/Gold layers
- Proper extract/transform/load method structure
- Correct database and table naming conventions
4. **Standards Compliance**
- Follows project coding standards from `.claude/rules/python_rules.md`
- No missing docstrings (unless explicitly instructed to omit)
- Proper use of configuration from `configuration.yaml`
#### 3.4 Review DevOps Considerations
1. **CI/CD Integration**
- Changes compatible with existing pipeline
- No breaking changes to deployment process
2. **Configuration & Infrastructure**
- Proper environment detection pattern
- Azure integration handled correctly
- No hardcoded paths or credentials
3. **Testing & Quality Gates**
- Syntax validation would pass
- Linting compliance (ruff check)
- Test coverage for new functionality
#### 3.5 Deep PySpark Analysis (Conditional)
**Only execute if PR modifies PySpark ETL code**
Check if PR changes affect:
- `python_files/pipeline_operations/bronze_layer_deployment.py`
- `python_files/pipeline_operations/silver_dag_deployment.py`
- `python_files/pipeline_operations/gold_dag_deployment.py`
- Any files in `python_files/silver/`
- Any files in `python_files/gold/`
- `python_files/utilities/session_optimiser.py`
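The file check above reduces to a simple prefix match. A sketch, with the helper name purely illustrative:

```python
PYSPARK_TRIGGER_PATHS = (
    "python_files/pipeline_operations/bronze_layer_deployment.py",
    "python_files/pipeline_operations/silver_dag_deployment.py",
    "python_files/pipeline_operations/gold_dag_deployment.py",
    "python_files/silver/",
    "python_files/gold/",
    "python_files/utilities/session_optimiser.py",
)

def needs_deep_pyspark_review(changed_files: list[str]) -> bool:
    # True if any changed file falls under one of the trigger paths above.
    return any(path.startswith(PYSPARK_TRIGGER_PATHS) for path in changed_files)
```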
**If PySpark files are modified, use Task tool to launch pyspark-engineer agent:**
```
Task tool parameters:
- subagent_type: "pyspark-engineer"
- description: "Deep PySpark analysis for PR #[PR_ID]"
- prompt: "
Perform expert-level PySpark analysis for PR #[PR_ID]:
PR Details:
- Title: [PR_TITLE]
- Changed Files: [LIST_OF_CHANGED_FILES]
- Source Branch: [SOURCE_BRANCH]
- Target Branch: [TARGET_BRANCH]
Review Requirements:
1. Read all changed PySpark files
2. Analyze transformation logic for:
- Partitioning strategies and data skew
- Shuffle optimisation opportunities
- Broadcast join usage and optimisation
- Memory management and caching strategies
- DataFrame operation efficiency
3. Validate Medallion Architecture compliance:
- Bronze layer: Raw data preservation patterns
- Silver layer: Cleansing and standardization
- Gold layer: Business model optimisation
4. Check performance considerations:
- Identify potential bottlenecks
- Suggest optimisation opportunities
- Validate cost-efficiency patterns
5. Verify test coverage:
- Check for pytest test files
- Validate test completeness
- Suggest missing test scenarios
6. Review production readiness:
- Error handling for data pipeline failures
- Idempotent operation design
- Monitoring and logging completeness
Provide detailed findings in this format:
## PySpark Analysis Results
### Critical Issues (blocking)
- [List any critical performance or correctness issues]
### Performance Optimisations
- [Specific optimisation recommendations]
### Architecture Compliance
- [Medallion architecture adherence assessment]
### Test Coverage
- [Test completeness and gaps]
### Recommendations
- [Specific actionable improvements]
Return your analysis for integration into the PR review.
"
```
**Integration of PySpark Analysis:**
- If pyspark-engineer identifies critical issues → Add to review comments
- If optimisations suggested → Add as optional improvement comments
- If architecture violations found → Add as required changes
- Include all findings in final review summary
### 4. Provide Review Comments
- Use `mcp__ado__repo_list_pull_request_threads` to check existing review comments
- If issues found, use `mcp__ado__repo_create_pull_request_thread` to add:
- Specific file-level comments with line numbers
- Clear description of issues
- Suggested improvements
- Mark as `Active` status if changes required
### 5. Approve and Complete PR (if satisfied)
**Only proceed if ALL criteria met:**
- No merge conflicts
- Code quality standards met
- PySpark best practices followed
- ETL patterns correct
- No DevOps concerns
- Proper error handling and logging
- Standards compliant
- **PySpark analysis (if performed) shows no critical issues**
- **Performance optimisations either implemented or deferred with justification**
- **Medallion architecture compliance validated**
**If approved:**
1. Use `mcp__ado__repo_update_pull_request` with:
- Set `autoComplete: true`
- Set `mergeStrategy: "NoFastForward"` (or "Squash" if many small commits)
- Set `deleteSourceBranch: false` (preserve branch history)
- Set `transitionWorkItems: true`
- Add approval comment explaining what was reviewed
2. Confirm completion with summary:
- PR ID and title
- Number of commits reviewed
- Key changes identified
- Approval rationale
### 6. Report Results
Provide comprehensive summary:
- Total open PRs reviewed
- PRs approved and completed (with IDs)
- PRs requiring changes (with summary of issues)
- PRs blocked by merge conflicts
- **PySpark analysis findings (if performed)**
- **Performance optimisation recommendations**
## Important Notes
- **No deferrals**: All identified issues must be addressed before approval
- **Immediate action**: If improvements needed, request them now - no "future work" comments
- **Thorough review**: Check both code quality AND DevOps considerations
- **Professional objectivity**: Prioritize technical accuracy over simply validating the author's changes
- **Merge conflicts**: Do NOT approve PRs with merge conflicts - report them for manual resolution

View File

@@ -0,0 +1,35 @@
---
model: claude-haiku-4-5-20251001
allowed-tools: Bash(git add:*), Bash(git status:*), Bash(git commit:*), Bash(git diff:*), Bash(git log:*), Bash(git push:*), Bash(git pull:*), Bash(git branch:*), mcp__*, mcp__ado__repo_list_branches_by_repo, mcp__ado__repo_search_commits, mcp__ado__repo_create_pull_request, mcp__ado__repo_get_pull_request_by_id, mcp__ado__repo_get_repo_by_name_or_id, mcp__ado__wit_add_work_item_comment, mcp__ado__wit_get_work_item
argument-hint: [message] | --no-verify | --amend | --pr-s | --pr-d | --pr-m
---
# Create Remote PR: staging → develop
## Task
Create a pull request from remote `staging` branch to remote `develop` branch using Azure DevOps MCP tools.
## Instructions
### 1. Create PR
- Use `mcp__ado__repo_create_pull_request` tool
- Source: `refs/heads/staging` (remote only - do NOT push local branches)
- Target: `refs/heads/develop`
- Repository ID: `d3fa6f02-bfdf-428d-825c-7e7bd4e7f338`
- Title: Clear, concise description with conventional commit emoji
- Description: Brief bullet points summarising changes (keep short)
### 2. Check for Merge Conflicts
- Use `mcp__ado__repo_get_pull_request_by_id` to verify PR status
- If merge conflicts exist, resolve them:
1. Create temporary branch from `origin/staging`
2. Merge `origin/develop` into temp branch
3. Resolve conflicts using Edit tool
4. Commit resolution: `🔀 Merge origin/develop into staging - resolve conflicts for PR #XXXX`
5. Push resolved merge to `origin/staging`
6. Clean up temp branch
### 3. Success Criteria
- PR created successfully
- No merge conflicts preventing approval
- PR ready for reviewer approval

183
commands/prime-claude.md Normal file
View File

@@ -0,0 +1,183 @@
---
name: prime-claude-md
description: Distill CLAUDE.md to essentials, moving detailed knowledge into skills for on-demand loading. Reduces context pollution by 80-90%.
argument-hint: [--analyze-only] | [--backup] | [--apply]
---
# Prime CLAUDE.md
Distill your CLAUDE.md file to only essential information, moving detailed knowledge into skills.
## Problem
Large CLAUDE.md files (400+ lines) are loaded into context for EVERY conversation:
- Wastes 5,000-15,000 tokens per conversation
- Reduces space for actual work
- Slows Claude's responses
- 80% of the content is rarely needed
## Solution
**Prime your CLAUDE.md**:
1. Keep only critical architecture and coding standards
2. Move detailed knowledge into skills (loaded on-demand)
3. Reduce from 400+ lines to ~100 lines
4. Save 80-90% context per conversation
## Usage
### Analyze Current CLAUDE.md
```bash
/prime-claude-md --analyze-only
```
Shows what would be moved to skills without making changes.
### Create Backup and Apply
```bash
/prime-claude-md --backup --apply
```
1. Backs up current CLAUDE.md to CLAUDE.md.backup
2. Creates supporting skills with detailed knowledge
3. Replaces CLAUDE.md with distilled version
4. Documents what was moved where
### Just Apply (No Backup)
```bash
/prime-claude-md --apply
```
## What Gets Distilled
### Kept in CLAUDE.md (Essential)
- Critical architecture concepts (high-level only)
- Mandatory coding standards (line length, blank lines, decorators)
- Quality gates (syntax check, linting, formatting)
- Essential commands (2-3 most common)
- References to skills for details
### Moved to Skills (Detailed Knowledge)
**project-architecture** skill:
- Detailed medallion architecture
- Pipeline execution flow
- Data source details
- Azure integration specifics
- Configuration management
- Testing architecture
**project-commands** skill:
- Complete make command reference
- All development workflows
- Azure operations
- Database operations
- Git operations
- Troubleshooting commands
**pyspark-patterns** skill:
- TableUtilities method documentation
- ETL class pattern details
- Logging standards
- DataFrame operation patterns
- JDBC connection patterns
- Performance tips
## Results
**Before Priming**:
- CLAUDE.md: 420 lines
- Context cost: ~12,000 tokens per conversation
- Skills: 0
- Knowledge: Always loaded
**After Priming**:
- CLAUDE.md: ~100 lines (76% reduction)
- Context cost: ~2,000 tokens per conversation (83% savings)
- Skills: 3 specialized skills
- Knowledge: Loaded only when needed
## Example Distilled CLAUDE.md
```markdown
# CLAUDE.md
**CRITICAL**: READ `.claude/rules/python_rules.md`
## Architecture
Medallion: Bronze → Silver → Gold
Core: `session_optimiser.py` (SparkOptimiser, NotebookLogger, TableUtilities)
## Essential Commands
python3 -m py_compile <file> # Must run
ruff check python_files/ # Must pass
make run_all # Full pipeline
## Coding Standards
- Line length: 240 chars
- No blank lines in functions
- Use @synapse_error_print_handler
- Use logger (not print)
## Skills Available
- project-architecture: Detailed architecture
- project-commands: Complete command reference
- pyspark-patterns: PySpark best practices
```
## Benefits
1. **Faster conversations**: Less context overhead
2. **Better responses**: More room for actual work
3. **On-demand knowledge**: Load only what you need
4. **Maintainable**: Easier to update focused skills
5. **Reusable pattern**: Apply to any repository
## Applying to Other Repositories
This command is repository-agnostic. To use on another repo:
1. Run `/prime-claude-md --analyze-only` to see what you have
2. Command will identify:
- Architectural concepts
- Command references
- Coding standards
- Configuration details
3. Creates appropriate skills based on content
4. Run `/prime-claude-md --apply` when ready
## Files Created
```
.claude/
├── CLAUDE.md # Distilled (100 lines)
├── CLAUDE.md.backup # Original (if --backup used)
└── skills/
├── project-architecture/
│ └── skill.md # Architecture details
├── project-commands/
│ └── skill.md # Command reference
└── pyspark-patterns/ # (project-specific)
└── skill.md # Code patterns
```
## Philosophy
**CLAUDE.md should answer**: "What's special about this repo?"
**Skills should answer**: "How do I do X in detail?"
## Task Execution
I will:
1. Read current CLAUDE.md (both project and global if exists)
2. Analyze content and categorize
3. Create distilled CLAUDE.md (essential only)
4. Create supporting skills with detailed knowledge
5. If --backup: Save CLAUDE.md.backup
6. If --apply: Replace CLAUDE.md with distilled version
7. Generate summary report of changes
---
**Current Project**: Unify Data Migration (PySpark/Azure Synapse)
Let me analyze your CLAUDE.md and create the distilled version with supporting skills.

607
commands/pyspark-errors.md Executable file
View File

@@ -0,0 +1,607 @@
# PySpark Error Fixing Command
## Objective
Execute `make gold_table` and systematically fix all errors encountered in the PySpark gold layer file using specialized agents. Errors may be code-based (syntax, type, runtime) or logical (incorrect joins, missing data, business rule violations).
## Agent Workflow (MANDATORY)
### Phase 1: Error Fixing with pyspark-engineer
**CRITICAL**: All PySpark error fixing MUST be performed by the `pyspark-engineer` agent. Do NOT attempt to fix errors directly.
1. Launch the `pyspark-engineer` agent with:
- Full error stack trace and context
- Target file path
- All relevant schema information from MCP server
- Data dictionary references
2. The pyspark-engineer will:
- Validate MCP server connectivity
- Query schemas and foreign key relationships
- Analyze and fix all errors systematically
- Apply fixes following project coding standards
- Run quality gates (py_compile, ruff check, ruff format)
### Phase 2: Code Review with code-reviewer
**CRITICAL**: After pyspark-engineer completes fixes, MUST launch the `code-reviewer` agent.
1. Launch the `code-reviewer` agent with:
- Path to the fixed file(s)
- Context: "PySpark gold layer error fixes"
- Request comprehensive review focusing on:
- PySpark best practices
- Join logic correctness
- Schema alignment
- Business rule implementation
- Code quality and standards adherence
2. The code-reviewer will provide:
- Detailed feedback on all issues found
- Security vulnerabilities
- Performance optimization opportunities
- Code quality improvements needed
### Phase 3: Iterative Refinement (MANDATORY LOOP)
**CRITICAL**: The review-refactor cycle MUST continue until code-reviewer is 100% satisfied.
1. If code-reviewer identifies ANY issues:
- Launch pyspark-engineer again with code-reviewer's feedback
- pyspark-engineer implements all recommended changes
- Launch code-reviewer again to re-validate
2. Repeat Phase 1 → Phase 2 → Phase 3 until:
- code-reviewer explicitly states: "✓ 100% SATISFIED - No further changes required"
- Zero issues, warnings, or concerns remain
- All quality gates pass
- All business rules validated
3. Only then is the error fixing task complete.
**DO NOT PROCEED TO COMPLETION** until code-reviewer gives explicit 100% satisfaction confirmation.
## Pre-Execution Requirements
### 1. Python Coding Standards (CRITICAL - READ FIRST)
**MANDATORY**: All code MUST follow `.claude/rules/python_rules.md` standards (a short compliant example follows this list):
- **Line 19**: Use DataFrame API not Spark SQL
- **Line 20**: Do NOT use DataFrame aliases (e.g., `.alias("l")`) or `col()` function - use direct string references or `df["column"]` syntax
- **Line 8**: Limit line length to 240 characters
- **Line 9-10**: Single line per statement, no carriage returns mid-statement
- **Line 10, 12**: No blank lines inside functions
- **Line 11**: Close parentheses on the last line of code
- **Line 5**: Use type hints for all function parameters and return values
- **Line 18**: Import statements only at the start of file, never inside functions
- **Line 16**: Run `ruff check` and `ruff format` before finalizing
- Import only necessary PySpark functions: `from pyspark.sql.functions import when, coalesce, lit` (NO col() usage - use direct references instead)
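For illustration, a single compliant statement might look like this minimal sketch (the `offence_df` DataFrame and its columns are assumed for the example):

```python
from pyspark.sql.functions import when
final_df = offence_df.withColumn("final_timestamp", when(offence_df["reported_date_time"].isNotNull(), offence_df["reported_date_time"]).otherwise(offence_df["date_created"]))
```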
### 2. Identify Target File
- Default target: `python_files/gold/<INSERT FILE NAME>.py`
- Override via Makefile: `G_RUN_FILE_NAME` variable (line 63)
- Verify file exists before execution
### 3. Environment Context
- **Runtime Environment**: Local development (not Azure Synapse)
- **Working Directory**: `/workspaces/unify_2_1_dm_synapse_env_d10`
- **Python Version**: 3.11+
- **Spark Mode**: Local cluster (`local[*]`)
- **Data Location**: `/workspaces/data` (parquet files)
### 4. Available Resources
- **Data Dictionary**: `.claude/data_dictionary/*.md` - schema definitions for all CMS, FVMS, NicheRMS tables
- **Configuration**: `configuration.yaml` - database lists, null replacements, Azure settings
- **MCP Schema Server**: `mcp-server-motherduck` - live schema access via MCP (REQUIRED for schema verification)
- **Utilities Module**: `python_files/utilities/session_optimiser.py` - TableUtilities, NotebookLogger, decorators
- **Example Files**: Other `python_files/gold/g_*.py` files for reference patterns
### 5. MCP Server Validation (CRITICAL)
**BEFORE PROCEEDING**, verify MCP server connectivity:
1. **Test MCP Server Connection**:
- Attempt to query any known table schema via MCP
- Example test: Query schema for a common table (e.g., `silver_cms.s_cms_offence_report`)
2. **Validation Criteria**:
- MCP server must respond with valid schema data
- Schema must include column names, data types, and nullability
- Response must be recent (not cached/stale data)
3. **Failure Handling**:
```
⚠️ STOP: MCP Server Not Available
The MCP server (mcp-server-motherduck) is not responding or not providing valid schema data.
This command requires live schema access to:
- Verify column names and data types
- Validate join key compatibility
- Check foreign key relationships
- Ensure accurate schema matching
Actions Required:
1. Check MCP server status and configuration
2. Verify MotherDuck connection credentials
3. Ensure schema database is accessible
4. Restart MCP server if necessary
Cannot proceed with error fixing without verified schema access.
Use data dictionary files as fallback, but warn user of potential schema drift.
```
4. **Success Confirmation**:
```
✓ MCP Server Connected
✓ Schema data available
✓ Proceeding with error fixing workflow
```
## Error Detection Strategy
### Phase 1: Execute and Capture Errors
1. Run: `make gold_table`
2. Capture full stack trace including:
- Error type (AttributeError, KeyError, AnalysisException, etc.)
- Line number and function name
- Failed DataFrame operation
- Column names involved
- Join conditions if applicable
### Phase 2: Categorize Error Types
#### A. Code-Based Errors
**Syntax/Import Errors**
- Missing imports from `pyspark.sql.functions`
- Incorrect function signatures
- Type hint violations
- Decorator usage errors
**Runtime Errors**
- `AnalysisException`: Column not found, table doesn't exist
- `AttributeError`: Calling non-existent DataFrame methods
- `KeyError`: Dictionary access failures
- `TypeError`: Incompatible data types in operations
**DataFrame Schema Errors**
- Column name mismatches (case sensitivity)
- Duplicate column names after joins
- Missing required columns for downstream operations
- Incorrect column aliases
#### B. Logical Errors
**Join Issues**
- **Incorrect Join Keys**: Joining on wrong columns (e.g., `offence_report_id` vs `cms_offence_report_id`)
- **Missing Table Aliases**: Ambiguous column references after joins
- **Wrong Join Types**: Using `inner` when `left` is required (or vice versa)
- **Cartesian Products**: Missing join conditions causing data explosion
- **Broadcast Misuse**: Not using `broadcast()` for small dimension tables
- **Duplicate Join Keys**: Multiple rows with same key causing row multiplication
**Aggregation Problems**
- Incorrect `groupBy()` columns
- Missing aggregation functions (`first()`, `last()`, `collect_list()`)
- Wrong window specifications
- Aggregating on nullable columns without `coalesce()`
**Business Rule Violations**
- Incorrect date/time logic (e.g., using `reported_date_time` when `date_created` should be fallback)
- Missing null handling for critical fields
- Status code logic errors
- Incorrect coalesce order
**Data Quality Issues**
- Expected vs actual row counts (use `logger.info(f"Expected X rows, got {df.count()}")`)
- Null propagation in critical columns
- Duplicate records not being handled
- Missing deduplication logic
## Systematic Debugging Process
### Step 1: Schema Verification
For each source table mentioned in the error:
1. **PRIMARY: Query MCP Server for Schema** (MANDATORY FIRST STEP):
- Use MCP tools to query table schema from MotherDuck
- Extract column names, data types, nullability, and constraints
- Verify foreign key relationships for join operations
- Cross-reference with error column names
**Example MCP Query Pattern**:
```
Query: "Get schema for table silver_cms.s_cms_offence_report"
Expected Response: Column list with types and constraints
```
**If MCP Server Fails**:
- STOP and warn user (see Section 4: MCP Server Validation)
- Do NOT proceed with fixing without schema verification
- Suggest user check MCP server configuration
2. **SECONDARY: Verify Schema Using Data Dictionary** (as supplementary reference):
- Read `.claude/data_dictionary/{source}_{table}.md`
- Compare MCP schema vs data dictionary for consistency
- Note any schema drift or discrepancies
- Alert user if schemas don't match
3. **Check Table Existence**:
```python
spark.sql("SHOW TABLES IN silver_cms").show()
```
4. **Inspect Actual Runtime Schema** (validate MCP data):
```python
df = spark.read.table("silver_cms.s_cms_offence_report")
df.printSchema()
df.select(df.columns[:10]).show(5, truncate=False)
```
**Compare** (a schema comparison sketch follows this step):
- MCP schema vs Spark runtime schema
- Report any mismatches to user
- Use runtime schema as source of truth if conflicts exist
5. **Use DuckDB Schema** (if available, as additional validation):
- Query schema.db for column definitions
- Check foreign key relationships
- Validate join key data types
- Triangulate: MCP + DuckDB + Data Dictionary should align
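One hedged way to surface mismatches between the MCP-reported columns and the runtime schema (the `mcp_columns` set is assumed to be built from the MCP response; `spark` and `logger` are the existing session and NotebookLogger):

```python
mcp_columns = {"cms_offence_report_id", "case_file_id", "reported_date_time", "date_created", "status_code"}
runtime_columns = set(spark.read.table("silver_cms.s_cms_offence_report").columns)
logger.info(f"Reported by MCP but missing at runtime: {sorted(mcp_columns - runtime_columns)}")
logger.info(f"Present at runtime but not reported by MCP: {sorted(runtime_columns - mcp_columns)}")
```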
### Step 2: Join Logic Validation
For each join operation:
1. **Use MCP Server to Validate Join Relationships**:
- Query foreign key constraints from MCP schema server
- Identify correct join column names and data types
- Verify parent-child table relationships
- Confirm join key nullability (affects join results)
**Example MCP Queries**:
```
Query: "Show foreign keys for table silver_cms.s_cms_offence_report"
Query: "What columns link s_cms_offence_report to s_cms_case_file?"
Query: "Get data type for column cms_offence_report_id in silver_cms.s_cms_offence_report"
```
**If MCP Returns No Foreign Keys**:
- Fall back to data dictionary documentation
- Check `.claude/data_dictionary/` for relationship diagrams
- Manually verify join logic with business analyst
2. **Verify Join Keys Exist** (using MCP-confirmed column names):
```python
left_df.select("join_key_column").show(5)
right_df.select("join_key_column").show(5)
```
3. **Check Join Key Data Type Compatibility** (cross-reference with MCP schema):
```python
# Verify types match MCP schema expectations
left_df.select("join_key_column").dtypes
right_df.select("join_key_column").dtypes
```
4. **Check Join Key Uniqueness**:
```python
left_df.groupBy("join_key_column").count().filter("count > 1").show()
```
5. **Validate Join Type**:
- `left`: Keep all left records (most common for fact-to-dimension)
- `inner`: Only matching records
- Use `broadcast()` for small lookup tables (< 10MB)
- Confirm join type matches MCP foreign key relationship (nullable FK → left join)
6. **Handle Ambiguous Columns**:
```python
# BEFORE (causes ambiguity if both tables have same column names)
joined_df = left_df.join(right_df, on="common_id", how="left")
# AFTER (select specific columns to avoid ambiguity)
left_cols = [c for c in left_df.columns]
right_cols = ["dimension_field"]
joined_df = left_df.join(right_df, on="common_id", how="left").select(left_cols + right_cols)
```
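In addition to the checks above, a left anti-join gives a quick measure of how many left-side rows will fail to match - a hedged sketch reusing the generic names from the snippets above:

```python
unmatched_df = left_df.join(right_df, left_df["join_key_column"] == right_df["join_key_column"], how="left_anti")
logger.info(f"Left rows with no match on the right: {unmatched_df.count()}")
```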
### Step 3: Aggregation Verification
1. **Check groupBy Columns**:
- Must include all columns not being aggregated
- Verify columns exist in DataFrame
2. **Validate Aggregation Functions**:
```python
from pyspark.sql.functions import min, max, first, count, sum, coalesce, lit
aggregated = df.groupBy("key").agg(min("date_column").alias("earliest_date"), max("date_column").alias("latest_date"), first("dimension_column", ignorenulls=True).alias("dimension"), count("*").alias("record_count"), coalesce(sum("amount"), lit(0)).alias("total_amount"))
```
3. **Test Aggregation Logic** (a small sketch follows this step):
- Run aggregation on small sample
- Compare counts before/after
- Check for unexpected nulls
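A minimal sketch of that check, reusing the `df` and `aggregated` DataFrames from the example above:

```python
null_dimension_count = aggregated.filter(aggregated["dimension"].isNull()).count()
logger.info(f"Aggregation input rows: {df.count()}, output rows: {aggregated.count()}, null dimension values: {null_dimension_count}")
```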
### Step 4: Business Rule Testing
1. **Verify Timestamp Logic**:
```python
from pyspark.sql.functions import when
df.select("reported_date_time", "date_created", when(df["reported_date_time"].isNotNull(), df["reported_date_time"]).otherwise(df["date_created"]).alias("final_timestamp")).show(10)
```
2. **Test Null Handling**:
```python
from pyspark.sql.functions import coalesce, lit
df.select("primary_field", "fallback_field", coalesce(df["primary_field"], df["fallback_field"], lit(0)).alias("result")).show(10)
```
3. **Validate Status/Lookup Logic**:
- Check status code mappings against data dictionary
- Verify conditional logic matches business requirements
## Common Error Patterns and Fixes
### Pattern 1: Column Not Found After Join
**Error**: `AnalysisException: Column 'offence_report_id' not found`
**Root Cause**: Incorrect column name - verify column exists using MCP schema
**Fix**:
```python
# BEFORE - wrong column name
df = left_df.join(right_df, on="offence_report_id", how="left")
# AFTER - MCP-verified correct column name
df = left_df.join(right_df, on="cms_offence_report_id", how="left")
# If joining on different column names between tables:
df = left_df.join(right_df, left_df["cms_offence_report_id"] == right_df["offence_report_id"], how="left")
```
### Pattern 2: Duplicate Column Names
**Error**: Multiple columns with same name causing selection issues
**Fix**:
```python
# BEFORE - causes duplicate 'id' column
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left")
# AFTER - drop the duplicate column from the right side after the join
joined = left_df.join(right_df, left_df["id"] == right_df["id"], how="left").drop(right_df["id"])
# OR - rename columns to avoid duplicates
right_df_renamed = right_df.withColumnRenamed("id", "related_id")
joined = left_df.join(right_df_renamed, left_df["id"] == right_df_renamed["related_id"], how="left")
```
### Pattern 3: Incorrect Aggregation
**Error**: Column not in GROUP BY causing aggregation failure
**Fix**:
```python
from pyspark.sql.functions import min, first
# BEFORE - non-aggregated column not in groupBy
df.groupBy("key1").agg(min("date_field"), "non_aggregated_field")
# AFTER - all non-grouped columns must be aggregated
df = df.groupBy("key1").agg(min("date_field").alias("min_date"), first("non_aggregated_field", ignorenulls=True).alias("non_aggregated_field"))
```
### Pattern 4: Join Key Mismatch
**Error**: No matching records or unexpected cartesian product
**Fix**:
```python
left_df.select("join_key").show(20)
right_df.select("join_key").show(20)
left_df.select("join_key").dtypes
right_df.select("join_key").dtypes
left_df.filter(left_df["join_key"].isNull()).count()
right_df.filter(right_df["join_key"].isNull()).count()
result = left_df.join(right_df, left_df["join_key"].cast("int") == right_df["join_key"].cast("int"), how="left")
```
### Pattern 5: Missing Null Handling
**Error**: Unexpected nulls propagating through transformations
**Fix**:
```python
from pyspark.sql.functions import coalesce, lit
# BEFORE - NULL if either field is NULL
df = df.withColumn("result", df["field1"] + df["field2"])
# AFTER - handle nulls with coalesce
df = df.withColumn("result", coalesce(df["field1"], lit(0)) + coalesce(df["field2"], lit(0)))
```
## Validation Requirements
After fixing errors, validate (a short sketch follows this list):
1. **Row Counts**: Log and verify expected vs actual counts at each transformation
2. **Schema**: Ensure output schema matches target table requirements
3. **Nulls**: Check critical columns for unexpected nulls
4. **Duplicates**: Verify uniqueness of ID columns
5. **Data Ranges**: Check timestamp ranges and numeric bounds
6. **Join Results**: Sample joined records to verify correctness
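A hedged sketch of the row count, duplicate, and null checks (the `final_df` name and `id_column` are assumed for illustration):

```python
logger.info(f"Final row count: {final_df.count()}")
duplicate_ids = final_df.groupBy("id_column").count().filter("count > 1").count()
logger.info(f"Duplicate id_column values: {duplicate_ids}")
null_ids = final_df.filter(final_df["id_column"].isNull()).count()
logger.info(f"Null id_column values: {null_ids}")
```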
## Logging Requirements
Use `NotebookLogger` throughout:
```python
logger = NotebookLogger()
# Start of operation
logger.info(f"Starting extraction from {table_name}")
# After DataFrame creation
logger.info(f"Extracted {df.count()} records from {table_name}")
# After join
logger.info(f"Join completed: {joined_df.count()} records (expected ~X)")
# After transformation
logger.info(f"Transformation complete: {final_df.count()} records")
# On error
logger.error(f"Failed to process {table_name}: {error_message}")
# On success
logger.success(f"Successfully loaded {target_table_name}")
```
## Quality Gates (Must Run After Fixes)
```bash
# 1. Syntax validation
python3 -m py_compile python_files/gold/g_x_mg_cms_mo.py
# 2. Code quality check
ruff check python_files/gold/g_x_mg_cms_mo.py
# 3. Format code
ruff format python_files/gold/g_x_mg_cms_mo.py
# 4. Run fixed code
make gold_table
```
## Key Principles for PySpark Engineer Agent
1. **CRITICAL: Agent Workflow Required**: ALL error fixing must follow the 3-phase agent workflow (pyspark-engineer → code-reviewer → iterative refinement until 100% satisfied)
2. **CRITICAL: Validate MCP Server First**: Before starting, verify MCP server connectivity and schema availability. STOP and warn user if unavailable.
3. **Always Query MCP Schema First**: Use MCP server to get authoritative schema data before fixing any errors. Cross-reference with data dictionary.
4. **Use MCP for Join Validation**: Query foreign key relationships from MCP to ensure correct join logic and column names.
5. **DataFrame API Without Aliases or col()**: Use DataFrame API (NOT Spark SQL). NO DataFrame aliases. NO col() function. Use direct string references (e.g., `"column_name"`) or df["column"] syntax (e.g., `df["column_name"]`). Import only needed functions (e.g., `from pyspark.sql.functions import when, coalesce`)
6. **Test Incrementally**: Fix one error at a time, validate, then proceed
7. **Log Everything**: Add logging at every transformation step
8. **Handle Nulls**: Always consider null cases in business logic (check MCP nullability constraints)
9. **Verify Join Logic**: Check join keys, types, and uniqueness before implementing (use MCP data types)
10. **Use Utilities**: Leverage `TableUtilities` methods (add_row_hash, save_as_table, clean_date_time_columns)
11. **Follow Patterns**: Reference working gold layer files for established patterns
12. **Validate Business Rules**: Confirm logic with MCP schema, data dictionary, and user story requirements
13. **Clean Code**: Adhere to project standards (240 char line length, no blank lines in functions, type hints, single line per statement)
14. **Triple-Check Schemas**: When schema mismatch occurs, verify MCP → Runtime → Data Dictionary alignment and report discrepancies
15. **Code Review Loop**: Continue refactoring until code-reviewer explicitly confirms 100% satisfaction with zero remaining issues
## Example Workflow with MCP Integration
### Scenario: Fixing AnalysisException - Column 'offence_report_id' not found
**Step 1: MCP Server Validation**
```
Agent: Testing MCP server connectivity...
MCP Query: "List databases available"
MCP Response: ✓ [bronze_cms, silver_cms, silver_fvms, gold_data_model, ...]
Agent: ✓ MCP Server Connected
Agent: ✓ Schema data available
Agent: ✓ Proceeding with error fixing workflow
```
**Step 2: Query Schema via MCP**
```
Agent: Querying schema for silver_cms.s_cms_offence_report via MCP...
MCP Query: "DESCRIBE TABLE silver_cms.s_cms_offence_report"
MCP Response:
- cms_offence_report_id (BIGINT, PRIMARY KEY, NOT NULL)
- case_file_id (BIGINT, FOREIGN KEY → s_cms_case_file.cms_case_file_id, NULLABLE)
- reported_date_time (TIMESTAMP, NULLABLE)
- date_created (TIMESTAMP, NOT NULL)
- status_code (VARCHAR, NULLABLE)
... (50+ more columns)
Agent: Found correct column name: 'cms_offence_report_id' (not 'offence_report_id')
```
**Step 3: Verify Join Relationship via MCP**
```
Agent: Checking foreign key relationship for join...
MCP Query: "Show foreign keys from s_cms_offence_report to s_cms_case_file"
MCP Response:
- s_cms_offence_report.case_file_id → s_cms_case_file.cms_case_file_id
- Relationship: MANY-TO-ONE
- Nullability: case_file_id is NULLABLE → Use LEFT JOIN
Agent: ✓ Join should be: LEFT JOIN on case_file_id = cms_case_file_id
```
**Step 4: Apply Fix with MCP-Verified Schema**
```python
# BEFORE (error)
offence_df = spark.read.table("silver_cms.s_cms_offence_report")
case_df = spark.read.table("silver_cms.s_cms_case_file")
joined = offence_df.join(case_df, on="offence_report_id", how="left")
# AFTER (MCP-verified) - Use DataFrame API with correct column names
# MCP-confirmed FK: case_file_id → cms_case_file_id
# MCP-confirmed nullable FK → LEFT JOIN
# MCP-confirmed PK: cms_offence_report_id
case_df_subset = case_df.select("cms_case_file_id", "case_file_number")
joined = offence_df.join(case_df_subset, offence_df["case_file_id"] == case_df_subset["cms_case_file_id"], how="left")
joined = joined.withColumnRenamed("cms_offence_report_id", "offence_report_id")
```
**Step 5: Validate Fix**
```
Agent: Running fixed code...
Agent: ✓ No AnalysisException
Agent: ✓ Join executed successfully
Agent: Row count: 45,823 (expected ~45,000)
Agent: ✓ Schema matches MCP expectations
```
## Success Criteria
### Phase 1: Initial Error Fixing (pyspark-engineer)
- [ ] **MCP Server validated and responding** (MANDATORY FIRST CHECK)
- [ ] Schema verified via MCP server for all source tables
- [ ] Foreign key relationships confirmed via MCP queries
- [ ] All syntax errors resolved
- [ ] All runtime errors fixed
- [ ] Join logic validated and correct (using MCP-confirmed column names and types)
- [ ] DataFrame API used (NOT Spark SQL) per python_rules.md line 19
- [ ] NO DataFrame aliases or col() function used - direct string references or df["column"] syntax only (per python_rules.md line 20)
- [ ] Code follows python_rules.md standards: 240 char lines, no blank lines in functions, single line per statement, imports at top only
- [ ] Row counts logged and reasonable
- [ ] Business rules implemented correctly
- [ ] Output schema matches requirements (cross-referenced with MCP schema)
- [ ] Code passes quality gates (py_compile, ruff check, ruff format)
- [ ] `make gold_table` executes successfully
- [ ] Target table created/updated in `gold_data_model` database
- [ ] No schema drift reported between MCP, Runtime, and Data Dictionary sources
### Phase 2: Code Review (code-reviewer)
- [ ] code-reviewer agent launched with fixed code
- [ ] Comprehensive review completed covering:
- [ ] PySpark best practices adherence
- [ ] Join logic correctness
- [ ] Schema alignment validation
- [ ] Business rule implementation accuracy
- [ ] Code quality and standards compliance
- [ ] Security vulnerabilities (none found)
- [ ] Performance optimization opportunities addressed
### Phase 3: Iterative Refinement (MANDATORY UNTIL 100% SATISFIED)
- [ ] All code-reviewer feedback items addressed by pyspark-engineer
- [ ] Re-review completed by code-reviewer
- [ ] Iteration cycle repeated until code-reviewer explicitly confirms:
- [ ] **"✓ 100% SATISFIED - No further changes required"**
- [ ] Zero remaining issues, warnings, or concerns
- [ ] All quality gates pass
- [ ] All business rules validated
- [ ] Code meets production-ready standards
### Final Approval
- [ ] **code-reviewer has explicitly confirmed 100% satisfaction**
- [ ] No outstanding issues or concerns remain
- [ ] Task is complete and ready for production deployment

116
commands/refactor-code.md Executable file
View File

@@ -0,0 +1,116 @@
# Intelligently Refactor and Improve Code Quality
## Instructions
Follow this systematic approach to refactor code: **$ARGUMENTS**
1. **Pre-Refactoring Analysis**
- Identify the code that needs refactoring and the reasons why
- Understand the current functionality and behavior completely
- Review existing tests and documentation
- Identify all dependencies and usage points
2. **Test Coverage Verification**
- Ensure comprehensive test coverage exists for the code being refactored
- If tests are missing, write them BEFORE starting refactoring
- Run all tests to establish a baseline
- Document current behavior with additional tests if needed
3. **Refactoring Strategy**
- Define clear goals for the refactoring (performance, readability, maintainability)
- Choose appropriate refactoring techniques:
- Extract Method/Function (see the short sketch at the end of this document)
- Extract Class/Component
- Rename Variable/Method
- Move Method/Field
- Replace Conditional with Polymorphism
- Eliminate Dead Code
- Plan the refactoring in small, incremental steps
4. **Environment Setup**
- Create a new branch: `git checkout -b refactor/$ARGUMENTS`
- Ensure all tests pass before starting
- Set up any additional tooling needed (profilers, analyzers)
5. **Incremental Refactoring**
- Make small, focused changes one at a time
- Run tests after each change to ensure nothing breaks
- Commit working changes frequently with descriptive messages
- Use IDE refactoring tools when available for safety
6. **Code Quality Improvements**
- Improve naming conventions for clarity
- Eliminate code duplication (DRY principle)
- Simplify complex conditional logic
- Reduce method/function length and complexity
- Improve separation of concerns
7. **Performance Optimizations**
- Identify and eliminate performance bottlenecks
- Optimize algorithms and data structures
- Reduce unnecessary computations
- Improve memory usage patterns
8. **Design Pattern Application**
- Apply appropriate design patterns where beneficial
- Improve abstraction and encapsulation
- Enhance modularity and reusability
- Reduce coupling between components
9. **Error Handling Improvement**
- Standardize error handling approaches
- Improve error messages and logging
- Add proper exception handling
- Enhance resilience and fault tolerance
10. **Documentation Updates**
- Update code comments to reflect changes
- Revise API documentation if interfaces changed
- Update inline documentation and examples
- Ensure comments are accurate and helpful
11. **Testing Enhancements**
- Add tests for any new code paths created
- Improve existing test quality and coverage
- Remove or update obsolete tests
- Ensure tests are still meaningful and effective
12. **Static Analysis**
- Run linting tools to catch style and potential issues
- Use static analysis tools to identify problems
- Check for security vulnerabilities
- Verify code complexity metrics
13. **Performance Verification**
- Run performance benchmarks if applicable
- Compare before/after metrics
- Ensure refactoring didn't degrade performance
- Document any performance improvements
14. **Integration Testing**
- Run full test suite to ensure no regressions
- Test integration with dependent systems
- Verify all functionality works as expected
- Test edge cases and error scenarios
15. **Code Review Preparation**
- Review all changes for quality and consistency
- Ensure refactoring goals were achieved
- Prepare clear explanation of changes made
- Document benefits and rationale
16. **Documentation of Changes**
- Create a summary of refactoring changes
- Document any breaking changes or new patterns
- Update project documentation if needed
- Explain benefits and reasoning for future reference
17. **Deployment Considerations**
- Plan deployment strategy for refactored code
- Consider feature flags for gradual rollout
- Prepare rollback procedures
- Set up monitoring for the refactored components
Remember: Refactoring should preserve external behavior while improving internal structure. Always prioritize safety over speed, and maintain comprehensive test coverage throughout the process.
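As a small illustration of the first technique listed in step 3, an Extract Method refactoring in Python (hypothetical example) might look like:

```python
# BEFORE - validation logic buried inside a longer function
def process_order(order: dict) -> float:
    if not order.get("items") or order.get("total", 0) <= 0:
        raise ValueError("Invalid order")
    return order["total"] * 0.9  # apply a 10% discount (illustrative)

# AFTER - validation extracted into a named, independently testable function
def validate_order(order: dict) -> None:
    if not order.get("items") or order.get("total", 0) <= 0:
        raise ValueError("Invalid order")

def process_order(order: dict) -> float:
    validate_order(order)
    return order["total"] * 0.9  # apply a 10% discount (illustrative)
```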

View File

@@ -0,0 +1,37 @@
---
allowed-tools: Read, Write, Edit, Bash
argument-hint: [environment-type] | --development | --production | --microservices | --compose
description: Setup Docker containerization with multi-stage builds and development workflows
model: sonnet
---
# Setup Docker Containers
Setup comprehensive Docker containerization for development and production: **$ARGUMENTS**
## Current Project State
- Application type: @package.json or @requirements.txt (detect Node.js, Python, etc.)
- Existing Docker: @Dockerfile or @docker-compose.yml (if exists)
- Dependencies: !`find . -name "package-lock.json" -o -name "poetry.lock" -o -name "Pipfile.lock" | wc -l`
- Services needed: Database, cache, message queue detection from configs
## Task
Implement production-ready Docker containerization with optimized builds and development workflows:
**Environment Type**: Use $ARGUMENTS to specify development, production, microservices, or Docker Compose setup
**Containerization Strategy**:
1. **Dockerfile Creation** - Multi-stage builds, layer optimization, security best practices
2. **Development Workflow** - Hot reloading, volume mounts, debugging capabilities
3. **Production Optimization** - Image size reduction, security scanning, health checks
4. **Multi-Service Setup** - Docker Compose, service discovery, networking configuration
5. **CI/CD Integration** - Build automation, registry management, deployment pipelines
6. **Monitoring & Logs** - Container observability, log aggregation, resource monitoring
**Security Features**: Non-root users, minimal base images, vulnerability scanning, secrets management.
**Performance Optimization**: Layer caching, build contexts, multi-platform builds, and resource constraints.
**Output**: Complete Docker setup with optimized containers, development workflows, production deployment, and comprehensive documentation.

153
commands/ultra-think.md Executable file
View File

@@ -0,0 +1,153 @@
# Deep Analysis and Problem Solving Mode
## Instructions
1. **Initialize Ultra Think Mode**
- Acknowledge the request for enhanced analytical thinking
- Set context for deep, systematic reasoning
- Prepare to explore the problem space comprehensively
2. **Parse the Problem or Question**
- Extract the core challenge from: **$ARGUMENTS**
- Identify all stakeholders and constraints
- Recognize implicit requirements and hidden complexities
- Question assumptions and surface unknowns
3. **Multi-Dimensional Analysis**
Approach the problem from multiple angles:
### Technical Perspective
- Analyze technical feasibility and constraints
- Consider scalability, performance, and maintainability
- Evaluate security implications
- Assess technical debt and future-proofing
### Business Perspective
- Understand business value and ROI
- Consider time-to-market pressures
- Evaluate competitive advantages
- Assess risk vs. reward trade-offs
### User Perspective
- Analyze user needs and pain points
- Consider usability and accessibility
- Evaluate user experience implications
- Think about edge cases and user journeys
### System Perspective
- Consider system-wide impacts
- Analyze integration points
- Evaluate dependencies and coupling
- Think about emergent behaviors
4. **Generate Multiple Solutions**
- Brainstorm at least 3-5 different approaches
- For each approach, consider:
- Pros and cons
- Implementation complexity
- Resource requirements
- Potential risks
- Long-term implications
- Include both conventional and creative solutions
- Consider hybrid approaches
5. **Deep Dive Analysis**
For the most promising solutions:
- Create detailed implementation plans
- Identify potential pitfalls and mitigation strategies
- Consider phased approaches and MVPs
- Analyze second and third-order effects
- Think through failure modes and recovery
6. **Cross-Domain Thinking**
- Draw parallels from other industries or domains
- Apply design patterns from different contexts
- Consider biological or natural system analogies
- Look for innovative combinations of existing solutions
7. **Challenge and Refine**
- Play devil's advocate with each solution
- Identify weaknesses and blind spots
- Consider "what if" scenarios
- Stress-test assumptions
- Look for unintended consequences
8. **Synthesize Insights**
- Combine insights from all perspectives
- Identify key decision factors
- Highlight critical trade-offs
- Summarize innovative discoveries
- Present a nuanced view of the problem space
9. **Provide Structured Recommendations**
Present findings in a clear structure:
```
## Problem Analysis
- Core challenge
- Key constraints
- Critical success factors
## Solution Options
### Option 1: [Name]
- Description
- Pros/Cons
- Implementation approach
- Risk assessment
### Option 2: [Name]
[Similar structure]
## Recommendation
- Recommended approach
- Rationale
- Implementation roadmap
- Success metrics
- Risk mitigation plan
## Alternative Perspectives
- Contrarian view
- Future considerations
- Areas for further research
```
10. **Meta-Analysis**
- Reflect on the thinking process itself
- Identify areas of uncertainty
- Acknowledge biases or limitations
- Suggest additional expertise needed
- Provide confidence levels for recommendations
## Usage Examples
```bash
# Architectural decision
/project:ultra-think Should we migrate to microservices or improve our monolith?
# Complex problem solving
/project:ultra-think How do we scale our system to handle 10x traffic while reducing costs?
# Strategic planning
/project:ultra-think What technology stack should we choose for our next-gen platform?
# Design challenge
/project:ultra-think How can we improve our API to be more developer-friendly while maintaining backward compatibility?
```
## Key Principles
- **First Principles Thinking**: Break down to fundamental truths
- **Systems Thinking**: Consider interconnections and feedback loops
- **Probabilistic Thinking**: Work with uncertainties and ranges
- **Inversion**: Consider what to avoid, not just what to do
- **Second-Order Thinking**: Consider consequences of consequences
## Output Expectations
- Comprehensive analysis (typically 2-4 pages of insights)
- Multiple viable solutions with trade-offs
- Clear reasoning chains
- Acknowledgment of uncertainties
- Actionable recommendations
- Novel insights or perspectives

672
commands/update-docs.md Executable file
View File

@@ -0,0 +1,672 @@
---
allowed-tools: Read, Write, Edit, Bash, Grep, Glob, Task, mcp__*
argument-hint: [doc-type] | --generate-local | --sync-to-wiki | --regenerate | --all | --validate
description: Generate documentation locally to ./docs/ then sync to Azure DevOps wiki (local-first workflow)
model: sonnet
---
# Data Pipeline Documentation - Local-First Workflow
Generate documentation locally in `./docs/` directory, then sync to Azure DevOps wiki: $ARGUMENTS
## Architecture: Local-First Documentation
```
Source Code → Generate Docs → ./docs/ (version controlled) → Sync to Wiki
```
**Benefits:**
- ✅ Documentation version controlled in git
- ✅ Review locally before wiki publish
- ✅ No regeneration needed for wiki sync
- ✅ Git diff shows doc changes
- ✅ Reusable across multiple targets (wiki, GitHub Pages, PDF)
- ✅ Offline access to documentation
## Repository Information
- Repository: unify_2_1_dm_synapse_env_d10
- Local docs: `./docs/` (mirrors repo structure)
- Wiki base: 'Unify 2.1 Data Migration Technical Documentation'/'Data Migration Pipeline'/unify_2_1_dm_synapse_env_d10/
- Exclusions: @.docsignore (similar to .gitignore)
## Documentation Workflows
### --generate-local: Generate Documentation Locally
Generate comprehensive documentation and save to `./docs/` directory.
#### Step 1: Scan Repository for Files
```bash
# Get all documentable files (exclude .docsignore patterns)
git ls-files "*.py" "*.yaml" "*.yml" "*.md" | grep -v -f <(git ls-files --ignored --exclude-standard --exclude-from=.docsignore)
```
**Target files:**
- Python files: `python_files/**/*.py`
- Configuration: `configuration.yaml`
- Existing markdown: `README.md` (validate/enhance)
**Exclude (from .docsignore):**
- `__pycache__/`, `*.pyc`, `.venv/`
- `.claude/`, `docs/`, `*.duckdb`
- See `.docsignore` for complete list
#### Step 2: Launch Code-Documenter Agent
Use Task tool to launch code-documenter agent:
```
Generate comprehensive documentation for repository files:
**Scope:**
- Target: All Python files in python_files/ (utilities, bronze, silver, gold, testing)
- Configuration files: configuration.yaml
- Exclude: Files matching .docsignore patterns
**Documentation Requirements:**
For Python files:
- File purpose and overview
- Architecture and design patterns (medallion, ETL, etc.)
- Class and function documentation
- Data flow explanations
- Business logic descriptions
- Dependencies and imports
- Usage examples
- Testing information
- Related Azure DevOps work items
For Configuration files:
- Configuration structure
- All configuration sections explained
- Environment variables
- Azure integration settings
- Usage examples
**Output Format:**
- Markdown format suitable for wiki
- File naming: source_file.py → docs/path/source_file.py.md
- Clear heading structure
- Code examples with syntax highlighting
- Cross-references to related files
- Professional, concise language
- NO attribution footers (e.g., "Documentation By: Claude Code")
**Output Location:**
Save all generated documentation to ./docs/ directory maintaining source structure:
- python_files/utilities/session_optimiser.py → docs/python_files/utilities/session_optimiser.py.md
- python_files/gold/g_address.py → docs/python_files/gold/g_address.py.md
- configuration.yaml → docs/configuration.yaml.md
**Directory Index Files:**
Generate README.md for each directory with:
- Directory purpose
- List of files with brief descriptions
- Architecture overview for layer directories
- Navigation links
```
#### Step 3: Generate Directory Index Files
Create `README.md` files for each directory:
**Root Index (docs/README.md):**
- Overall documentation structure
- Navigation to main sections
- Medallion architecture overview
- Link to wiki
**Layer Indexes:**
- `docs/python_files/README.md` - Pipeline overview
- `docs/python_files/utilities/README.md` - Core utilities index
- `docs/python_files/bronze/README.md` - Bronze layer overview
- `docs/python_files/silver/README.md` - Silver layer overview
- `docs/python_files/silver/cms/README.md` - CMS tables index
- `docs/python_files/silver/fvms/README.md` - FVMS tables index
- `docs/python_files/silver/nicherms/README.md` - NicheRMS tables index
- `docs/python_files/gold/README.md` - Gold layer overview
- `docs/python_files/testing/README.md` - Testing documentation
#### Step 4: Validation
Verify generated documentation:
- All source files have corresponding .md files in ./docs/
- Directory structure matches source repository
- Index files (README.md) created for directories
- Markdown formatting is valid
- No files from .docsignore included
- Cross-references are valid
#### Step 5: Summary Report
Provide detailed report:
```markdown
## Documentation Generation Complete
### Files Documented:
- Python files: [count]
- Configuration files: [count]
- Total documentation files: [count]
### Directory Structure:
- Utilities: [file count]
- Bronze layer: [file count]
- Silver layer: [file count by database]
- Gold layer: [file count]
- Testing: [file count]
### Index Files Created:
- Root index: docs/README.md
- Layer indexes: [list]
- Database indexes: [list]
### Location:
All documentation saved to: ./docs/
### Next Steps:
1. Review generated documentation: `ls -R ./docs/`
2. Make any manual edits if needed
3. Commit to git: `git add docs/`
4. Sync to wiki: `/update-docs --sync-to-wiki`
```
---
### --sync-to-wiki: Sync Local Docs to Azure DevOps Wiki
Copy documentation from `./docs/` to Azure DevOps wiki (no regeneration).
#### Step 1: Scan Local Documentation
```bash
# Find all .md files in ./docs/
find ./docs -name "*.md" -type f
```
**Path Mapping Logic:**
Local path → Wiki path conversion:
```
./docs/python_files/utilities/session_optimiser.py.md
Unify 2.1 Data Migration Technical Documentation/
Data Migration Pipeline/
unify_2_1_dm_synapse_env_d10/
python_files/utilities/session_optimiser.py
```
**Mapping rules:**
1. Remove `./docs/` prefix
2. Remove `.md` extension (unless README.md → README)
3. Prepend wiki base path
4. Use forward slashes for wiki paths
#### Step 2: Read and Process Each Documentation File
For each `.md` file in `./docs/`:
1. Read markdown content
2. Extract metadata (if present)
3. Generate wiki path from local path
4. Prepare content for wiki format
5. Add footer with metadata:
```markdown
---
**Metadata:**
- Source: [file path in repo]
- Last Updated: [date]
- Related Work Items: [links if available]
```
#### Step 3: Create/Update Wiki Pages Using ADO MCP
Use Azure DevOps MCP to create or update each wiki page:
```bash
# For each documentation file:
# 1. Check if wiki page exists
# 2. Create new page if not exists
# 3. Update existing page if exists
# 4. Verify success
# Example for session_optimiser.py.md:
Local: ./docs/python_files/utilities/session_optimiser.py.md
Wiki: Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/python_files/utilities/session_optimiser.py
Action: Create/Update wiki page with content
```
**ADO MCP Operations:**
```python
# Pseudo-code for sync operation
for doc_file in find_all_docs():
wiki_path = local_to_wiki_path(doc_file)
content = read_file(doc_file)
# Use MCP to create/update
mcp__Azure_DevOps__create_or_update_wiki_page(
path=wiki_path,
content=content
)
```
#### Step 4: Verification
After sync, verify:
- All .md files from ./docs/ have corresponding wiki pages
- Wiki path structure matches local structure
- Content is properly formatted in wiki
- No sync errors
- Wiki pages accessible in Azure DevOps
#### Step 5: Summary Report
Provide detailed sync report:
```markdown
## Wiki Sync Complete
### Pages Synced:
- Total pages: [count]
- Created new: [count]
- Updated existing: [count]
### By Directory:
- Utilities: [count] pages
- Bronze: [count] pages
- Silver: [count] pages
- CMS: [count] pages
- FVMS: [count] pages
- NicheRMS: [count] pages
- Gold: [count] pages
- Testing: [count] pages
### Wiki Location:
Base: Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/
### Verification:
- All pages synced successfully: [✅/❌]
- Path structure correct: [✅/❌]
- Content formatting valid: [✅/❌]
### Errors:
[List any sync failures and reasons]
### Next Steps:
1. Verify pages in Azure DevOps wiki
2. Check navigation and cross-references
3. Share wiki URL with team
```
---
### --regenerate: Regenerate Specific File(s)
Update documentation for specific file(s) without full regeneration.
**Usage:**
```bash
# Single file
/update-docs --regenerate python_files/gold/g_address.py
# Multiple files
/update-docs --regenerate python_files/gold/g_address.py python_files/gold/g_cms_address.py
# Entire directory
/update-docs --regenerate python_files/utilities/
```
**Process:**
1. Launch code-documenter agent for specified file(s)
2. Generate updated documentation
3. Save to ./docs/ (overwrite existing)
4. Report files updated
5. Optionally sync to wiki
**Output:**
```markdown
## Documentation Regenerated
### Files Updated:
- python_files/gold/g_address.py → docs/python_files/gold/g_address.py.md
### Next Steps:
1. Review updated documentation
2. Commit changes: `git add docs/python_files/gold/g_address.py.md`
3. Sync to wiki: `/update-docs --sync-to-wiki --directory python_files/gold/`
```
---
### --all: Complete Workflow
Execute complete documentation workflow: generate local + sync to wiki.
**Process:**
1. Execute `--generate-local` workflow
2. Validate generated documentation
3. Execute `--sync-to-wiki` workflow
4. Provide comprehensive summary
**Use when:**
- Initial documentation setup
- Major refactoring or restructuring
- Adding new layers or modules
- Quarterly documentation refresh
---
### --validate: Documentation Validation
Validate documentation completeness and accuracy.
**Validation Checks** (a completeness-check sketch follows this list):
1. **Completeness:**
- All source files have documentation
- All directories have index files (README.md)
- No missing cross-references
2. **Accuracy:**
- Documented functions exist in source
- Schema documentation matches actual tables
- Configuration docs match configuration.yaml
3. **Quality:**
- Valid markdown syntax
- Proper heading structure
- Code blocks properly formatted
- No broken links
4. **Sync Status:**
- ./docs/ files match wiki pages
- No uncommitted documentation changes
- Wiki pages up to date
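A minimal sketch of check 1 (completeness), assuming the repository and `./docs/` layout described above:

```python
from pathlib import Path
source_files = sorted(Path("python_files").rglob("*.py"))
missing_docs = [str(src) for src in source_files if not Path("docs", f"{src}.md").exists()]
print(f"Source files without documentation: {len(missing_docs)}")
for path in missing_docs:
    print(f"  - {path}")
```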
**Validation Report:**
```markdown
## Documentation Validation Results
### Completeness: [✅/❌]
- Files without docs: [count]
- Missing index files: [count]
- Missing cross-references: [count]
### Accuracy: [✅/❌]
- Schema mismatches: [count]
- Outdated function docs: [count]
- Configuration drift: [count]
### Quality: [✅/❌]
- Markdown syntax errors: [count]
- Broken links: [count]
- Formatting issues: [count]
### Sync Status: [✅/❌]
- Out-of-sync files: [count]
- Uncommitted changes: [count]
- Wiki drift: [count]
### Actions Required:
[List of fixes needed]
```
---
## Optional Workflow Modifiers
### --layer: Target Specific Layer
Generate/sync documentation for specific layer only.
```bash
/update-docs --generate-local --layer utilities
/update-docs --generate-local --layer gold
/update-docs --sync-to-wiki --layer silver
```
### --directory: Target Specific Directory
Generate/sync documentation for specific directory.
```bash
/update-docs --generate-local --directory python_files/gold/
/update-docs --sync-to-wiki --directory python_files/utilities/
```
### --only-modified: Sync Only Changed Files
Sync only files modified since last sync (based on git status).
```bash
/update-docs --sync-to-wiki --only-modified
```
**Process** (a small sketch follows):
1. Check git status for modified .md files in ./docs/
2. Sync only those files to wiki
3. Faster than full sync
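A small Python sketch of step 1 (the porcelain parsing is simplified and renames are ignored):

```python
import subprocess
result = subprocess.run(["git", "status", "--porcelain", "docs/"], capture_output=True, text=True, check=True)
modified_docs = [line[3:] for line in result.stdout.splitlines() if line.endswith(".md")]
print(f"Documentation files to sync: {modified_docs}")
```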
---
## Code-Documenter Agent Integration
### When to Use Code-Documenter Agent:
**Always use Task tool with subagent_type="code-documenter" for:**
1. **Initial documentation generation** (--generate-local)
2. **File regeneration** (--regenerate)
3. **Complex transformations** - ETL logic, medallion patterns
4. **Architecture documentation** - High-level system design
### Agent Invocation Pattern:
```markdown
Launch code-documenter agent with:
- Target files: [list of files or directories]
- Documentation scope: comprehensive documentation
- Focus areas: [medallion architecture | ETL logic | utilities | testing]
- Output format: Wiki-ready markdown
- Output location: ./docs/ (maintain source structure)
- Exclude patterns: Files from .docsignore
- Quality requirements: Professional, accurate, no attribution footers
```
---
## Path Mapping Reference
### Local to Wiki Path Conversion
**Function logic:**
```python
def local_to_wiki_path(local_path: str) -> str:
"""
Convert local docs path to Azure DevOps wiki path
Args:
local_path: Path like ./docs/python_files/utilities/session_optimiser.py.md
Returns:
Wiki path like: Unify 2.1 Data Migration Technical Documentation/.../session_optimiser.py
"""
# Remove ./docs/ prefix
relative = local_path.replace('./docs/', '')
# Handle README.md (keep as README)
if relative.endswith('/README.md'):
relative = relative # Keep README.md
elif relative.endswith('.md'):
relative = relative[:-3] # Remove .md extension
# Build wiki path
wiki_base = "Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10"
wiki_path = f"{wiki_base}/{relative}"
return wiki_path
```
**Examples:**
```
./docs/README.md
→ Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/README
./docs/python_files/utilities/session_optimiser.py.md
→ Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/python_files/utilities/session_optimiser.py
./docs/python_files/gold/g_address.py.md
→ Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/python_files/gold/g_address.py
./docs/configuration.yaml.md
→ Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/configuration.yaml
```
---
## Azure DevOps MCP Commands
### Wiki Operations:
```bash
# Create wiki page
mcp__Azure_DevOps__create_wiki_page(
path="Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/python_files/utilities/session_optimiser.py",
content="[markdown content]"
)
# Update wiki page
mcp__Azure_DevOps__update_wiki_page(
path="[wiki page path]",
content="[updated markdown content]"
)
# List wiki pages in directory
mcp__Azure_DevOps__list_wiki_pages(
path="Unify 2.1 Data Migration Technical Documentation/Data Migration Pipeline/unify_2_1_dm_synapse_env_d10/python_files/gold"
)
# Delete wiki page (cleanup)
mcp__Azure_DevOps__delete_wiki_page(
path="[wiki page path]"
)
```
---
## Guidelines
### DO:
- ✅ Generate documentation locally first (./docs/)
- ✅ Review and edit documentation before wiki sync
- ✅ Commit documentation to git with code changes
- ✅ Use code-documenter agent for comprehensive docs
- ✅ Respect .docsignore patterns
- ✅ Maintain directory structure matching source repo
- ✅ Generate index files (README.md) for directories
- ✅ Use --only-modified for incremental wiki updates
- ✅ Validate documentation regularly
- ✅ Link to Azure DevOps work items in docs
### DO NOT:
- ❌ Generate documentation directly to wiki (bypass ./docs/)
- ❌ Skip local review before wiki publish
- ❌ Document files in .docsignore (__pycache__/, *.pyc, .env)
- ❌ Include attribution footers ("Documentation By: Claude Code")
- ❌ Duplicate documentation in multiple locations
- ❌ Create wiki pages without proper path structure
- ❌ Forget to update documentation when code changes
- ❌ Sync to wiki without validating locally first
---
## Documentation Quality Standards
### For Python Files:
- Clear file purpose and overview
- Architecture and design pattern explanations
- Class and function documentation with type hints
- Data flow diagrams for ETL transformations
- Business logic explanations
- Usage examples with code snippets
- Testing information and coverage
- Dependencies and related files
- Related Azure DevOps work items
### For Configuration Files:
- Section-by-section explanation
- Environment variable documentation
- Azure integration details
- Usage examples
- Valid value ranges and constraints
### For Index Files (README.md):
- Directory purpose and overview
- File listing with brief descriptions
- Architecture context (for layers)
- Navigation links to sub-sections
- Key concepts and patterns
### Markdown Quality:
- Clear heading hierarchy (H1 → H2 → H3)
- Code blocks with language specification
- Tables for structured data
- Cross-references using relative links
- No broken links
- Professional, concise language
- Valid markdown syntax
---
## Git Integration
### Commit Documentation with Code:
```bash
# Add both code and documentation
git add python_files/gold/g_address.py docs/python_files/gold/g_address.py.md
git commit -m "feat(gold): add g_address table with documentation"
# View documentation changes
git diff docs/
# Documentation visible in PR reviews
```
### Pre-commit Hook (Optional):
```bash
# Validate documentation before commit
# In .git/hooks/pre-commit:
/update-docs --validate
```
---
## Output Summary Template
After any workflow completion, provide:
### 1. Workflow Executed:
- Command: [command used]
- Scope: [what was processed]
- Duration: [time taken]
### 2. Documentation Generated/Updated:
- Files processed: [count and list]
- Location: ./docs/
- Size: [total documentation size]
### 3. Wiki Sync Results (if applicable):
- Pages created: [count]
- Pages updated: [count]
- Wiki path: [base path]
- Status: [success/partial/failed]
### 4. Validation Results:
- Completeness: [✅/❌]
- Accuracy: [✅/❌]
- Quality: [✅/❌]
- Issues found: [count and details]
### 5. Next Steps:
- Recommended actions
- Areas needing attention
- Suggested improvements

326
commands/write-tests.md Executable file
View File

@@ -0,0 +1,326 @@
---
allowed-tools: Read, Write, Edit, Bash
argument-hint: [target-file] | [test-type] | --unit | --integration | --data-validation | --medallion
description: Write comprehensive pytest tests for PySpark data pipelines with live data validation
model: sonnet
---
# Write Tests - pytest + PySpark with Live Data
Write comprehensive pytest tests for PySpark data pipelines using **LIVE DATA** sources: **$ARGUMENTS**
## Current Testing Context
- Test framework: !`[ -f pytest.ini ] && echo "pytest configured" || echo "pytest setup needed"`
- Target: $ARGUMENTS (file/layer to test)
- Test location: !`ls -d tests/ test/ 2>/dev/null | head -1 || echo "tests/ (will create)"`
- Live data available: Bronze/Silver/Gold layers with real FVMS, CMS, NicheRMS tables
## Core Principle: TEST WITH LIVE DATA
**ALWAYS use real data from Bronze/Silver/Gold layers**. No mocked data unless absolutely necessary.
## pytest Testing Framework
### 1. Test File Organization
```
tests/
├── conftest.py # Shared fixtures (Spark session, live data)
├── test_bronze_ingestion.py # Bronze layer validation
├── test_silver_transformations.py # Silver layer ETL
├── test_gold_aggregations.py # Gold layer analytics
├── test_utilities.py # TableUtilities, NotebookLogger
└── integration/
└── test_end_to_end_pipeline.py
```
### 2. Essential pytest Fixtures (conftest.py)
```python
import pytest
from pyspark.sql import SparkSession
from python_files.utilities.session_optimiser import SparkOptimiser
@pytest.fixture(scope="session")
def spark():
"""Shared Spark session for all tests - reuses SparkOptimiser"""
session = SparkOptimiser.get_optimised_spark_session()
yield session
session.stop()
@pytest.fixture(scope="session")
def bronze_data(spark):
"""Live bronze layer data - REAL DATA"""
return spark.table("bronze_fvms.b_vehicle_master")
@pytest.fixture(scope="session")
def silver_data(spark):
"""Live silver layer data - REAL DATA"""
return spark.table("silver_fvms.s_vehicle_master")
@pytest.fixture
def sample_live_data(bronze_data):
"""Small sample from live data for fast tests"""
return bronze_data.limit(100)
```
### 3. pytest Test Patterns
#### Pattern 1: Unit Tests (Individual Functions)
```python
# tests/test_utilities.py
import pytest
from python_files.utilities.session_optimiser import TableUtilities
class TestTableUtilities:
def test_add_row_hash_creates_hash_column(self, spark, sample_live_data):
"""Verify add_row_hash() creates hash_key column"""
result = TableUtilities.add_row_hash(sample_live_data, ["vehicle_id"])
assert "hash_key" in result.columns
assert result.count() == sample_live_data.count()
def test_drop_duplicates_simple_removes_exact_duplicates(self, spark):
"""Test deduplication on live data"""
# Use LIVE data with known duplicates
raw_data = spark.table("bronze_fvms.b_vehicle_events")
result = TableUtilities.drop_duplicates_simple(raw_data)
assert result.count() <= raw_data.count()
@pytest.mark.parametrize("date_col", ["created_date", "updated_date", "event_date"])
def test_clean_date_time_columns_handles_all_formats(self, spark, bronze_data, date_col):
"""Parameterized test for date cleaning"""
        if date_col not in bronze_data.columns:
            pytest.skip(f"{date_col} not present in bronze_fvms.b_vehicle_master")
        result = TableUtilities.clean_date_time_columns(bronze_data, [date_col])
        assert date_col in result.columns
```
#### Pattern 2: Integration Tests (End-to-End)
```python
# tests/integration/test_end_to_end_pipeline.py
import pytest
from python_files.silver.fvms.s_vehicle_master import VehicleMaster
class TestSilverVehicleMasterPipeline:
def test_full_etl_with_live_bronze_data(self, spark):
"""Test complete Bronze → Silver transformation with LIVE data"""
# Extract: Read LIVE bronze data
bronze_table = "bronze_fvms.b_vehicle_master"
bronze_df = spark.table(bronze_table)
        initial_count = bronze_df.count()
        assert initial_count > 0, "Bronze source table should not be empty"
# Transform & Load: Run actual ETL class
etl = VehicleMaster(bronze_table_name=bronze_table)
# Validate: Check LIVE silver output
silver_df = spark.table("silver_fvms.s_vehicle_master")
assert silver_df.count() > 0
assert "hash_key" in silver_df.columns
assert "load_timestamp" in silver_df.columns
# Data quality: No nulls in critical fields
assert silver_df.filter("vehicle_id IS NULL").count() == 0
```
#### Pattern 3: Data Validation (Live Data Checks)
```python
# tests/test_data_validation.py
import pytest
class TestBronzeLayerDataQuality:
"""Validate live data quality in Bronze layer"""
    def test_bronze_vehicle_master_has_recent_data(self, spark):
        """Verify bronze layer contains recent records"""
        from datetime import datetime
        from pyspark.sql.functions import max as spark_max
        df = spark.table("bronze_fvms.b_vehicle_master")
        max_loaded = df.select(spark_max("load_timestamp")).collect()[0][0]
        # collect() returns a Python datetime, so compare against datetime.now(); data should be under 30 days old
        assert (datetime.now() - max_loaded).days <= 30
def test_bronze_to_silver_row_counts_match_expectations(self, spark):
"""Validate row count transformation logic"""
bronze = spark.table("bronze_fvms.b_vehicle_master")
silver = spark.table("silver_fvms.s_vehicle_master")
# After deduplication, silver <= bronze
assert silver.count() <= bronze.count()
@pytest.mark.slow
def test_hash_key_uniqueness_on_live_data(self, spark):
"""Verify hash_key uniqueness in Silver layer (full scan)"""
df = spark.table("silver_fvms.s_vehicle_master")
total = df.count()
unique = df.select("hash_key").distinct().count()
assert total == unique, f"Duplicate hash_keys found: {total - unique}"
```
#### Pattern 4: Schema Validation
```python
# tests/test_schema_validation.py
import pytest
from pyspark.sql.types import StringType, TimestampType
class TestSchemaConformance:
def test_silver_vehicle_schema_matches_expected(self, spark):
"""Validate Silver layer schema against business requirements"""
df = spark.table("silver_fvms.s_vehicle_master")
schema_dict = {field.name: field.dataType for field in df.schema.fields}
# Critical fields must exist
assert "vehicle_id" in schema_dict
assert "hash_key" in schema_dict
assert "load_timestamp" in schema_dict
# Type validation
assert isinstance(schema_dict["vehicle_id"], StringType)
assert isinstance(schema_dict["load_timestamp"], TimestampType)
```
### 4. pytest Markers & Configuration
**pytest.ini**:
```ini
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
markers =
slow: marks tests as slow (deselect with '-m "not slow"')
integration: marks tests as integration tests
unit: marks tests as unit tests
live_data: tests that require live data access
addopts =
-v
--tb=short
--strict-markers
--disable-warnings
```
**Run specific test types**:
```bash
pytest tests/test_utilities.py -v # Single file
pytest -m unit # Only unit tests
pytest -m "not slow" # Skip slow tests
pytest -k "vehicle" # Tests matching "vehicle"
pytest --maxfail=1 # Stop on first failure
pytest -n auto # Parallel execution (pytest-xdist)
```
### 5. Advanced pytest Features
#### Parametrized Tests
```python
@pytest.mark.parametrize("table_name,expected_min_count", [
("bronze_fvms.b_vehicle_master", 1000),
("bronze_cms.b_customer_master", 500),
("bronze_nicherms.b_booking_master", 2000),
])
def test_bronze_tables_have_minimum_rows(spark, table_name, expected_min_count):
"""Validate minimum row counts across multiple live tables"""
df = spark.table(table_name)
assert df.count() >= expected_min_count
```
#### Fixtures with Live Data Sampling
```python
@pytest.fixture
def stratified_sample(bronze_data):
"""Stratified sample from live data for statistical tests"""
    # Fixed seed keeps the stratified sample reproducible across test runs
    return bronze_data.sampleBy("vehicle_type", fractions={"Car": 0.1, "Truck": 0.1}, seed=42)
```
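If useful, a hedged example of a test that consumes this fixture. The Car and Truck strata are assumptions carried over from the fractions above, so align them with the real `vehicle_type` domain:
```python
def test_stratified_sample_preserves_requested_vehicle_types(stratified_sample):
    """Sample should be non-empty and contain only the strata requested in the fixture"""
    types_in_sample = {row["vehicle_type"] for row in stratified_sample.select("vehicle_type").distinct().collect()}
    assert stratified_sample.count() > 0
    assert types_in_sample.issubset({"Car", "Truck"})
```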
### 6. Testing Best Practices
**DO**:
- ✅ Use `spark.table()` to read LIVE Bronze/Silver/Gold data
- ✅ Test with `.limit(100)` for speed, full dataset for validation
- ✅ Use `@pytest.fixture(scope="session")` for Spark session (reuse)
- ✅ Test actual ETL classes (e.g., `VehicleMaster()`)
- ✅ Validate data quality (nulls, duplicates, date ranges)
- ✅ Use `pytest.mark.parametrize` for testing multiple tables
- ✅ Clean up test outputs in teardown fixtures (see the teardown sketch after this list)
**DON'T**:
- ❌ Create mock/fake data (use real data samples)
- ❌ Skip testing because "data is too large" (use `.limit()`)
- ❌ Write tests that modify production tables
- ❌ Ignore schema validation
- ❌ Forget to test error handling with real edge cases
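For the clean-up point in the DO list, a minimal teardown sketch. It assumes a scratch schema such as `test_outputs` exists for test writes; the table name is hypothetical, and `spark` is the session fixture from conftest.py:
```python
import pytest

@pytest.fixture
def scratch_table(spark):
    """Yield a scratch table name and drop it after the test, so live tables stay untouched"""
    table_name = "test_outputs.tmp_silver_vehicle_master"  # hypothetical sandbox table
    yield table_name
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
```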
### 7. Example: Complete Test File
```python
# tests/test_silver_vehicle_master.py
import pytest
from pyspark.sql.functions import col, count, when
from python_files.silver.fvms.s_vehicle_master import VehicleMaster
class TestSilverVehicleMaster:
"""Test Silver layer VehicleMaster ETL with LIVE data"""
@pytest.fixture(scope="class")
def silver_df(self, spark):
"""Live Silver data - computed once per test class"""
return spark.table("silver_fvms.s_vehicle_master")
def test_all_required_columns_exist(self, silver_df):
"""Validate schema completeness"""
required = ["vehicle_id", "hash_key", "load_timestamp", "registration_number"]
        missing = [c for c in required if c not in silver_df.columns]
assert not missing, f"Missing columns: {missing}"
def test_no_nulls_in_primary_key(self, silver_df):
"""Primary key cannot be null"""
null_count = silver_df.filter(col("vehicle_id").isNull()).count()
assert null_count == 0
def test_hash_key_generated_for_all_rows(self, silver_df):
"""Every row must have hash_key"""
total = silver_df.count()
with_hash = silver_df.filter(col("hash_key").isNotNull()).count()
assert total == with_hash
@pytest.mark.slow
def test_deduplication_effectiveness(self, spark):
"""Compare Bronze vs Silver row counts"""
bronze = spark.table("bronze_fvms.b_vehicle_master")
silver = spark.table("silver_fvms.s_vehicle_master")
bronze_count = bronze.count()
silver_count = silver.count()
dedup_rate = (bronze_count - silver_count) / bronze_count * 100
print(f"Deduplication removed {dedup_rate:.2f}% of rows")
assert silver_count <= bronze_count
```
## Execution Workflow
1. **Read target file** ($ARGUMENTS) - Understand transformation logic
2. **Identify live data sources** - Find Bronze/Silver tables used
3. **Create test file** - `tests/test_<target>.py`
4. **Write fixtures** - Setup Spark session, load live data samples
5. **Write unit tests** - Test individual utility functions
6. **Write integration tests** - Test full ETL with live data
7. **Write validation tests** - Check data quality on live tables
8. **Run tests**: `pytest tests/test_<target>.py -v`
9. **Verify coverage**: Ensure >80% coverage of transformation logic, e.g. `pytest --cov=python_files --cov-report=term-missing` (assumes the pytest-cov plugin is installed)
## Output Deliverables
- ✅ pytest test file with 10+ test cases
- ✅ conftest.py with reusable fixtures
- ✅ pytest.ini configuration
- ✅ Tests use LIVE data from Bronze/Silver/Gold
- ✅ All tests pass: `pytest -v`
- ✅ Documentation comments showing live data usage