Initial commit
skills/auto-code-review-gate.md (new executable file, 396 lines)
@@ -0,0 +1,396 @@
|
||||
# Auto Code Review Gate Skill
|
||||
|
||||
## Skill Purpose
|
||||
Automatically run comprehensive code reviews before any PR-related commands (`/pr-*`) and ensure all identified issues are resolved before allowing commits to be pushed. This acts as a quality gate to prevent low-quality code from entering the staging/develop branches.
|
||||
|
||||
## Activation
|
||||
This skill is automatically triggered when any of these commands are called:
|
||||
- `/pr-feature-to-staging`
|
||||
- `/pr-deploy-workflow`
|
||||
- `/commit-and-pr`
|
||||
- `/pr-fix-pr-review`
|
||||
- Any other command starting with `/pr-`
|
||||
|
||||
## Workflow
|
||||
|
||||
### Phase 1: Pre-Commit Code Review
|
||||
|
||||
When a `/pr-*` command is detected:
|
||||
|
||||
1. **Intercept the command** - Don't execute the PR command yet
|
||||
2. **Display notice to user**:
|
||||
```
|
||||
🔍 AUTO CODE REVIEW GATE ACTIVATED
|
||||
Running comprehensive code review before proceeding with PR...
|
||||
This ensures code quality standards are met before merge.
|
||||
```
|
||||
|
||||
3. **Execute code review**:
|
||||
```bash
|
||||
/code-review
|
||||
```
|
||||
|
||||
4. **Analyze review results**:
|
||||
- Count total issues by severity (Critical, High, Medium, Low)
|
||||
- Create issue summary report
|
||||
- Determine if auto-fix is possible
|
||||
|
||||
### Phase 2: Issue Resolution
|
||||
|
||||
#### If NO issues found:
|
||||
```
|
||||
✅ CODE REVIEW PASSED
|
||||
No issues detected. Proceeding with original command...
|
||||
```
|
||||
→ Execute the original `/pr-*` command
|
||||
|
||||
#### If issues found (Critical or High priority):
|
||||
```
|
||||
❌ CODE REVIEW FAILED - BLOCKING ISSUES FOUND
|
||||
Found X critical and Y high-priority issues that must be fixed.
|
||||
|
||||
BLOCKING ISSUES:
|
||||
- [List of critical issues with file:line]
|
||||
- [List of high-priority issues with file:line]
|
||||
|
||||
🔧 AUTOMATIC FIX PROCESS INITIATED
|
||||
Launching pyspark-data-engineer agent to resolve issues...
|
||||
```
|
||||
|
||||
**Auto-Fix Workflow**:
|
||||
|
||||
1. **Create task document** (if not already exists):
|
||||
- Location: `.claude/tasks/pre_commit_code_review_fixes.md`
|
||||
- Format: Same as code review fixes task list
|
||||
- Include all critical and high-priority issues
|
||||
|
||||
2. **Launch pyspark-data-engineer agent**:
|
||||
```
|
||||
Task: Fix all critical and high-priority issues before PR
|
||||
Document: .claude/tasks/pre_commit_code_review_fixes.md
|
||||
Validation: Run syntax check, linting, and formatting after each fix
|
||||
```
|
||||
|
||||
3. **Wait for agent completion** and verify:
|
||||
- All critical issues resolved
|
||||
- All high-priority issues resolved
|
||||
- Syntax validation passes
|
||||
- Linting passes
|
||||
- No new issues introduced
|
||||
|
||||
4. **Re-run code review** to confirm all issues resolved
|
||||
|
||||
5. **Final decision**:
|
||||
- ✅ If all issues fixed: Proceed with original command
|
||||
- ❌ If issues remain: Block PR and display unresolved issues
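
The fix-and-verify loop above can be sketched as follows. This is illustrative pseudocode in the same spirit as the implementation hooks later in this document; `run_code_review`, `create_task_document`, `launch_agent`, `run_validation_suite`, `execute_command`, and `block_pr` are assumed helpers, not an existing API.

```python
# Illustrative sketch only - helper names are assumptions, not an existing API.
MAX_ATTEMPTS = 2  # mirrors max_auto_fix_attempts in the gate configuration

def fix_and_verify(blocking_issues, original_command):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        launch_agent("pyspark-data-engineer", create_task_document(blocking_issues))
        if not run_validation_suite():
            continue  # validation failed; retry if attempts remain
        # Re-run the review and keep only issues that still block the PR
        blocking_issues = [i for i in run_code_review() if i["severity"] in ("CRITICAL", "HIGH")]
        if not blocking_issues:
            return execute_command(original_command)
    block_pr(blocking_issues)  # attempts exhausted; report unresolved issues
```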
|
||||
|
||||
#### If only Medium/Low priority issues:
|
||||
```
|
||||
⚠️ CODE REVIEW WARNING - NON-BLOCKING ISSUES FOUND
|
||||
Found X medium and Y low-priority issues.
|
||||
|
||||
These won't block the PR but should be addressed soon.
|
||||
```
|
||||
|
||||
**User Choice**:
|
||||
```
|
||||
Do you want to:
|
||||
1. Auto-fix these issues before proceeding (recommended)
|
||||
2. Proceed with PR and create tech debt ticket
|
||||
3. Cancel and fix manually
|
||||
|
||||
Choice [1/2/3]:
|
||||
```
|
||||
|
||||
### Phase 3: Post-Fix Validation
|
||||
|
||||
After auto-fix completes:
|
||||
|
||||
1. **Run validation suite**:
|
||||
```bash
|
||||
python3 -m py_compile <modified_files>
|
||||
ruff check python_files/
|
||||
ruff format python_files/
|
||||
```
|
||||
|
||||
2. **Run second code review**:
|
||||
- Ensure no new issues introduced
|
||||
- Verify all original issues resolved
|
||||
- Check for any regressions
|
||||
|
||||
3. **Generate fix summary**:
|
||||
```
|
||||
📊 AUTO-FIX SUMMARY
|
||||
==================
|
||||
Files Modified: 4
|
||||
Issues Fixed: 9 (3 critical, 4 high, 2 medium)
|
||||
Validation: ✅ All checks passed
|
||||
|
||||
Modified Files:
|
||||
- python_files/gold/g_z_mg_occ_person_address.py
|
||||
- python_files/gold/g_xa_mg_statsclasscount.py
|
||||
- python_files/silver/silver_cms/s_cms_person.py
|
||||
- python_files/gold/g_xa_mg_cms_mo.py
|
||||
|
||||
✅ All issues resolved. Proceeding with PR...
|
||||
```
|
||||
|
||||
### Phase 4: Execute Original Command
|
||||
|
||||
Only after ALL critical/high issues are resolved:
|
||||
|
||||
1. **Add fixed files to git staging**:
|
||||
```bash
|
||||
git add <modified_files>
|
||||
```
|
||||
|
||||
2. **Create enhanced commit message**:
|
||||
```
|
||||
[Original commit message]
|
||||
|
||||
🤖 Auto Code Review Fixes Applied:
|
||||
- Fixed X critical issues
|
||||
- Fixed Y high-priority issues
|
||||
- All validation checks passed
|
||||
```
|
||||
|
||||
3. **Execute original `/pr-*` command**
|
||||
|
||||
4. **Display completion message**:
|
||||
```
|
||||
✅ PR CREATED WITH AUTO-FIXES
|
||||
All code quality issues have been resolved.
|
||||
PR is ready for human review.
|
||||
|
||||
Code Review Report: .claude/tasks/pre_commit_code_review_fixes.md
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### Severity Thresholds
|
||||
|
||||
```yaml
|
||||
# .claude/config/code_review_gate.yaml
|
||||
blocking_severities:
|
||||
- CRITICAL
|
||||
- HIGH
|
||||
|
||||
auto_fix_enabled: true
|
||||
auto_fix_medium_issues: true   # Prompt the user before auto-fixing medium issues
|
||||
auto_fix_low_issues: false # Skip low-priority auto-fix
|
||||
|
||||
max_auto_fix_attempts: 2
|
||||
validation_required: true
|
||||
```
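
A minimal sketch of how the gate might load and apply this configuration (assumes PyYAML is available; the helper names are illustrative):

```python
import yaml  # PyYAML, assumed available in the gate's execution environment

def load_gate_config(path=".claude/config/code_review_gate.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)

def is_blocking(severity, config):
    # An issue blocks the PR only if its severity appears in blocking_severities
    return severity.upper() in config.get("blocking_severities", ["CRITICAL", "HIGH"])
```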
|
||||
|
||||
### Bypass Options
|
||||
|
||||
**Emergency Override** (use with caution):
|
||||
```bash
|
||||
# Skip code review gate (requires explicit confirmation)
|
||||
/pr-feature-to-staging --skip-review-gate --confirm-override
|
||||
|
||||
# This will prompt:
|
||||
⚠️ DANGER: Skipping code review gate
|
||||
This may introduce bugs or technical debt.
|
||||
Type 'I UNDERSTAND THE RISKS' to proceed:
|
||||
```
|
||||
|
||||
## Implementation Hooks
|
||||
|
||||
### Hook 1: Command Interceptor
|
||||
```python
|
||||
# Intercepts all /pr-* commands
|
||||
if command.startswith("/pr-"):
|
||||
# Trigger auto-code-review-gate skill
|
||||
execute_skill("auto-code-review-gate")
|
||||
```
|
||||
|
||||
### Hook 2: Issue Detection
|
||||
```python
|
||||
# Parse code review output
|
||||
issues = parse_code_review_output(review_result)
|
||||
critical_count = count_by_severity(issues, "CRITICAL")
|
||||
high_count = count_by_severity(issues, "HIGH")
|
||||
|
||||
if critical_count > 0 or high_count > 0:
|
||||
block_pr = True
|
||||
attempt_auto_fix = True
|
||||
```
|
||||
|
||||
### Hook 3: Auto-Fix Delegation
|
||||
```python
|
||||
# Create task document and delegate to pyspark-data-engineer
|
||||
task_doc = create_task_document(issues)
|
||||
agent_result = launch_agent("pyspark-data-engineer", task_doc)
|
||||
|
||||
# Validate fixes
|
||||
validation_passed = run_validation_suite()
|
||||
issues_resolved = verify_issues_fixed(issues, agent_result)
|
||||
|
||||
if validation_passed and issues_resolved:
|
||||
allow_pr = True
|
||||
```
|
||||
|
||||
## Example Execution Flow
|
||||
|
||||
### Scenario: User runs `/pr-feature-to-staging`
|
||||
|
||||
```
|
||||
USER: /pr-feature-to-staging "feat: add new statsclasscount table"
|
||||
|
||||
SYSTEM:
|
||||
🔍 AUTO CODE REVIEW GATE ACTIVATED
|
||||
Running comprehensive code review before proceeding with PR...
|
||||
|
||||
[Code review executes...]
|
||||
|
||||
SYSTEM:
|
||||
❌ CODE REVIEW FAILED - 3 CRITICAL ISSUES FOUND
|
||||
|
||||
CRITICAL ISSUES:
|
||||
1. python_files/gold/g_z_mg_occ_person_address.py:43
|
||||
- Redundant Spark session initialization (memory leak risk)
|
||||
|
||||
2. python_files/gold/g_xa_mg_statsclasscount.py:100
|
||||
- Validation methods defined but never called (data quality risk)
|
||||
|
||||
3. python_files/gold/g_z_mg_occ_person_address.py:32
|
||||
- Unused constructor parameter (confusing API)
|
||||
|
||||
🔧 AUTOMATIC FIX PROCESS INITIATED
|
||||
Launching pyspark-data-engineer agent...
|
||||
|
||||
[Agent fixes all issues...]
|
||||
|
||||
SYSTEM:
|
||||
📊 AUTO-FIX SUMMARY
|
||||
==================
|
||||
Files Modified: 2
|
||||
Issues Fixed: 3 (3 critical)
|
||||
Validation: ✅ All checks passed
|
||||
|
||||
✅ All critical issues resolved.
|
||||
|
||||
Adding fixed files to commit:
|
||||
M python_files/gold/g_z_mg_occ_person_address.py
|
||||
M python_files/gold/g_xa_mg_statsclasscount.py
|
||||
|
||||
Proceeding with PR creation...
|
||||
|
||||
[Original /pr-feature-to-staging command executes]
|
||||
|
||||
SYSTEM:
|
||||
✅ PR CREATED SUCCESSFULLY
|
||||
Branch: feature/statsclasscount → staging
|
||||
PR #: 5830
|
||||
Status: Ready for review
|
||||
|
||||
All code quality gates passed! 🎉
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### If auto-fix fails:
|
||||
```
|
||||
❌ AUTO-FIX FAILED
|
||||
The pyspark-data-engineer agent was unable to resolve all issues.
|
||||
|
||||
Remaining Issues:
|
||||
- [List of unresolved issues]
|
||||
|
||||
NEXT STEPS:
|
||||
1. Review the task document: .claude/tasks/pre_commit_code_review_fixes.md
|
||||
2. Fix issues manually
|
||||
3. Re-run /pr-feature-to-staging when ready
|
||||
|
||||
OR
|
||||
|
||||
Use emergency override (not recommended):
|
||||
/pr-feature-to-staging --skip-review-gate --confirm-override
|
||||
```
|
||||
|
||||
### If validation fails after fix:
|
||||
```
|
||||
❌ VALIDATION FAILED AFTER AUTO-FIX
|
||||
The fixes introduced new issues or broke existing functionality.
|
||||
|
||||
Validation Errors:
|
||||
- [List of validation errors]
|
||||
|
||||
Rolling back auto-fixes...
|
||||
Original code restored.
|
||||
|
||||
NEXT STEPS:
|
||||
1. Review the code review report
|
||||
2. Fix issues manually with more care
|
||||
3. Test thoroughly before re-running PR command
|
||||
```
|
||||
|
||||
## Benefits
|
||||
|
||||
1. **Prevents bugs before merge**: Catches issues at commit time, not in production
|
||||
2. **Automated quality gates**: No manual intervention needed for common issues
|
||||
3. **Consistent code quality**: All PRs meet minimum quality standards
|
||||
4. **Faster review cycles**: Human reviewers see clean code
|
||||
5. **Learning tool**: Developers see fixes and learn patterns
|
||||
6. **Tech debt prevention**: Issues fixed immediately, not deferred
|
||||
|
||||
## Metrics Tracked
|
||||
|
||||
The skill automatically logs:
|
||||
- Number of PRs with code review issues
|
||||
- Issues caught per severity level
|
||||
- Auto-fix success rate
|
||||
- Time saved by automated fixes
|
||||
- Common issue patterns
|
||||
|
||||
Stored in: `.claude/metrics/code_review_gate_stats.json`
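
A possible shape for that metrics file, with an append helper (field names are illustrative, not a fixed schema):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STATS_PATH = Path(".claude/metrics/code_review_gate_stats.json")

def record_gate_run(issues_by_severity, auto_fix_succeeded):
    # Load existing stats (or start a new list), append one run record, write back
    stats = json.loads(STATS_PATH.read_text()) if STATS_PATH.exists() else []
    stats.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "issues_by_severity": issues_by_severity,  # e.g. {"CRITICAL": 3, "HIGH": 4}
        "auto_fix_succeeded": auto_fix_succeeded,
    })
    STATS_PATH.parent.mkdir(parents=True, exist_ok=True)
    STATS_PATH.write_text(json.dumps(stats, indent=2))
```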
|
||||
|
||||
## Integration with Existing Workflows
|
||||
|
||||
This skill works seamlessly with:
|
||||
- `/pr-feature-to-staging` - Adds quality gate before PR creation
|
||||
- `/pr-deploy-workflow` - Ensures clean code through entire deployment pipeline
|
||||
- `/commit-and-pr` - Quick commits still get quality checks
|
||||
- `/pr-fix-pr-review` - Prevents re-introducing issues when fixing review feedback
|
||||
|
||||
## Testing the Skill
|
||||
|
||||
To test the auto code review gate:
|
||||
|
||||
```bash
|
||||
# 1. Make some intentional code quality issues
|
||||
echo "import os\nimport os" >> test_file.py # Duplicate import
|
||||
|
||||
# 2. Try to create PR
|
||||
/pr-feature-to-staging "test auto review gate"
|
||||
|
||||
# 3. Verify gate catches issues and auto-fixes them
|
||||
|
||||
# 4. Confirm PR only proceeds after fixes applied
|
||||
```
|
||||
|
||||
## Maintenance
|
||||
|
||||
Update the skill when:
|
||||
- New code quality rules are added
|
||||
- Project standards change
|
||||
- New file types need review
|
||||
- Additional validation checks needed
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Potential improvements:
|
||||
1. **AI-powered issue prioritization**: Use ML to determine which issues are most critical
|
||||
2. **Team notification**: Slack/Teams alerts when auto-fixes are applied
|
||||
3. **Fix explanation**: Include detailed explanations of each fix for learning
|
||||
4. **Custom rule sets**: Project-specific or team-specific quality gates
|
||||
5. **Performance metrics**: Track build times and code quality trends
|
||||
|
||||
---
|
||||
|
||||
**Status**: Active
|
||||
**Version**: 1.0
|
||||
**Last Updated**: 2025-11-04
|
||||
**Owner**: DevOps/Quality Team
|
||||
skills/azure-devops.md (new file, 208 lines)
@@ -0,0 +1,208 @@
|
||||
---
|
||||
name: azure-devops
|
||||
description: On-demand Azure DevOps operations (PRs, work items, pipelines, repos) using context-efficient patterns. Loaded only when needed to avoid polluting Claude context with 50+ MCP tools. (project, gitignored)
|
||||
---
|
||||
|
||||
# Azure DevOps (On-Demand)
|
||||
|
||||
Context-efficient Azure DevOps operations without loading all MCP tools into context.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Load this skill when you need to:
|
||||
- Query pull request details, conflicts, or discussion threads
|
||||
- Check merge status or retrieve PR commits
|
||||
- Add comments to Azure DevOps work items
|
||||
- Query work item details or WIQL searches
|
||||
- Trigger or monitor pipeline runs
|
||||
- Manage repository branches or commits
|
||||
- Avoid loading 50+ MCP tools into Claude's context
|
||||
|
||||
## Core Concept
|
||||
|
||||
Use REST API helpers and Python scripts to interact with Azure DevOps only when needed. Results are filtered before being returned to context.
|
||||
|
||||
**Context Efficiency**:
|
||||
- **Without this approach**: Loading ADO MCP server → 50+ tools → 10,000-25,000 tokens
|
||||
- **With this approach**: Load specific helper when needed → 500-2,000 tokens
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Environment variables must be set:
|
||||
```bash
|
||||
export AZURE_DEVOPS_PAT="your-personal-access-token"
|
||||
export AZURE_DEVOPS_ORGANIZATION="emstas"
|
||||
export AZURE_DEVOPS_PROJECT="Program Unify"
|
||||
```
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### Pull Request Operations
|
||||
|
||||
```python
|
||||
from scripts.ado_pr_helper import ADOHelper
|
||||
|
||||
ado = ADOHelper()
|
||||
|
||||
# Get PR details
|
||||
pr = ado.get_pr(5860)
|
||||
print(pr["title"])
|
||||
print(pr["mergeStatus"])
|
||||
|
||||
# Check for merge conflicts
|
||||
conflicts = ado.get_pr_conflicts(5860)
|
||||
if conflicts.get("value"):
|
||||
print(f"Found {len(conflicts['value'])} conflicts")
|
||||
|
||||
# Get PR discussion threads
|
||||
threads = ado.get_pr_threads(5860)
|
||||
|
||||
# Get PR commits
|
||||
commits = ado.get_pr_commits(5860)
|
||||
```
|
||||
|
||||
### CLI Usage
|
||||
|
||||
```bash
|
||||
# Get PR details and check conflicts
|
||||
python3 /workspaces/unify_2_1_dm_synapse_env_d10/.claude/skills/mcp-code-execution/scripts/ado_pr_helper.py 5860
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Review and Fix PR Conflicts
|
||||
|
||||
```python
|
||||
# 1. Get PR details and conflicts
|
||||
ado = ADOHelper()
|
||||
pr = ado.get_pr(pr_id)
|
||||
conflicts = ado.get_pr_conflicts(pr_id)
|
||||
|
||||
# 2. Filter to only conflict info (don't load full PR data)
|
||||
conflict_files = [c["conflictPath"] for c in conflicts.get("value", [])]
|
||||
|
||||
# 3. Return summary to context
|
||||
print(f"PR {pr_id}: {pr['mergeStatus']}")
|
||||
print(f"Conflicts in: {', '.join(conflict_files)}")
|
||||
```
|
||||
|
||||
### Integration with Git Commands
|
||||
|
||||
This skill complements the git-manager agent and slash commands:
|
||||
- `/pr-feature-to-staging` - Uses ADO API to create PR and comment on work items
|
||||
- `/pr-fix-pr-review [PR_ID]` - Retrieves review comments via ADO API
|
||||
- `/pr-deploy-workflow` - Queries PR status during deployment
|
||||
- `/branch-cleanup` - Checks remote branch merge status
|
||||
|
||||
## Repository Configuration
|
||||
|
||||
**Organization**: emstas
|
||||
**Project**: Program Unify
|
||||
**Repository**: unify_2_1_dm_synapse_env_d10
|
||||
**Repository ID**: e030ea00-2f85-4b19-88c3-05a864d7298d
|
||||
|
||||
## Extending Functionality
|
||||
|
||||
To add more ADO operations:
|
||||
1. Add methods to `ado_pr_helper.py` or create new helper files
|
||||
2. Follow the pattern: fetch → filter → return summary
|
||||
3. Use REST API directly for maximum efficiency
|
||||
4. Document new operations in the skill directory
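
For example, a work-item helper following the fetch → filter → return-summary pattern might look like the sketch below. The `_get` method and exact field handling are assumptions about `ado_pr_helper.py` internals, not its current API.

```python
# Sketch of a method to add to ADOHelper - internals of ado_pr_helper.py may differ.
def get_work_item_summary(self, work_item_id: int) -> dict:
    # Fetch the full work item from the REST API...
    item = self._get(f"wit/workitems/{work_item_id}", params={"api-version": "7.1"})
    fields = item.get("fields", {})
    # ...then return only the handful of fields worth keeping in context.
    return {
        "id": item.get("id"),
        "title": fields.get("System.Title"),
        "state": fields.get("System.State"),
        "assigned_to": fields.get("System.AssignedTo", {}).get("displayName"),
    }
```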
|
||||
|
||||
## REST API Reference
|
||||
|
||||
**Base URL**: `https://dev.azure.com/{organization}/{project}/_apis/`
|
||||
**API Version**: `7.1`
|
||||
**Authentication**: Basic auth with PAT
|
||||
**Documentation**: https://learn.microsoft.com/en-us/rest/api/azure/devops/
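
In raw HTTP terms, a single PR lookup reduces to one authenticated GET. The sketch below uses `requests` and the repository ID from the configuration above:

```python
import os
import requests

org = os.environ["AZURE_DEVOPS_ORGANIZATION"]
project = os.environ["AZURE_DEVOPS_PROJECT"]
repo_id = "e030ea00-2f85-4b19-88c3-05a864d7298d"
url = f"https://dev.azure.com/{org}/{project}/_apis/git/repositories/{repo_id}/pullrequests/5860"
# Basic auth with an empty username and the PAT as the password
resp = requests.get(url, params={"api-version": "7.1"}, auth=("", os.environ["AZURE_DEVOPS_PAT"]))
resp.raise_for_status()
pr = resp.json()
print(pr["title"], pr["mergeStatus"])
```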
|
||||
|
||||
## Skill Directory Structure
|
||||
|
||||
For detailed documentation, see:
|
||||
- `azure-devops/skill.md` - Complete skill documentation
|
||||
- `azure-devops/scripts/` - Helper scripts (ado_pr_helper.py)
|
||||
- `azure-devops/README.md` - Quick start guide (future)
|
||||
- `azure-devops/INDEX.md` - Navigation guide (future)
|
||||
|
||||
## Best Practices
|
||||
|
||||
### DO
|
||||
- ✅ Use this skill to avoid loading MCP server tools
|
||||
- ✅ Filter results before returning to context
|
||||
- ✅ Return summaries instead of full data structures
|
||||
- ✅ Use helper scripts for common operations
|
||||
- ✅ Cache results when making multiple calls
|
||||
|
||||
### DON'T
|
||||
- ❌ Load MCP server if only querying 1-2 PRs
|
||||
- ❌ Return full JSON responses to context
|
||||
- ❌ Make redundant API calls
|
||||
- ❌ Expose PAT tokens in logs or responses
|
||||
|
||||
## Integration Points
|
||||
|
||||
### With Git Manager Agent
|
||||
- PR creation and status checking
|
||||
- Review comment retrieval
|
||||
- Work item commenting
|
||||
- Branch merge status
|
||||
|
||||
### With Deployment Workflows
|
||||
- Pipeline trigger and monitoring
|
||||
- PR validation before merge
|
||||
- Work item state updates
|
||||
- Commit linking
|
||||
|
||||
### With Documentation
|
||||
- Wiki page management (future)
|
||||
- Markdown documentation sync (future)
|
||||
- Work item documentation links
|
||||
|
||||
## Performance
|
||||
|
||||
**API Call Timing**:
|
||||
- Single PR query: ~200-500ms
|
||||
- PR with conflicts: ~300-700ms
|
||||
- PR threads retrieval: ~400-1000ms
|
||||
- Work item query: ~100-300ms
|
||||
|
||||
**Rate Limits**:
|
||||
- Azure DevOps API: 200 requests per minute per PAT
|
||||
- Best practice: Batch operations when possible
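
When several commands need the same PR within one session, a small in-process cache helps stay within these limits (illustrative sketch):

```python
from functools import lru_cache

from scripts.ado_pr_helper import ADOHelper

ado = ADOHelper()

@lru_cache(maxsize=64)
def get_pr_cached(pr_id: int) -> dict:
    # Repeated lookups for the same PR hit the cache instead of the API
    return ado.get_pr(pr_id)
```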
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Authentication Failed
|
||||
```bash
|
||||
# Verify PAT is set
|
||||
echo $AZURE_DEVOPS_PAT
|
||||
|
||||
# Test connection
|
||||
python3 scripts/ado_pr_helper.py [PR_ID]
|
||||
```
|
||||
|
||||
### Issue: PR Not Found
|
||||
- Verify PR ID is correct
|
||||
- Check repository configuration
|
||||
- Ensure PAT has read permissions
|
||||
|
||||
### Issue: Context Overflow
|
||||
- Use helper scripts instead of MCP tools
|
||||
- Filter results to essentials only
|
||||
- Return summaries not raw JSON
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Planned additions:
|
||||
- Work item helper functions
|
||||
- Pipeline operation helpers
|
||||
- Repository statistics
|
||||
- Build validation queries
|
||||
- Wiki management
|
||||
|
||||
---
|
||||
|
||||
**Created**: 2025-11-09
|
||||
**Version**: 1.0
|
||||
**Maintainer**: AI Agent Team
|
||||
**Status**: Production Ready
|
||||
skills/mcp-code-execution.md (new file, 288 lines)
@@ -0,0 +1,288 @@
|
||||
---
|
||||
name: mcp-code-execution
|
||||
description: Context-efficient MCP integration using code execution patterns. Use when building agents that interact with MCP servers, need to manage large tool sets (50+ tools), process large datasets through tools, or require multi-step workflows with intermediate results. Enables progressive tool loading, data filtering before context, and reusable skill persistence. (project, gitignored)
|
||||
---
|
||||
|
||||
# MCP Code Execution
|
||||
|
||||
Implement context-efficient MCP integrations using code execution patterns instead of direct tool calls.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Load this skill when you need to:
|
||||
- Work with MCP servers that expose 50+ tools (avoid context pollution)
|
||||
- Process large datasets through MCP tools (filter before returning to context)
|
||||
- Build multi-step workflows with intermediate results
|
||||
- Create reusable skill functions that persist across sessions
|
||||
- Progressively discover and load only needed tools
|
||||
- Achieve 98%+ context savings on MCP-heavy workflows
|
||||
|
||||
## Core Concept
|
||||
|
||||
Present MCP servers as code APIs on a filesystem: load tool definitions on demand, process data in the execution environment, and return only filtered results to context.
|
||||
|
||||
**Context Efficiency**:
|
||||
- **Before**: 150K tokens (all tool definitions + intermediate results)
|
||||
- **After**: 2K tokens (only used tools + filtered results)
|
||||
- **Savings**: 98.7%
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Generate Tool API from MCP Server
|
||||
|
||||
```bash
|
||||
python scripts/mcp_generator.py --server-config servers.json --output ./mcp_tools
|
||||
```
|
||||
|
||||
Creates a filesystem API:
|
||||
```
|
||||
mcp_tools/
|
||||
├── google_drive/
|
||||
│ ├── get_document.py
|
||||
│ └── list_files.py
|
||||
├── salesforce/
|
||||
│ ├── update_record.py
|
||||
│ └── query.py
|
||||
└── client.py # MCP client wrapper
|
||||
```
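
Each generated file is a thin wrapper around a single tool call. `mcp_tools/google_drive/get_document.py` might look roughly like this (a sketch of the generated shape, not its exact contents; `call_tool` stands in for whatever the generated `client.py` exposes):

```python
# Illustrative shape of a generated wrapper - actual generated code may differ.
from mcp_tools.client import call_tool  # generated MCP client wrapper (assumed interface)

async def get_document(document_id: str) -> dict:
    """Fetch a Google Drive document via the MCP server and return the parsed response."""
    return await call_tool(server="google_drive", tool="get_document", arguments={"documentId": document_id})
```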
|
||||
|
||||
### 2. Use Context-Efficient Patterns
|
||||
|
||||
```python
|
||||
import mcp_tools.google_drive as gdrive
|
||||
import mcp_tools.salesforce as sf
|
||||
|
||||
# Filter data before returning to context
|
||||
sheet = await gdrive.get_sheet("abc123")
|
||||
pending = [r for r in sheet if r["Status"] == "pending"]
|
||||
print(f"Found {len(pending)} pending orders") # Only summary in context
|
||||
|
||||
# Chain operations without intermediate context pollution
|
||||
doc = await gdrive.get_document("xyz789")
|
||||
await sf.update_record("Lead", "00Q123", {"Notes": doc["content"]})
|
||||
print("Document attached to lead") # Only confirmation in context
|
||||
```
|
||||
|
||||
### 3. Discover Tools Progressively
|
||||
|
||||
```python
|
||||
from scripts.tool_discovery import discover_tools, load_tool_definition
|
||||
|
||||
# List available servers
|
||||
servers = discover_tools("./mcp_tools")
|
||||
# ['google_drive', 'salesforce']
|
||||
|
||||
# Load only needed tool definitions
|
||||
tool = load_tool_definition("./mcp_tools/google_drive/get_document.py")
|
||||
```
|
||||
|
||||
## Multi-Agent Workflow
|
||||
|
||||
For complex tasks, delegate to specialized sub-agents:
|
||||
|
||||
1. **Discovery Agent**: Explores available tools, returns relevant paths
|
||||
2. **Execution Agent**: Writes and runs context-efficient code
|
||||
3. **Filtering Agent**: Processes results, returns minimal context
|
||||
|
||||
## Documentation Structure
|
||||
|
||||
This skill has comprehensive documentation organized by topic:
|
||||
|
||||
### Quick Reference
|
||||
- **`QUICK_START.md`** - 5-minute getting started guide
|
||||
- Installation and setup
|
||||
- First MCP integration
|
||||
- Common patterns
|
||||
- Troubleshooting
|
||||
|
||||
### Core Concepts
|
||||
- **`SKILL.md`** - Complete skill specification
|
||||
- Context optimization techniques
|
||||
- Tool discovery strategies
|
||||
- Privacy and security
|
||||
- Advanced patterns (aggregation, joins, polling, batching)
|
||||
|
||||
### Integration Guide
|
||||
- **`ADDING_MCP_SERVERS.md`** - How to add new MCP servers
|
||||
- Server configuration
|
||||
- Tool generation
|
||||
- Custom adapters
|
||||
- Testing and validation
|
||||
|
||||
### Supporting Files
|
||||
- **`examples/`** - Working code examples
|
||||
- **`references/`** - Pattern libraries and references
|
||||
- **`scripts/`** - Helper utilities (mcp_generator.py, tool_discovery.py)
|
||||
- **`mcp_configs/`** - Server configuration templates
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Azure DevOps MCP (Current Project)
|
||||
|
||||
**Without this approach**:
|
||||
- Load ADO MCP → 50+ tools → 10,000-25,000 tokens
|
||||
|
||||
**With this approach**:
|
||||
```python
|
||||
from scripts.ado_pr_helper import ADOHelper
|
||||
|
||||
ado = ADOHelper()
|
||||
pr = ado.get_pr(5860)
|
||||
print(f"PR {pr['title']}: {pr['mergeStatus']}")
|
||||
# Only 500-2,000 tokens
|
||||
```
|
||||
|
||||
### 2. Data Pipeline Integration
|
||||
|
||||
```python
|
||||
# Fetch from Google Sheets, process, push to Salesforce
|
||||
sheet = await gdrive.get_sheet("pipeline_data")
|
||||
validated = [r for r in sheet if validate_record(r)]
|
||||
for record in validated:
|
||||
await sf.create_record("Lead", record)
|
||||
print(f"Processed {len(validated)} records")
|
||||
```
|
||||
|
||||
### 3. Multi-Source Aggregation
|
||||
|
||||
```python
|
||||
# Aggregate from multiple sources without context bloat
|
||||
github_issues = await github.list_issues(repo="project")
|
||||
jira_tickets = await jira.search("project = PROJ")
|
||||
combined = merge_and_dedupe(github_issues, jira_tickets)
|
||||
print(f"Total issues: {len(combined)}")
|
||||
```
|
||||
|
||||
## Tool Discovery Strategies
|
||||
|
||||
### Filesystem Exploration
|
||||
List the `./mcp_tools/` directory and read specific tool files as needed.
|
||||
|
||||
### Search-Based Discovery
|
||||
```python
|
||||
from scripts.tool_discovery import search_tools
|
||||
|
||||
tools = search_tools("./mcp_tools", query="salesforce lead", detail="name_only")
|
||||
# Returns: ['salesforce/query.py', 'salesforce/update_record.py']
|
||||
```
|
||||
|
||||
### Lazy Loading
|
||||
Only read full tool definitions when about to use them.
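
A lazy loader defers the file read until first use (sketch; `load_tool_definition` is the helper shown earlier):

```python
from functools import lru_cache

from scripts.tool_discovery import load_tool_definition

@lru_cache(maxsize=None)
def tool(path: str):
    # The definition file is read (and cached) only the first time it is requested
    return load_tool_definition(path)

# Nothing is loaded until this line actually runs:
get_document = tool("./mcp_tools/google_drive/get_document.py")
```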
|
||||
|
||||
## Persisting Skills
|
||||
|
||||
Save working code as reusable functions:
|
||||
|
||||
```python
|
||||
# ./skills/extract_pending_orders.py
|
||||
async def extract_pending_orders(sheet_id: str):
|
||||
sheet = await gdrive.get_sheet(sheet_id)
|
||||
return [r for r in sheet if r["Status"] == "pending"]
|
||||
```
|
||||
|
||||
## Privacy & Security
|
||||
|
||||
Data processed in the execution environment stays there by default; only explicitly logged or returned values enter the context.
|
||||
|
||||
## Integration with Project
|
||||
|
||||
### With Azure DevOps
|
||||
- `azure-devops` skill uses this pattern via `ado_pr_helper.py`
|
||||
- Avoids loading 50+ ADO MCP tools
|
||||
- Returns filtered PR/work item summaries
|
||||
|
||||
### With Git Manager
|
||||
- PR operations use context-efficient ADO helpers
|
||||
- Work item linking without full MCP tool loading
|
||||
|
||||
### With Documentation
|
||||
- Potential future: Wiki operations via MCP
|
||||
|
||||
## Best Practices
|
||||
|
||||
### DO
|
||||
- ✅ Generate filesystem APIs for MCP servers
|
||||
- ✅ Filter data before returning to context
|
||||
- ✅ Use progressive tool discovery
|
||||
- ✅ Persist working code as reusable skills
|
||||
- ✅ Return summaries instead of full datasets
|
||||
- ✅ Chain operations to minimize intermediate context
|
||||
|
||||
### DON'T
|
||||
- ❌ Load all MCP tools into context upfront
|
||||
- ❌ Return large datasets to context unfiltered
|
||||
- ❌ Re-discover tools repeatedly (cache discovery)
|
||||
- ❌ Mix tool definitions with execution code
|
||||
- ❌ Expose sensitive data in print statements
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
| Metric | Direct MCP Tools | Code Execution Pattern | Improvement |
|
||||
|--------|------------------|------------------------|-------------|
|
||||
| Context Usage | 150K tokens | 2K tokens | 98.7% reduction |
|
||||
| Initial Load | 10K-25K tokens | 500 tokens | 95% reduction |
|
||||
| Result Size | 50K tokens | 1K tokens | 98% reduction |
|
||||
| Workflow Speed | Slow (context overhead) | Fast (in-process) | 5-10x faster |
|
||||
|
||||
## Quick Command Reference
|
||||
|
||||
### Generate MCP Tools
|
||||
```bash
|
||||
python scripts/mcp_generator.py --server-config servers.json --output ./mcp_tools
|
||||
```
|
||||
|
||||
### Discover Available Tools
|
||||
```bash
|
||||
python scripts/tool_discovery.py --mcp-dir ./mcp_tools
|
||||
```
|
||||
|
||||
### Test Tool Integration
|
||||
```bash
|
||||
python scripts/test_mcp_tool.py google_drive/get_document
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Tool Generation Failed
|
||||
- Verify MCP server is running
|
||||
- Check server configuration in servers.json
|
||||
- Review MCP client connection
|
||||
|
||||
### Issue: Import Errors
|
||||
- Ensure mcp_tools/ is in Python path
|
||||
- Check client.py is generated correctly
|
||||
- Verify all dependencies installed
|
||||
|
||||
### Issue: Context Still Large
|
||||
- Review what data is being returned
|
||||
- Add more aggressive filtering
|
||||
- Use summary statistics instead of raw data
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Planned additions:
|
||||
- Auto-generate README for each MCP server
|
||||
- Tool usage analytics and recommendations
|
||||
- Cached tool discovery
|
||||
- Multi-MCP orchestration patterns
|
||||
|
||||
## Getting Started
|
||||
|
||||
1. **New to MCP Code Execution?** → Read `QUICK_START.md`
|
||||
2. **Adding a new MCP server?** → Read `ADDING_MCP_SERVERS.md`
|
||||
3. **Need advanced patterns?** → Read `SKILL.md` sections on aggregation, joins, polling
|
||||
4. **Want examples?** → Browse `examples/` directory
|
||||
|
||||
## Related Skills
|
||||
|
||||
- **azure-devops** - Uses this pattern for ADO MCP integration
|
||||
- **multi-agent-orchestration** - Delegates MCP work to specialized agents
|
||||
- **skill-creator** - Create reusable MCP integration skills
|
||||
|
||||
---
|
||||
|
||||
**Created**: 2025-11-09
|
||||
**Version**: 1.0
|
||||
**Documentation**: 15,411 lines total (SKILL.md: 3,550, ADDING_MCP_SERVERS.md: 7,667, QUICK_START.md: 4,194)
|
||||
**Maintainer**: AI Agent Team
|
||||
**Status**: Production Ready
|
||||
skills/multi-agent-orchestration.md (new file, 866 lines)
@@ -0,0 +1,866 @@
|
||||
---
|
||||
description: Enable Claude to orchestrate complex tasks by spawning and managing specialized sub-agents for parallel or sequential decomposition. Use when tasks have clear independent subtasks, require specialized approaches for different components, benefit from parallel processing, need fault isolation, or involve complex state management across multiple steps. Best for data pipelines, code analysis workflows, content creation pipelines, and multi-stage processing tasks.
|
||||
tags: [orchestration, agents, parallel, automation, workflow]
|
||||
visibility: project
|
||||
---
|
||||
|
||||
# Multi-Agent Orchestration Skill
|
||||
|
||||
This skill provides intelligent task orchestration by routing work to the most appropriate execution strategy: planning discussion, single-agent background execution, or multi-agent parallel orchestration.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill PROACTIVELY when:
|
||||
- Tasks require more than one sequential step
|
||||
- Work can be parallelized across multiple independent components
|
||||
- You need to analyze complexity before deciding execution strategy
|
||||
- Tasks involve multiple files, layers, or domains (bronze/silver/gold)
|
||||
- Code quality sweeps across multiple directories
|
||||
- Feature implementation spanning multiple modules
|
||||
- Complex refactoring or optimization work
|
||||
- Pipeline validation or testing across all layers
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
This skill integrates three orchestration commands:
|
||||
|
||||
### 1. `/aa_command` - Orchestration Strategy Discussion
|
||||
**Purpose**: Analyze task complexity and recommend execution approach
|
||||
|
||||
**Use when**:
|
||||
- Task complexity is unclear
|
||||
- User needs guidance on best orchestration approach
|
||||
- Want to plan before executing
|
||||
- Determining optimal agent count and decomposition strategy
|
||||
|
||||
**Output**:
|
||||
- Task complexity assessment (Simple/Moderate/High)
|
||||
- Recommended approach (`/background` or `/orchestrate`)
|
||||
- Agent breakdown (if using orchestrate)
|
||||
- Dependency analysis (None/Sequential/Hybrid)
|
||||
- Estimated time
|
||||
- Concrete next steps with example commands
|
||||
|
||||
### 2. `/background` - Single Agent Background Execution
|
||||
**Purpose**: Launch one specialized PySpark data engineer agent to work autonomously
|
||||
|
||||
**Use when**:
|
||||
- Task is focused on 1-3 related files
|
||||
- Work is sequential and non-parallelizable
|
||||
- Complexity is moderate (not requiring decomposition)
|
||||
- Single domain/layer work (e.g., fixing one gold table)
|
||||
- Code review fixes for specific component
|
||||
- Targeted optimization or refactoring
|
||||
|
||||
**Agent Type**: `pyspark-data-engineer`
|
||||
|
||||
**Capabilities**:
|
||||
- Autonomous task execution
|
||||
- Quality gate validation (syntax, linting, formatting)
|
||||
- Comprehensive reporting
|
||||
- Follows medallion architecture patterns
|
||||
- Uses project utilities (SparkOptimiser, TableUtilities, NotebookLogger)
|
||||
|
||||
### 3. `/orchestrate` - Multi-Agent Parallel Orchestration
|
||||
**Purpose**: Coordinate 2-8 worker agents executing independent subtasks in parallel
|
||||
|
||||
**Use when**:
|
||||
- Task has 2+ independent subtasks
|
||||
- Work can run in parallel
|
||||
- Complexity is high (benefits from decomposition)
|
||||
- Cross-layer or cross-domain work (multiple bronze/silver/gold tables)
|
||||
- Code quality sweeps across multiple directories
|
||||
- Feature implementation requiring parallel development
|
||||
- Bulk operations on many files
|
||||
|
||||
**Agent Type**: `general-purpose` orchestrator managing `general-purpose` workers
|
||||
|
||||
**Capabilities**:
|
||||
- Task decomposition into 2-8 subtasks
|
||||
- Parallel agent launch and coordination
|
||||
- JSON-based structured communication
|
||||
- Quality validation across all agents
|
||||
- Consolidated metrics and reporting
|
||||
- Graceful failure handling
|
||||
|
||||
## Orchestration Decision Flow
|
||||
|
||||
```
|
||||
User Task
|
||||
↓
|
||||
Is complexity unclear?
|
||||
YES → /aa_command (analyze and recommend)
|
||||
NO ↓
|
||||
Is task decomposable into 2+ independent subtasks?
|
||||
NO → /background (single focused agent)
|
||||
YES ↓
|
||||
How many independent subtasks?
|
||||
2-8 → /orchestrate (parallel multi-agent)
|
||||
>8 → Recommend breaking into phases or refining decomposition
|
||||
```
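
The same routing can be expressed as a small helper (sketch; the thresholds mirror the flow above):

```python
def choose_strategy(independent_subtasks: int, complexity_unclear: bool) -> str:
    # Mirrors the decision flow: discuss first, then single agent, then orchestrate
    if complexity_unclear:
        return "/aa_command"
    if independent_subtasks < 2:
        return "/background"
    if independent_subtasks <= 8:
        return "/orchestrate"
    return "break into phases or refine the decomposition"
```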
|
||||
|
||||
## Usage Patterns
|
||||
|
||||
### Pattern 1: Planning First
|
||||
When task complexity is unclear, start with strategy discussion:
|
||||
|
||||
```
|
||||
User: "I need to improve performance across all gold tables"
|
||||
|
||||
You: [Invoke /aa_command to analyze complexity]
|
||||
|
||||
aa_command analyzes:
|
||||
- Task complexity: HIGH
|
||||
- Recommended: /orchestrate
|
||||
- Agent breakdown:
|
||||
- Agent 1: Analyze g_x_mg_* tables for bottlenecks
|
||||
- Agent 2: Analyze g_xa_* tables for bottlenecks
|
||||
- Agent 3: Review joins and aggregations across all tables
|
||||
- Agent 4: Check indexing and partitioning strategies
|
||||
- Agent 5: Implement optimization changes
|
||||
- Agent 6: Validate performance improvements
|
||||
- Estimated time: 45-60 minutes
|
||||
|
||||
Then you proceed with /orchestrate based on recommendation
|
||||
```
|
||||
|
||||
### Pattern 2: Direct Background Execution
|
||||
When task is clearly focused and non-decomposable:
|
||||
|
||||
```
|
||||
User: "Fix the validation issues in g_xa_mg_statsclasscount.py"
|
||||
|
||||
You: [Invoke /background directly]
|
||||
- Task: Single file, focused fix
|
||||
- Agent: pyspark-data-engineer
|
||||
- Estimated time: 10-15 minutes
|
||||
```
|
||||
|
||||
### Pattern 3: Direct Orchestration
|
||||
When parallelization is obvious:
|
||||
|
||||
```
|
||||
User: "Fix all linting errors across silver_cms, silver_fvms, and silver_nicherms"
|
||||
|
||||
You: [Invoke /orchestrate directly]
|
||||
- Subtasks clearly decomposable
|
||||
- 3 independent agents (one per database)
|
||||
- Parallel execution
|
||||
- Estimated time: 15-20 minutes
|
||||
```
|
||||
|
||||
### Pattern 4: Task File Usage
|
||||
When user has prepared a detailed task file:
|
||||
|
||||
```
|
||||
User: "/background code_review_fixes.md"
|
||||
|
||||
You: [Invoke /background with task file]
|
||||
- Reads .claude/tasks/code_review_fixes.md
|
||||
- Launches agent with complete task context
|
||||
- Executes all tasks in the file
|
||||
```
|
||||
|
||||
## Task File Structure
|
||||
|
||||
Task files live in the `.claude/tasks/` directory.
|
||||
|
||||
### Background Task File Format
|
||||
|
||||
```markdown
|
||||
# Task Title
|
||||
|
||||
**Date Created**: 2025-11-07
|
||||
**Priority**: HIGH/MEDIUM/LOW
|
||||
**Estimated Total Time**: X minutes
|
||||
**Files Affected**: N
|
||||
|
||||
## Task 1: Description
|
||||
**File**: python_files/gold/g_xa_mg_statsclasscount.py
|
||||
**Line**: 45
|
||||
**Estimated Time**: 5 minutes
|
||||
**Severity**: HIGH
|
||||
|
||||
**Current Code**:
|
||||
```python
|
||||
# problematic code
|
||||
```
|
||||
|
||||
**Required Fix**:
|
||||
```python
|
||||
# fixed code
|
||||
```
|
||||
|
||||
**Reason**: Explanation of why this needs fixing
|
||||
**Testing**: How to verify the fix works
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Description
|
||||
...
|
||||
```
|
||||
|
||||
### Orchestration Task File Format
|
||||
|
||||
```markdown
|
||||
# Orchestration Task Title
|
||||
|
||||
**Date Created**: 2025-11-07
|
||||
**Priority**: HIGH
|
||||
**Estimated Total Time**: X minutes
|
||||
**Complexity**: High
|
||||
**Recommended Worker Agents**: 5
|
||||
|
||||
## Main Objective
|
||||
Clear description of the overall goal
|
||||
|
||||
## Success Criteria
|
||||
- [ ] Criterion 1
|
||||
- [ ] Criterion 2
|
||||
- [ ] Criterion 3
|
||||
|
||||
## Suggested Subtask Decomposition
|
||||
|
||||
### Subtask 1: Title
|
||||
**Scope**: Files/components affected
|
||||
**Estimated Time**: X minutes
|
||||
**Dependencies**: None
|
||||
|
||||
**Description**: What needs to be done
|
||||
|
||||
**Expected Outputs**:
|
||||
- Output 1
|
||||
- Output 2
|
||||
|
||||
---
|
||||
|
||||
### Subtask 2: Title
|
||||
...
|
||||
```
|
||||
|
||||
## JSON Communication Protocol
|
||||
|
||||
All orchestrated agents communicate using structured JSON format.
|
||||
|
||||
### Worker Agent Response Format
|
||||
|
||||
```json
|
||||
{
|
||||
"agent_id": "agent_1",
|
||||
"task_assigned": "Fix linting in silver_cms files",
|
||||
"status": "completed",
|
||||
"results": {
|
||||
"files_modified": [
|
||||
"python_files/silver/silver_cms/s_cms_case_file.py",
|
||||
"python_files/silver/silver_cms/s_cms_offence_report.py"
|
||||
],
|
||||
"changes_summary": "Fixed 23 linting issues across 2 files",
|
||||
"metrics": {
|
||||
"lines_added": 15,
|
||||
"lines_removed": 8,
|
||||
"functions_added": 0,
|
||||
"issues_fixed": 23
|
||||
}
|
||||
},
|
||||
"quality_checks": {
|
||||
"syntax_check": "passed",
|
||||
"linting": "passed",
|
||||
"formatting": "passed"
|
||||
},
|
||||
"issues_encountered": [],
|
||||
"recommendations": ["Consider adding type hints to helper functions"],
|
||||
"execution_time_seconds": 180
|
||||
}
|
||||
```
|
||||
|
||||
### Orchestrator Final Report Format
|
||||
|
||||
```json
|
||||
{
|
||||
"orchestration_summary": {
|
||||
"main_task": "Fix all linting errors across silver layer",
|
||||
"total_agents_launched": 3,
|
||||
"successful_agents": 3,
|
||||
"failed_agents": 0,
|
||||
"total_execution_time_seconds": 540
|
||||
},
|
||||
"agent_results": [
|
||||
{...},
|
||||
{...},
|
||||
{...}
|
||||
],
|
||||
"consolidated_metrics": {
|
||||
"total_files_modified": 15,
|
||||
"total_lines_added": 127,
|
||||
"total_lines_removed": 84,
|
||||
"total_functions_added": 3,
|
||||
"total_issues_fixed": 89
|
||||
},
|
||||
"quality_validation": {
|
||||
"all_syntax_checks_passed": true,
|
||||
"all_linting_passed": true,
|
||||
"all_formatting_passed": true
|
||||
},
|
||||
"consolidated_issues": [],
|
||||
"consolidated_recommendations": [
|
||||
"Consider adding type hints across all silver layer files",
|
||||
"Review error handling patterns for consistency"
|
||||
],
|
||||
"next_steps": [
|
||||
"Run full test suite: python -m pytest python_files/testing/",
|
||||
"Execute silver layer pipeline: make run_silver",
|
||||
"Validate output in DuckDB: make harly"
|
||||
]
|
||||
}
|
||||
```
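
The orchestrator's aggregation step is essentially a fold over the worker payloads. A minimal sketch, assuming each response already matches the worker format above:

```python
from collections import Counter

def consolidate(worker_responses: list[dict]) -> dict:
    # Sum per-agent metrics and roll quality checks up into an all-passed flag
    metrics = Counter()
    for response in worker_responses:
        metrics.update(response.get("results", {}).get("metrics", {}))
    return {
        "total_agents_launched": len(worker_responses),
        "successful_agents": sum(r.get("status") == "completed" for r in worker_responses),
        "consolidated_metrics": dict(metrics),
        "all_quality_checks_passed": all(
            check == "passed" for r in worker_responses for check in r.get("quality_checks", {}).values()
        ),
    }
```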
|
||||
|
||||
## Quality Gates
|
||||
|
||||
All agents (background and orchestrated) MUST run these quality gates before completion:
|
||||
|
||||
1. **Syntax Validation**: `python3 -m py_compile <file_path>`
|
||||
2. **Linting**: `ruff check python_files/`
|
||||
3. **Formatting**: `ruff format python_files/`
|
||||
|
||||
Quality check results are included in the JSON responses and validated by the orchestrator.
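
Run programmatically, the three gates reduce to a few subprocess calls (sketch):

```python
import subprocess

def run_quality_gates(file_path: str) -> dict:
    gates = {
        "syntax_check": ["python3", "-m", "py_compile", file_path],
        "linting": ["ruff", "check", "python_files/"],
        "formatting": ["ruff", "format", "python_files/"],
    }
    # Map each gate to "passed"/"failed", matching the quality_checks JSON block
    return {name: "passed" if subprocess.run(cmd, capture_output=True).returncode == 0 else "failed" for name, cmd in gates.items()}
```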
|
||||
|
||||
## Complexity Assessment Guidelines
|
||||
|
||||
### Simple (Use /background)
|
||||
- 1-3 related files
|
||||
- Single layer (bronze, silver, or gold)
|
||||
- Sequential steps
|
||||
- Focused scope
|
||||
- Estimated time: <20 minutes
|
||||
|
||||
**Examples**:
|
||||
- Fix validation in one gold table
|
||||
- Add logging to a specific module
|
||||
- Refactor one ETL class
|
||||
- Update configuration for one component
|
||||
|
||||
### Moderate (Consider /background or /orchestrate)
|
||||
- 4-8 files
|
||||
- Single or multiple layers
|
||||
- Some parallelizable work
|
||||
- Medium scope
|
||||
- Estimated time: 20-40 minutes
|
||||
|
||||
**Decision factors**:
|
||||
- If files are tightly coupled → /background
|
||||
- If files are independent → /orchestrate
|
||||
|
||||
**Examples**:
|
||||
- Fix linting across one database (e.g., silver_cms)
|
||||
- Optimize all gold tables with same pattern
|
||||
- Add feature to one layer
|
||||
|
||||
### High (Use /orchestrate)
|
||||
- 8+ files OR cross-layer work
|
||||
- Multiple independent components
|
||||
- Highly parallelizable
|
||||
- Broad scope
|
||||
- Estimated time: 40+ minutes
|
||||
|
||||
**Examples**:
|
||||
- Fix linting across all layers
|
||||
- Implement feature across bronze/silver/gold
|
||||
- Code quality sweep across entire project
|
||||
- Performance optimization for all tables
|
||||
- Test suite creation for full pipeline
|
||||
|
||||
## Agent Configuration
|
||||
|
||||
### Background Agent
|
||||
```python
|
||||
Task(
|
||||
subagent_type="pyspark-data-engineer",
|
||||
model="sonnet", # or "opus" for complex tasks
|
||||
description="Fix gold table validation",
|
||||
prompt="""
|
||||
You are a PySpark data engineer working on Unify 2.1 Data Migration.
|
||||
|
||||
CRITICAL INSTRUCTIONS:
|
||||
- Read and follow .claude/CLAUDE.md
|
||||
- Use .claude/rules/python_rules.md for coding standards
|
||||
- Maximum line length: 240 characters
|
||||
- No blank lines inside functions
|
||||
- Use @synapse_error_print_handler decorator
|
||||
- Use NotebookLogger for logging
|
||||
- Use TableUtilities for DataFrame operations
|
||||
|
||||
TASK: {task_content}
|
||||
|
||||
QUALITY GATES (MUST RUN):
|
||||
1. python3 -m py_compile <file_path>
|
||||
2. ruff check python_files/
|
||||
3. ruff format python_files/
|
||||
|
||||
Provide comprehensive final report with:
|
||||
- Summary of changes
|
||||
- Files modified with line numbers
|
||||
- Quality gate results
|
||||
- Testing recommendations
|
||||
- Issues and resolutions
|
||||
- Next steps
|
||||
"""
|
||||
)
|
||||
```
|
||||
|
||||
### Orchestrator Agent
|
||||
```python
|
||||
Task(
|
||||
subagent_type="general-purpose",
|
||||
model="sonnet", # or "opus" for very complex orchestrations
|
||||
description="Orchestrate pipeline optimization",
|
||||
prompt="""
|
||||
You are an ORCHESTRATOR AGENT coordinating multiple worker agents.
|
||||
|
||||
PROJECT CONTEXT:
|
||||
- Project: Unify 2.1 Data Migration using Azure Synapse Analytics
|
||||
- Architecture: Medallion pattern (Bronze/Silver/Gold)
|
||||
- Language: PySpark Python
|
||||
- Follow: .claude/CLAUDE.md and .claude/rules/python_rules.md
|
||||
|
||||
YOUR RESPONSIBILITIES:
|
||||
1. Analyze task and decompose into 2-8 subtasks
|
||||
2. Launch worker agents (Task tool, subagent_type="general-purpose")
|
||||
3. Provide clear instructions with JSON response format
|
||||
4. Collect and validate all worker responses
|
||||
5. Aggregate results and metrics
|
||||
6. Produce final consolidated report
|
||||
|
||||
MAIN TASK: {task_content}
|
||||
|
||||
WORKER JSON FORMAT:
|
||||
{
|
||||
"agent_id": "unique_id",
|
||||
"task_assigned": "description",
|
||||
"status": "completed|failed|partial",
|
||||
"results": {...},
|
||||
"quality_checks": {...},
|
||||
"issues_encountered": [...],
|
||||
"recommendations": [...],
|
||||
"execution_time_seconds": 0
|
||||
}
|
||||
|
||||
Work autonomously and orchestrate complete task execution.
|
||||
"""
|
||||
)
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Worker Agent Failures
|
||||
- Orchestrator captures failure details
|
||||
- Marks agent status as "failed"
|
||||
- Continues with other agents
|
||||
- Reports failure in final summary
|
||||
- Suggests recovery steps
|
||||
|
||||
### JSON Parse Errors
|
||||
- Orchestrator logs parse error
|
||||
- Attempts partial result extraction
|
||||
- Marks response as invalid
|
||||
- Flags for manual review
|
||||
- Continues with valid responses
|
||||
|
||||
### Quality Check Failures
|
||||
- Orchestrator flags the failure
|
||||
- Includes failure details in report
|
||||
- Prevents final approval
|
||||
- Suggests corrective actions
|
||||
- May relaunch worker with corrections
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Parallel Execution
|
||||
- Launch all independent agents simultaneously
|
||||
- Use Task tool with multiple concurrent calls in single message
|
||||
- Maximize parallelism for faster completion
|
||||
- Monitor resource utilization
|
||||
|
||||
### Agent Sizing
|
||||
- **2-8 agents**: Optimal for most orchestrated tasks
|
||||
- **<2 agents**: Use `/background` instead
|
||||
- **>8 agents**: Consider phased approach or refinement
|
||||
- Balance granularity against coordination overhead
|
||||
|
||||
### Context Management
|
||||
- Provide minimal necessary context
|
||||
- Avoid duplicating shared information
|
||||
- Reference shared documentation (.claude/CLAUDE.md)
|
||||
- Keep prompts focused and concise
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Task Decomposition
|
||||
- Break into 2-8 independent subtasks
|
||||
- Avoid inter-agent dependencies when possible
|
||||
- Balance workload across agents
|
||||
- Group related work logically
|
||||
- Consider file/component boundaries
|
||||
- Respect layer separation (bronze/silver/gold)
|
||||
|
||||
### Worker Prompts
|
||||
- Make each subtask self-contained
|
||||
- Include all necessary context
|
||||
- Specify exact file paths and requirements
|
||||
- Define clear success criteria
|
||||
- Require JSON response format
|
||||
- Include quality gate validation
|
||||
- Request execution metrics
|
||||
|
||||
### Orchestrator Design
|
||||
- Keep worker tasks independent
|
||||
- Provide complete context to each worker
|
||||
- Assign unique, meaningful agent IDs
|
||||
- Specify clear JSON response requirements
|
||||
- Validate all JSON responses
|
||||
- Handle worker failures gracefully
|
||||
- Aggregate results systematically
|
||||
- Provide actionable consolidated report
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Strategy Discussion
|
||||
```
|
||||
User: "I need to improve data quality across all layers"
|
||||
|
||||
Claude: [Invokes /aa_command]
|
||||
|
||||
Analysis:
|
||||
- Task Complexity: HIGH
|
||||
- Recommended Approach: /orchestrate
|
||||
- Agent Breakdown:
|
||||
- Agent 1: Design quality framework and base classes
|
||||
- Agent 2: Implement bronze layer validators
|
||||
- Agent 3: Implement silver layer validators
|
||||
- Agent 4: Implement gold layer validators
|
||||
- Agent 5: Create validation tests
|
||||
- Agent 6: Update documentation
|
||||
- Dependencies: Agent 1 must complete before 2-4 (hybrid approach)
|
||||
- Estimated Time: 50-70 minutes
|
||||
- Next Steps: Create task file, then run /orchestrate data_quality_framework.md
|
||||
```
|
||||
|
||||
### Example 2: Background Single Agent
|
||||
```
|
||||
User: "Fix the transformation logic in g_xa_mg_statsclasscount.py"
|
||||
|
||||
Claude: [Invokes /background directly]
|
||||
|
||||
Launched: pyspark-data-engineer agent
|
||||
Task: Fix transformation logic in single gold table
|
||||
Estimated Time: 12-15 minutes
|
||||
Agent will:
|
||||
- Read g_xa_mg_statsclasscount.py
|
||||
- Analyze transformation logic issues
|
||||
- Implement fixes following medallion patterns
|
||||
- Run quality gates (syntax, linting, formatting)
|
||||
- Provide comprehensive report
|
||||
```
|
||||
|
||||
### Example 3: Parallel Orchestration
|
||||
```
|
||||
User: "Fix all linting errors across the silver layer"
|
||||
|
||||
Claude: [Invokes /orchestrate directly]
|
||||
|
||||
Launched: Orchestrator agent coordinating 3 workers
|
||||
- Worker 1: Fix silver_cms linting errors
|
||||
- Worker 2: Fix silver_fvms linting errors
|
||||
- Worker 3: Fix silver_nicherms linting errors
|
||||
Execution: Fully parallel (no dependencies)
|
||||
Estimated Time: 15-20 minutes
|
||||
|
||||
Orchestrator will:
|
||||
- Launch 3 agents simultaneously
|
||||
- Collect JSON responses from each
|
||||
- Validate quality checks passed
|
||||
- Aggregate metrics (files modified, issues fixed)
|
||||
- Produce consolidated report
|
||||
```
|
||||
|
||||
### Example 4: Task File Execution
|
||||
```
|
||||
User: "/background code_review_fixes.md"
|
||||
|
||||
Claude: [Invokes /background with task file]
|
||||
|
||||
Found: .claude/tasks/code_review_fixes.md
|
||||
Tasks: 9 code review fixes across 5 files
|
||||
Priority: HIGH
|
||||
Estimated Time: 27 minutes
|
||||
|
||||
Agent will:
|
||||
- Read task file with detailed fix instructions
|
||||
- Execute all 9 fixes sequentially
|
||||
- Validate each fix with quality gates
|
||||
- Provide comprehensive report on all changes
|
||||
```
|
||||
|
||||
### Example 5: Complex Orchestration with Task File
|
||||
```
|
||||
User: "/orchestrate pipeline_optimization.md"
|
||||
|
||||
Claude: [Invokes /orchestrate with task file]
|
||||
|
||||
Found: .claude/tasks/pipeline_optimization.md
|
||||
Recommended Agents: 6
|
||||
Complexity: HIGH
|
||||
Estimated Time: 60 minutes
|
||||
|
||||
Task file suggests decomposition:
|
||||
- Agent 1: Profile bronze layer performance
|
||||
- Agent 2: Profile silver layer performance
|
||||
- Agent 3: Profile gold layer performance
|
||||
- Agent 4: Analyze join strategies
|
||||
- Agent 5: Implement optimization changes
|
||||
- Agent 6: Validate performance improvements
|
||||
|
||||
Orchestrator will coordinate all 6 agents and produce consolidated metrics.
|
||||
```
|
||||
|
||||
## Command Reference
|
||||
|
||||
### /aa_command - Strategy Discussion
|
||||
```bash
|
||||
# Analyze task complexity
|
||||
/aa_command "optimize all gold tables"
|
||||
|
||||
# Get approach recommendations
|
||||
/aa_command "implement monitoring across layers"
|
||||
|
||||
# Plan refactoring work
|
||||
/aa_command "update all ETL classes to new pattern"
|
||||
```
|
||||
|
||||
**Output**: Complexity assessment, recommended approach, agent breakdown, next steps
|
||||
|
||||
### /background - Single Agent
|
||||
```bash
|
||||
# Direct prompt
|
||||
/background "fix validation in g_xa_mg_statsclasscount.py"
|
||||
|
||||
# Task file
|
||||
/background code_review_fixes.md
|
||||
|
||||
# List available task files
|
||||
/background list
|
||||
```
|
||||
|
||||
**Output**: Agent launch confirmation, estimated time, final comprehensive report
|
||||
|
||||
### /orchestrate - Multi-Agent
|
||||
```bash
|
||||
# Direct prompt
|
||||
/orchestrate "fix linting across all silver layer files"
|
||||
|
||||
# Task file
|
||||
/orchestrate data_quality_framework.md
|
||||
|
||||
# List available orchestration tasks
|
||||
/orchestrate list
|
||||
```
|
||||
|
||||
**Output**: Orchestrator launch confirmation, worker count, final JSON consolidated report
|
||||
|
||||
## Integration with Project Workflow
|
||||
|
||||
### With Git Operations
|
||||
```bash
|
||||
# 1. Run orchestration
|
||||
/orchestrate "optimize all gold tables"
|
||||
|
||||
# 2. After completion, commit changes
|
||||
/local-commit "feat: optimize gold layer performance"
|
||||
|
||||
# 3. Create PR
|
||||
/pr-feature-to-staging
|
||||
```
|
||||
|
||||
### With Testing
|
||||
```bash
|
||||
# 1. Run orchestration
|
||||
/background "add validation to gold tables"
|
||||
|
||||
# 2. After completion, write tests
|
||||
/write-tests --data-validation
|
||||
|
||||
# 3. Run tests
|
||||
make run_all
|
||||
```
|
||||
|
||||
### With Documentation
|
||||
```bash
|
||||
# 1. Run orchestration
|
||||
/orchestrate "implement new feature across layers"
|
||||
|
||||
# 2. After completion, update docs
|
||||
/update-docs --generate-local
|
||||
|
||||
# 3. Sync to wiki
|
||||
/update-docs --sync-to-wiki
|
||||
```
|
||||
|
||||
## Success Criteria
|
||||
|
||||
### For Background Agent
|
||||
- ✅ All code changes implemented
|
||||
- ✅ Syntax validation passes
|
||||
- ✅ Linting passes
|
||||
- ✅ Code formatted
|
||||
- ✅ No new issues introduced
|
||||
- ✅ Comprehensive final report provided
|
||||
|
||||
### For Orchestrated Agents
|
||||
- ✅ All worker agents launched successfully
|
||||
- ✅ All worker agents returned valid JSON responses
|
||||
- ✅ All quality checks passed across all agents
|
||||
- ✅ No unresolved issues or failures
|
||||
- ✅ Consolidated metrics calculated correctly
|
||||
- ✅ Comprehensive orchestration report provided
|
||||
- ✅ All files syntax validated
|
||||
- ✅ All files linted and formatted
|
||||
|
||||
## Limitations and Considerations
|
||||
|
||||
### When NOT to Use Multi-Agent Orchestration
|
||||
- Task is trivial (single file, simple change)
|
||||
- Work is highly sequential with tight dependencies
|
||||
- Task requires continuous user interaction
|
||||
- Subtasks cannot be clearly defined
|
||||
- Less than 2 independent components
|
||||
|
||||
**Alternative**: Use standard tools (Read, Edit, Write) or single `/background` agent
|
||||
|
||||
### Agent Count Guidelines
|
||||
- **2-3 agents**: Small to medium parallelizable tasks
|
||||
- **4-6 agents**: Medium to large tasks with clear decomposition
|
||||
- **7-8 agents**: Very large tasks with many independent components
|
||||
- **>8 agents**: Consider breaking into phases or hybrid approach
|
||||
|
||||
### Resource Considerations
|
||||
- Each agent consumes computational resources
|
||||
- Parallel execution may strain system resources
|
||||
- Monitor execution time across agents
|
||||
- Consider sequential phasing for very large tasks
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Task File Not Found
|
||||
**Solution**:
|
||||
- Check file exists in `.claude/tasks/`
|
||||
- Verify exact filename (case-sensitive)
|
||||
- Use `/background list` or `/orchestrate list` to see available files
|
||||
|
||||
### Issue: Agent Not Completing
|
||||
**Solution**:
|
||||
- Check agent complexity (may need more time)
|
||||
- Review task scope (may be too broad)
|
||||
- Consider breaking into smaller subtasks
|
||||
- Switch from `/orchestrate` to `/background` for simpler tasks
|
||||
|
||||
### Issue: Quality Gates Failing
|
||||
**Solution**:
|
||||
- Review code changes made by agent
|
||||
- Check for syntax errors or linting issues
|
||||
- Manually run quality gates to diagnose
|
||||
- May need to refine task instructions
|
||||
|
||||
### Issue: JSON Parse Errors
|
||||
**Solution**:
|
||||
- Check worker agent response format
|
||||
- Verify JSON structure is valid
|
||||
- The orchestrator should handle malformed responses gracefully
|
||||
- Review worker prompt for JSON format requirements
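
A minimal sketch of the kind of defensive check the orchestrator can run on a worker reply before consolidating results. The required field names here (`status`, `files_changed`, `issues`) are illustrative placeholders, not the project's actual worker schema.

```python
import json

# Illustrative required fields -- substitute the keys your worker prompt actually mandates.
REQUIRED_FIELDS = {"status", "files_changed", "issues"}


def check_worker_response(raw: str) -> dict:
    """Parse a worker agent reply and explain exactly what is wrong with it, if anything."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Worker reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object, got {type(payload).__name__}")
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"Worker reply is missing fields: {sorted(missing)}")
    return payload


# A truncated reply fails fast with a clear message instead of a silent parse error.
try:
    check_worker_response('{"status": "complete"')
except ValueError as err:
    print(err)
```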
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Hybrid Sequential-Parallel
|
||||
```
|
||||
Phase 1: Single agent designs framework
|
||||
↓ (outputs JSON schema)
|
||||
Phase 2: 4 agents implement in parallel using schema
|
||||
↓ (outputs implementations)
|
||||
Phase 3: Single agent validates and integrates
|
||||
```
|
||||
|
||||
### Recursive Orchestration
|
||||
```
|
||||
Main Orchestrator
|
||||
↓
|
||||
Sub-Orchestrator 1 (bronze layer)
|
||||
↓
|
||||
Workers: bronze_cms, bronze_fvms, bronze_nicherms
|
||||
↓
|
||||
Sub-Orchestrator 2 (silver layer)
|
||||
↓
|
||||
Workers: silver_cms, silver_fvms, silver_nicherms
|
||||
```
|
||||
|
||||
### Incremental Validation
|
||||
```
|
||||
Agent 1: Implement changes → Worker reports
|
||||
↓
|
||||
Orchestrator validates → Approves/Rejects
|
||||
↓
|
||||
Agent 2: Builds on Agent 1 → Worker reports
|
||||
↓
|
||||
Orchestrator validates → Approves/Rejects
|
||||
↓
|
||||
Continue...
|
||||
```
|
||||
|
||||
## Related Project Patterns
|
||||
|
||||
### Medallion Architecture Orchestration
|
||||
```
|
||||
Bronze Layer → Silver Layer → Gold Layer
|
||||
Each layer can have parallel agents:
|
||||
- bronze_cms, bronze_fvms, bronze_nicherms
|
||||
- silver_cms, silver_fvms, silver_nicherms
|
||||
- gold_x_mg, gold_xa, gold_xb
|
||||
```
|
||||
|
||||
### Quality Gate Orchestration
|
||||
```
|
||||
Agent 1: Syntax validation (all files)
|
||||
Agent 2: Linting (all files)
|
||||
Agent 3: Formatting (all files)
|
||||
Agent 4: Unit tests
|
||||
Agent 5: Integration tests
|
||||
Agent 6: Data validation tests
|
||||
```
|
||||
|
||||
### Feature Implementation Orchestration
|
||||
```
|
||||
Agent 1: Design and base classes
|
||||
Agent 2: Bronze layer implementation
|
||||
Agent 3: Silver layer implementation
|
||||
Agent 4: Gold layer implementation
|
||||
Agent 5: Testing suite
|
||||
Agent 6: Documentation
|
||||
Agent 7: Configuration updates
|
||||
```
|
||||
|
||||
## Skill Activation
|
||||
|
||||
This skill is loaded on-demand. When user requests involve:
|
||||
- "optimize all tables"
|
||||
- "fix across multiple layers"
|
||||
- "implement feature in all databases"
|
||||
- "code quality sweep"
|
||||
- Complex multi-step tasks
|
||||
|
||||
You should PROACTIVELY consider using this skill to route work appropriately.
|
||||
|
||||
## Further Reading
|
||||
|
||||
- `.claude/commands/aa_command.md` - Strategy discussion command
|
||||
- `.claude/commands/background.md` - Single agent background execution
|
||||
- `.claude/commands/orchestrate.md` - Multi-agent orchestration
|
||||
- `.claude/tasks/` - Example task files
|
||||
- `.claude/CLAUDE.md` - Project guidelines and patterns
|
||||
- `.claude/rules/python_rules.md` - Python coding standards
|
||||
161
skills/project-architecture.md
Normal file
@@ -0,0 +1,161 @@
|
||||
---
|
||||
name: project-architecture
|
||||
description: Detailed architecture, data flow, pipeline execution, dependencies, and system design for the Unify data migration project. Use when you need deep understanding of how components interact.
|
||||
---
|
||||
|
||||
# Project Architecture
|
||||
|
||||
Comprehensive architecture documentation for the Unify data migration project.
|
||||
|
||||
## Medallion Architecture Deep Dive
|
||||
|
||||
### Bronze Layer
|
||||
**Purpose**: Raw data ingestion from parquet files
|
||||
**Location**: `python_files/pipeline_operations/bronze_layer_deployment.py`
|
||||
**Process**:
|
||||
1. Lists parquet files from Azure ADLS Gen2 or local storage
|
||||
2. Creates bronze databases: `bronze_cms`, `bronze_fvms`, `bronze_nicherms`
|
||||
3. Reads parquet files and applies basic transformations
|
||||
4. Adds versioning, row hashes, and data source columns
|
||||
|
||||
### Silver Layer
|
||||
**Purpose**: Validated, standardized data organized by source
|
||||
**Location**: `python_files/silver/` (cms, fvms, nicherms subdirectories)
|
||||
**Process**:
|
||||
1. Drops and recreates silver databases
|
||||
2. Recursively finds all Python files in `python_files/silver/`
|
||||
3. Executes each silver transformation file in sorted order
|
||||
4. Uses threading for parallel execution (currently commented out)
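
A minimal sketch of steps 2 and 3 (recursive discovery, then ordered execution). This is illustrative only: it assumes plain `runpy` execution rather than the project's actual deployment script, and it ignores the threading mentioned in step 4.

```python
import runpy
from pathlib import Path

SILVER_ROOT = Path("python_files/silver")

# Step 2: recursively find every silver transformation file.
silver_files = sorted(SILVER_ROOT.rglob("*.py"))

# Step 3: execute each file in sorted order; each file runs its own ETL class when executed.
for silver_file in silver_files:
    print(f"Running {silver_file}")
    runpy.run_path(str(silver_file), run_name="__main__")
```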
|
||||
|
||||
### Gold Layer
|
||||
**Purpose**: Business-ready, aggregated analytical datasets
|
||||
**Location**: `python_files/gold/`
|
||||
**Process**:
|
||||
1. Creates business-ready analytical tables in `gold_data_model` database
|
||||
2. Executes transformations from `python_files/gold/`
|
||||
3. Aggregates and joins data across multiple silver tables
|
||||
|
||||
## Data Sources
|
||||
|
||||
### FVMS (Family Violence Management System)
|
||||
- **Tables**: 32 tables
|
||||
- **Key tables**: incident, person, address, risk_assessment
|
||||
- **Purpose**: Family violence incident tracking and management
|
||||
|
||||
### CMS (Case Management System)
|
||||
- **Tables**: 19 tables
|
||||
- **Key tables**: offence_report, case_file, person, victim
|
||||
- **Purpose**: Criminal offence investigation and case management
|
||||
|
||||
### NicheRMS (Records Management System)
|
||||
- **Tables**: 39 TBL_* tables
|
||||
- **Purpose**: Legacy records management system
|
||||
|
||||
## Azure Integration
|
||||
|
||||
### Storage (ADLS Gen2)
|
||||
- **Containers**: `bronze-layer`, `code-layer`, `legacy_ingestion`
|
||||
- **Authentication**: Managed Identity (`AZURE_MANAGED_IDENTITY_CLIENT_ID`)
|
||||
- **Path Pattern**: `abfss://container@account.dfs.core.windows.net/path`
|
||||
|
||||
### Key Services
|
||||
- **Key Vault**: `AuE-DataMig-Dev-KV` for secret management
|
||||
- **Synapse Workspace**: `auedatamigdevsynws`
|
||||
- **Spark Pool**: `dm8c64gb`
|
||||
|
||||
## Environment Detection Pattern
|
||||
|
||||
All processing scripts auto-detect their runtime environment:
|
||||
|
||||
```python
|
||||
if "/home/trusted-service-user" == env_vars["HOME"]:
|
||||
# Azure Synapse Analytics production environment
|
||||
import notebookutils.mssparkutils as mssparkutils
|
||||
spark = SparkOptimiser.get_optimised_spark_session()
|
||||
DATA_PATH_STRING = "abfss://code-layer@auedatamigdevlake.dfs.core.windows.net"
|
||||
else:
|
||||
# Local development environment using Docker Spark container
|
||||
from python_files.utilities.local_spark_connection import sparkConnector
|
||||
config = UtilityFunctions.get_settings_from_yaml("configuration.yaml")
|
||||
connector = sparkConnector(...)
|
||||
DATA_PATH_STRING = config["DATA_PATH_STRING"]
|
||||
```
|
||||
|
||||
## Core Utilities Architecture
|
||||
|
||||
### SparkOptimiser
|
||||
- Configured Spark session with optimized settings
|
||||
- Handles driver memory, encryption, authentication
|
||||
- Centralized session management
|
||||
|
||||
### NotebookLogger
|
||||
- Rich console logging with fallback to standard print
|
||||
- Structured logging (info, warning, error, success)
|
||||
- Graceful degradation when Rich library unavailable
|
||||
|
||||
### TableUtilities
|
||||
- DataFrame operations (deduplication, hashing, timestamp conversion)
|
||||
- `add_row_hash()`: Change detection
|
||||
- `save_as_table()`: Standard table save with timestamp conversion
|
||||
- `clean_date_time_columns()`: Intelligent timestamp parsing
|
||||
- `drop_duplicates_simple/advanced()`: Deduplication strategies
|
||||
- `filter_and_drop_column()`: Remove duplicate flags
|
||||
|
||||
### DAGMonitor
|
||||
- Pipeline execution tracking and reporting
|
||||
- Performance metrics and logging
|
||||
|
||||
## Configuration Management
|
||||
|
||||
### configuration.yaml
|
||||
Central YAML configuration includes:
|
||||
- **Data Sources**: FVMS, CMS, NicheRMS table lists (`*_IN_SCOPE` variables)
|
||||
- **Azure Settings**: Storage accounts, Key Vault, Synapse workspace, subscription IDs
|
||||
- **Spark Settings**: Driver, encryption, authentication scheme
|
||||
- **Data Paths**: Local (`/workspaces/data`) vs Azure (`abfss://`)
|
||||
- **Logging**: LOG_LEVEL, LOG_ROTATION, LOG_RETENTION
|
||||
- **Nulls Handling**: STRING_NULL_REPLACEMENT, NUMERIC_NULL_REPLACEMENT, TIMESTAMP_NULL_REPLACEMENT
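
A minimal sketch of reading these settings, assuming `UtilityFunctions.get_settings_from_yaml` (used in the environment-detection snippet above) returns a plain dict keyed by the names listed here; the import path is assumed and may differ in the repo.

```python
from python_files.utilities.session_optimiser import UtilityFunctions  # import path assumed

# Load the central configuration once, then read individual settings from the dict.
config = UtilityFunctions.get_settings_from_yaml("configuration.yaml")

data_path = config["DATA_PATH_STRING"]            # local path or abfss:// root
log_level = config["LOG_LEVEL"]                   # logging settings
string_null = config["STRING_NULL_REPLACEMENT"]   # nulls-handling defaults
print(data_path, log_level, string_null)
```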
|
||||
|
||||
## Error Handling Strategy
|
||||
|
||||
- **Decorator-Based**: `@synapse_error_print_handler` for consistent error handling
|
||||
- **Loguru Integration**: Structured logging with proper levels
|
||||
- **Graceful Degradation**: Handle missing dependencies (Rich library fallback)
|
||||
- **Context Information**: Include table/database names in all log messages
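
The project's decorator lives in `session_optimiser.py`; the sketch below only illustrates the general shape of such a decorator (log the failure with context, then re-raise) and is not the actual implementation.

```python
import functools

from loguru import logger


def error_print_handler(func):
    """Illustrative stand-in for the project's error-handling decorator."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log with full traceback and context, then propagate so callers still see the failure.
            logger.exception(f"{func.__qualname__} failed")
            raise
    return wrapper


@error_print_handler
def load_table(database: str, table: str) -> None:
    # Include table/database names so every log line carries context.
    logger.info(f"Loading {database}.{table}")
```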
|
||||
|
||||
## Local Data Filtering
|
||||
|
||||
`TableUtilities.save_as_table()` automatically filters to the last N years when a `date_created` column exists, controlled by the `NUMBER_OF_YEARS` global variable in `session_optimiser.py`. This prevents full-dataset processing in local development.
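
Conceptually, the filter applied is equivalent to the sketch below (illustrative only, not the actual `save_as_table()` code; the value of `NUMBER_OF_YEARS` is assumed).

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import add_months, col, current_date

NUMBER_OF_YEARS = 3  # assumed value; the real global lives in session_optimiser.py


def filter_recent_years(df: DataFrame) -> DataFrame:
    """Keep only rows created in the last NUMBER_OF_YEARS years, if the column exists."""
    if "date_created" not in df.columns:
        return df
    cutoff = add_months(current_date(), -12 * NUMBER_OF_YEARS)
    return df.filter(col("date_created") >= cutoff)
```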
|
||||
|
||||
## Testing Architecture
|
||||
|
||||
### Test Structure
|
||||
- `python_files/testing/`: Unit and integration tests
|
||||
- `medallion_testing.py`: Full pipeline validation
|
||||
- `bronze_layer_validation.py`: Bronze layer tests
|
||||
- `ingestion_layer_validation.py`: Ingestion tests
|
||||
|
||||
### Testing Strategy
|
||||
- pytest integration with PySpark environments
|
||||
- Quality gates: syntax validation and linting before completion
|
||||
- Integration tests for full medallion flow
|
||||
|
||||
## DuckDB Integration
|
||||
|
||||
After running pipelines, build local DuckDB database for fast SQL analysis:
|
||||
- **File**: `/workspaces/data/warehouse.duckdb`
|
||||
- **Command**: `make build_duckdb`
|
||||
- **Purpose**: Fast local queries without Azure connection
|
||||
- **Contains**: All bronze, silver, gold layer tables
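
For example, once `make build_duckdb` has run, the warehouse can be queried directly from Python. This sketch assumes the build script mirrors each Spark database as a DuckDB schema; the gold table name is one referenced elsewhere in this project, so substitute any table that exists in your build.

```python
import duckdb

# Open the local warehouse read-only so a concurrent pipeline run is not blocked.
con = duckdb.connect("/workspaces/data/warehouse.duckdb", read_only=True)

# List what was loaded, then sample a gold table.
print(con.execute("SHOW ALL TABLES").fetchall())
rows = con.execute(
    "SELECT * FROM gold_data_model.g_x_mg_statsclasscount LIMIT 10"
).fetchall()
print(rows)
```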
|
||||
|
||||
## Recent Architectural Changes
|
||||
|
||||
### Path Migration
|
||||
- Standardized all paths to use `unify_2_1_dm_synapse_env_d10`
|
||||
- Improved portability and environment consistency
|
||||
- 12 files updated across utilities, notebooks, configurations
|
||||
|
||||
### Code Cleanup
|
||||
- Removed unused utilities: `file_executor.py`, `file_finder.py`
|
||||
- Reduced codebase complexity
|
||||
- Regular cleanup pattern for maintainability
|
||||
247
skills/project-commands.md
Normal file
@@ -0,0 +1,247 @@
|
||||
---
|
||||
name: project-commands
|
||||
description: Complete reference for all make commands, development workflows, Azure operations, and database operations. Use when you need to know how to run specific operations.
|
||||
---
|
||||
|
||||
# Project Commands Reference
|
||||
|
||||
Complete command reference for the Unify data migration project.
|
||||
|
||||
## Build & Test Commands
|
||||
|
||||
### Syntax Validation
|
||||
```bash
|
||||
python3 -m py_compile <file_path>
|
||||
python3 -m py_compile python_files/utilities/session_optimiser.py
|
||||
```
|
||||
|
||||
### Code Quality
|
||||
```bash
|
||||
ruff check python_files/ # Linting (must pass)
|
||||
ruff format python_files/ # Auto-format code
|
||||
```
|
||||
|
||||
### Testing
|
||||
```bash
|
||||
python -m pytest python_files/testing/ # All tests
|
||||
python -m pytest python_files/testing/medallion_testing.py # Integration
|
||||
```
|
||||
|
||||
## Pipeline Commands
|
||||
|
||||
### Complete Pipeline
|
||||
```bash
|
||||
make run_all # Executes: choice_list_mapper → bronze → silver → gold → build_duckdb
|
||||
```
|
||||
|
||||
### Layer-Specific (WARNING: Deletes existing layer data)
|
||||
```bash
|
||||
make bronze # Bronze layer pipeline (deletes /workspaces/data/bronze_*)
|
||||
make run_silver # Silver layer (includes choice_list_mapper, deletes /workspaces/data/silver_*)
|
||||
make gold # Gold layer (includes DuckDB build, deletes /workspaces/data/gold_*)
|
||||
```
|
||||
|
||||
### Specific Table Execution
|
||||
```bash
|
||||
# Run specific silver table
|
||||
make silver_table FILE_READ_LAYER=silver PATH_DATABASE=silver_fvms RUN_FILE_NAME=s_fvms_incident
|
||||
|
||||
# Run specific gold table
|
||||
make gold_table G_RUN_FILE_NAME=g_x_mg_statsclasscount
|
||||
|
||||
# Run currently open file (auto-detects layer and database)
|
||||
make current_table # Requires: make install_file_tracker (run once, then reload VSCode)
|
||||
```
|
||||
|
||||
## Development Workflow
|
||||
|
||||
### Interactive UI
|
||||
```bash
|
||||
make ui # Interactive menu for all commands
|
||||
```
|
||||
|
||||
### Data Generation
|
||||
```bash
|
||||
make generate_data # Generate synthetic test data
|
||||
```
|
||||
|
||||
## Spark Thrift Server
|
||||
|
||||
Enables JDBC/ODBC connections to local Spark data on port 10000:
|
||||
|
||||
```bash
|
||||
make thrift-start # Start server
|
||||
make thrift-status # Check if running
|
||||
make thrift-stop # Stop server
|
||||
|
||||
# Connect via spark-sql CLI
|
||||
spark-sql -e "SHOW DATABASES; SHOW TABLES;"
|
||||
spark-sql -e "SELECT * FROM gold_data_model.g_x_mg_statsclasscount LIMIT 10;"
|
||||
```
|
||||
|
||||
## Database Operations
|
||||
|
||||
### Database Inspection
|
||||
```bash
|
||||
make database-check # Check Hive databases and tables
|
||||
|
||||
# View schemas
|
||||
spark-sql -e "SHOW DATABASES; SHOW TABLES;"
|
||||
```
|
||||
|
||||
### DuckDB Operations
|
||||
```bash
|
||||
make build_duckdb # Build local DuckDB database (/workspaces/data/warehouse.duckdb)
|
||||
make harly # Open Harlequin TUI for interactive DuckDB queries
|
||||
```
|
||||
|
||||
**DuckDB Benefits**:
|
||||
- Fast local queries without Azure connection
|
||||
- Data exploration and validation
|
||||
- Report prototyping
|
||||
- Testing query logic before deploying to Synapse
|
||||
|
||||
## Azure Operations
|
||||
|
||||
### Authentication
|
||||
```bash
|
||||
make azure_login # Azure CLI login
|
||||
```
|
||||
|
||||
### SharePoint Integration
|
||||
```bash
|
||||
# Download SharePoint files
|
||||
make download_sharepoint SHAREPOINT_FILE_ID=<file-id>
|
||||
|
||||
# Convert Excel to JSON
|
||||
make convert_excel_to_json
|
||||
|
||||
# Upload to Azure Storage
|
||||
make upload_to_storage UPLOAD_FILE=<file-path>
|
||||
```
|
||||
|
||||
### Complete Pipelines
|
||||
```bash
|
||||
# Offence mapping pipeline
|
||||
make offence_mapping_build # download_sharepoint → convert_excel_to_json → upload_to_storage
|
||||
|
||||
# Table list management
|
||||
make table_lists_pipeline # download_ors_table_mapping → generate_table_lists → upload_all_table_lists
|
||||
make update_pipeline_variables # Update Azure Synapse pipeline variables
|
||||
```
|
||||
|
||||
## AI Agent Integration
|
||||
|
||||
### User Story Processing
|
||||
Automate ETL file generation from Azure DevOps user stories:
|
||||
|
||||
```bash
|
||||
make user_story_build \
|
||||
A_USER_STORY=44687 \
|
||||
A_FILE_NAME=g_x_mg_statsclasscount \
|
||||
A_READ_LAYER=silver \
|
||||
A_WRITE_LAYER=gold
|
||||
```
|
||||
|
||||
**What it does**:
|
||||
- Reads user story requirements from Azure DevOps
|
||||
- Generates ETL transformation code
|
||||
- Creates appropriate tests
|
||||
- Follows project coding standards
|
||||
|
||||
### Agent Session
|
||||
```bash
|
||||
make session # Start persistent Claude Code session with dangerously-skip-permissions
|
||||
```
|
||||
|
||||
## Git Operations
|
||||
|
||||
### Branch Merging
|
||||
```bash
|
||||
make merge_staging # Merge from staging (adds all changes, commits, pulls with --no-ff)
|
||||
make rebase_staging # Rebase from staging (adds all changes, commits, rebases)
|
||||
```
|
||||
|
||||
## Environment Variables
|
||||
|
||||
### Required for Azure DevOps MCP
|
||||
```bash
|
||||
export AZURE_DEVOPS_PAT="<your-personal-access-token>"
|
||||
export AZURE_DEVOPS_ORGANIZATION="emstas"
|
||||
export AZURE_DEVOPS_PROJECT="Program Unify"
|
||||
```
|
||||
|
||||
### Required for Azure Operations
|
||||
See `configuration.yaml` for complete list of Azure environment variables.
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Complete Development Cycle
|
||||
```bash
|
||||
# 1. Generate test data
|
||||
make generate_data
|
||||
|
||||
# 2. Run full pipeline
|
||||
make run_all
|
||||
|
||||
# 3. Explore results
|
||||
make harly
|
||||
|
||||
# 4. Run tests
|
||||
python -m pytest python_files/testing/
|
||||
|
||||
# 5. Quality checks
|
||||
ruff check python_files/
|
||||
ruff format python_files/
|
||||
```
|
||||
|
||||
### Quick Table Development
|
||||
```bash
|
||||
# 1. Open file in VSCode
|
||||
# 2. Run current file
|
||||
make current_table
|
||||
|
||||
# 3. Check output in DuckDB
|
||||
make harly
|
||||
```
|
||||
|
||||
### Quality Gates Before Commit
|
||||
```bash
|
||||
# Must run these before committing
|
||||
python3 -m py_compile <file> # 1. Syntax check
|
||||
ruff check python_files/ # 2. Linting (must pass)
|
||||
ruff format python_files/ # 3. Format code
|
||||
```
|
||||
|
||||
## Troubleshooting Commands
|
||||
|
||||
### Check Spark Session
|
||||
```bash
|
||||
spark-sql -e "SHOW DATABASES;"
|
||||
```
|
||||
|
||||
### Verify Azure Connection
|
||||
```bash
|
||||
make azure_login
|
||||
az account show
|
||||
```
|
||||
|
||||
### Check Data Paths
|
||||
```bash
|
||||
ls -la /workspaces/data/
|
||||
```
|
||||
|
||||
## File Tracker Setup
|
||||
|
||||
One-time setup for `make current_table`:
|
||||
```bash
|
||||
make install_file_tracker
|
||||
# Then reload VSCode
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **Data Deletion**: Layer-specific commands delete existing data before running
|
||||
- **Thrift Server**: Port 10000 for JDBC/ODBC connections
|
||||
- **DuckDB**: Local analysis without Azure connection required
|
||||
- **Quality Gates**: Always run before committing code
|
||||
359
skills/pyspark-patterns.md
Normal file
@@ -0,0 +1,359 @@
|
||||
---
|
||||
name: pyspark-patterns
|
||||
description: PySpark best practices, TableUtilities methods, ETL patterns, logging standards, and DataFrame operations for this project. Use when writing or debugging PySpark code.
|
||||
---
|
||||
|
||||
# PySpark Patterns & Best Practices
|
||||
|
||||
Comprehensive guide to PySpark patterns used in the Unify data migration project.
|
||||
|
||||
## Core Principle
|
||||
|
||||
**Always use DataFrame operations over raw SQL** when possible.
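
For example, prefer the DataFrame API over an equivalent `spark.sql` string. A minimal illustration: the table name and the `status` filter are placeholders, and the session is obtained via the project's `SparkOptimiser` as shown later in this document.

```python
from pyspark.sql.functions import col
from utilities.session_optimiser import SparkOptimiser

spark = SparkOptimiser.get_optimised_spark_session()

# Instead of a raw SQL string...
df_sql = spark.sql("SELECT * FROM silver_cms.s_cms_case WHERE status = 'open'")

# ...compose DataFrame operations, which are easier to refactor, test, and reuse.
df_api = spark.table("silver_cms.s_cms_case").filter(col("status") == "open")
```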
|
||||
|
||||
## TableUtilities Class Methods
|
||||
|
||||
Central utility class providing standardized DataFrame operations.
|
||||
|
||||
### add_row_hash()
|
||||
Add hash column for change detection and deduplication.
|
||||
|
||||
```python
|
||||
table_utilities = TableUtilities()
|
||||
df_with_hash = table_utilities.add_row_hash(df)
|
||||
```
|
||||
|
||||
### save_as_table()
|
||||
Standard table save with timestamp conversion and automatic filtering.
|
||||
|
||||
```python
|
||||
table_utilities.save_as_table(df, "database.table_name")
|
||||
```
|
||||
|
||||
**Features**:
|
||||
- Converts timestamp columns automatically
|
||||
- Filters to last N years when `date_created` column exists (controlled by `NUMBER_OF_YEARS`)
|
||||
- Prevents full dataset processing in local development
|
||||
|
||||
### clean_date_time_columns()
|
||||
Intelligent timestamp parsing for various date formats.
|
||||
|
||||
```python
|
||||
df_cleaned = table_utilities.clean_date_time_columns(df)
|
||||
```
|
||||
|
||||
### Deduplication Methods
|
||||
|
||||
**Simple deduplication** (all columns):
|
||||
```python
|
||||
df_deduped = table_utilities.drop_duplicates_simple(df)
|
||||
```
|
||||
|
||||
**Advanced deduplication** (specific columns, ordering):
|
||||
```python
|
||||
df_deduped = table_utilities.drop_duplicates_advanced(
|
||||
df,
|
||||
partition_columns=["id"],
|
||||
order_columns=["date_created"]
|
||||
)
|
||||
```
|
||||
|
||||
### filter_and_drop_column()
|
||||
Remove duplicate flags after processing.
|
||||
|
||||
```python
|
||||
df_filtered = table_utilities.filter_and_drop_column(df, "is_duplicate")
|
||||
```
|
||||
|
||||
### generate_deduplicate()
|
||||
Compare with existing table and identify new/changed records.
|
||||
|
||||
```python
|
||||
df_new = table_utilities.generate_deduplicate(df, "database.existing_table")
|
||||
```
|
||||
|
||||
### generate_unique_ids()
|
||||
Generate auto-incrementing unique identifiers.
|
||||
|
||||
```python
|
||||
df_with_id = table_utilities.generate_unique_ids(df, "unique_id_column_name")
|
||||
```
|
||||
|
||||
## ETL Class Pattern
|
||||
|
||||
All silver and gold transformations follow this standardized pattern:
|
||||
|
||||
```python
|
||||
class TableName:
|
||||
def __init__(self, bronze_table_name: str):
|
||||
self.bronze_table_name = bronze_table_name
|
||||
self.silver_database_name = f"silver_{self.bronze_table_name.split('.')[0].split('_')[-1]}"
|
||||
self.silver_table_name = self.bronze_table_name.split(".")[-1].replace("b_", "s_")
|
||||
|
||||
# Execute ETL pipeline
|
||||
self.extract_sdf = self.extract()
|
||||
self.transform_sdf = self.transform()
|
||||
self.load()
|
||||
|
||||
@synapse_error_print_handler
|
||||
def extract(self) -> DataFrame:
|
||||
"""Extract data from source tables."""
|
||||
logger.info(f"Extracting from {self.bronze_table_name}")
|
||||
df = spark.table(self.bronze_table_name)
|
||||
logger.success(f"Extracted {df.count()} records")
|
||||
return df
|
||||
|
||||
@synapse_error_print_handler
|
||||
def transform(self) -> DataFrame:
|
||||
"""Transform data according to business rules."""
|
||||
logger.info("Starting transformation")
|
||||
# Apply transformations
|
||||
transformed_df = self.extract_sdf.filter(...).select(...)
|
||||
logger.success("Transformation complete")
|
||||
return transformed_df
|
||||
|
||||
@synapse_error_print_handler
|
||||
def load(self) -> None:
|
||||
"""Load data to target table."""
|
||||
logger.info(f"Loading to {self.silver_database_name}.{self.silver_table_name}")
|
||||
table_utilities.save_as_table(
|
||||
self.transform_sdf,
|
||||
f"{self.silver_database_name}.{self.silver_table_name}"
|
||||
)
|
||||
logger.success(f"Successfully loaded {self.silver_table_name}")
|
||||
|
||||
|
||||
# Instantiate with exception handling
|
||||
try:
|
||||
TableName("bronze_database.b_table_name")
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing TableName: {str(e)}")
|
||||
raise e
|
||||
```
|
||||
|
||||
## Logging Standards
|
||||
|
||||
### Use NotebookLogger (Never print())
|
||||
|
||||
```python
|
||||
from utilities.session_optimiser import NotebookLogger
|
||||
|
||||
logger = NotebookLogger()
|
||||
|
||||
# Log levels
|
||||
logger.info("Starting process") # Informational messages
|
||||
logger.warning("Potential issue detected") # Warnings
|
||||
logger.error("Operation failed") # Errors
|
||||
logger.success("Process completed") # Success messages
|
||||
```
|
||||
|
||||
### Logging Best Practices
|
||||
|
||||
1. **Always include table/database names**:
|
||||
```python
|
||||
logger.info(f"Processing table {database}.{table}")
|
||||
```
|
||||
|
||||
2. **Log at key milestones**:
|
||||
```python
|
||||
logger.info("Starting extraction")
|
||||
# ... extraction code
|
||||
logger.success("Extraction complete")
|
||||
```
|
||||
|
||||
3. **Include counts and metrics**:
|
||||
```python
|
||||
logger.info(f"Extracted {df.count()} records from {table}")
|
||||
```
|
||||
|
||||
4. **Error context**:
|
||||
```python
|
||||
logger.error(f"Failed to process {table}: {str(e)}")
|
||||
```
|
||||
|
||||
## Error Handling Pattern
|
||||
|
||||
### @synapse_error_print_handler Decorator
|
||||
|
||||
Wrap ALL processing functions with this decorator:
|
||||
|
||||
```python
|
||||
from utilities.session_optimiser import synapse_error_print_handler
|
||||
|
||||
@synapse_error_print_handler
|
||||
def extract(self) -> DataFrame:
|
||||
# Your code here
|
||||
return df
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Consistent error handling across codebase
|
||||
- Automatic error logging
|
||||
- Graceful error propagation
|
||||
|
||||
### Exception Handling at Instantiation
|
||||
|
||||
```python
|
||||
try:
|
||||
MyETLClass("source_table")
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing MyETLClass: {str(e)}")
|
||||
raise e
|
||||
```
|
||||
|
||||
## DataFrame Operations Patterns
|
||||
|
||||
### Filtering
|
||||
```python
|
||||
# Use col() for clarity
|
||||
from pyspark.sql.functions import col
|
||||
|
||||
df_filtered = df.filter(col("status") == "active")
|
||||
df_filtered = df.filter((col("age") > 18) & (col("country") == "AU"))
|
||||
```
|
||||
|
||||
### Selecting and Aliasing
|
||||
```python
|
||||
from pyspark.sql.functions import col, lit
|
||||
|
||||
df_selected = df.select(
|
||||
col("id"),
|
||||
col("name").alias("person_name"),
|
||||
lit("constant_value").alias("constant_column")
|
||||
)
|
||||
```
|
||||
|
||||
### Joins
|
||||
```python
|
||||
# Always use explicit join keys and type
|
||||
df_joined = df1.join(
|
||||
df2,
|
||||
df1["id"] == df2["person_id"],
|
||||
"inner" # inner, left, right, outer
|
||||
)
|
||||
|
||||
# Drop duplicate columns after join
|
||||
df_joined = df_joined.drop(df2["person_id"])
|
||||
```
|
||||
|
||||
### Window Functions
|
||||
```python
|
||||
from pyspark.sql import Window
|
||||
from pyspark.sql.functions import row_number, rank, dense_rank
|
||||
|
||||
window_spec = Window.partitionBy("category").orderBy(col("date").desc())
|
||||
|
||||
df_windowed = df.withColumn(
|
||||
"row_num",
|
||||
row_number().over(window_spec)
|
||||
).filter(col("row_num") == 1)
|
||||
```
|
||||
|
||||
### Aggregations
|
||||
```python
|
||||
from pyspark.sql.functions import sum, avg, count, max, min
|
||||
|
||||
df_agg = df.groupBy("category").agg(
|
||||
count("*").alias("total_count"),
|
||||
sum("amount").alias("total_amount"),
|
||||
avg("amount").alias("avg_amount")
|
||||
)
|
||||
```
|
||||
|
||||
## JDBC Connection Pattern
|
||||
|
||||
```python
|
||||
def get_connection_properties() -> dict:
|
||||
"""Get JDBC connection properties."""
|
||||
return {
|
||||
"user": os.getenv("DB_USER"),
|
||||
"password": os.getenv("DB_PASSWORD"),
|
||||
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
||||
}
|
||||
|
||||
# Use for JDBC reads
|
||||
df = spark.read.jdbc(
|
||||
url=jdbc_url,
|
||||
table="schema.table",
|
||||
properties=get_connection_properties()
|
||||
)
|
||||
```
|
||||
|
||||
## Session Management
|
||||
|
||||
### Get Optimized Spark Session
|
||||
```python
|
||||
from utilities.session_optimiser import SparkOptimiser
|
||||
|
||||
spark = SparkOptimiser.get_optimised_spark_session()
|
||||
```
|
||||
|
||||
### Reset Spark Context
|
||||
```python
|
||||
table_utilities.reset_spark_context()
|
||||
```
|
||||
|
||||
**When to use**:
|
||||
- Memory issues
|
||||
- Multiple Spark sessions
|
||||
- After large operations
|
||||
|
||||
## Memory Management
|
||||
|
||||
### Caching
|
||||
```python
|
||||
# Cache frequently accessed DataFrames
|
||||
df_cached = df.cache()
|
||||
|
||||
# Unpersist when done
|
||||
df_cached.unpersist()
|
||||
```
|
||||
|
||||
### Partitioning
|
||||
```python
|
||||
# Repartition for better parallelism
|
||||
df_repartitioned = df.repartition(10)
|
||||
|
||||
# Coalesce to reduce partitions
|
||||
df_coalesced = df.coalesce(1)
|
||||
```
|
||||
|
||||
## Common Pitfalls to Avoid
|
||||
|
||||
1. **Don't use print() statements** - Use logger methods
|
||||
2. **Don't read entire tables without filtering** - Filter early
|
||||
3. **Don't create DataFrames inside loops** - Collect and batch
|
||||
4. **Don't use collect() on large DataFrames** - Keep processing distributed on the cluster
|
||||
5. **Don't forget to unpersist cached DataFrames** - Memory leaks
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Filter early**: Reduce data volume ASAP
|
||||
2. **Use broadcast for small tables**: Optimize joins
|
||||
3. **Partition strategically**: Balance parallelism
|
||||
4. **Cache wisely**: Only for reused DataFrames
|
||||
5. **Use window functions**: Instead of self-joins
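
Tip 2 in practice: broadcasting a small lookup table avoids shuffling the large side of the join. A minimal sketch; the table names and join keys are illustrative, and the lookup table must genuinely be small enough to fit on each executor.

```python
from pyspark.sql.functions import broadcast
from utilities.session_optimiser import SparkOptimiser

spark = SparkOptimiser.get_optimised_spark_session()

incidents_df = spark.table("silver_fvms.s_fvms_incident")   # large table
lookup_df = spark.table("silver_cms.s_cms_case")             # assumed small lookup table

# The broadcast hint ships the small table to every executor instead of shuffling both sides.
joined_df = incidents_df.join(
    broadcast(lookup_df),
    incidents_df["case_id"] == lookup_df["s_cms_case_id"],    # illustrative join keys
    "left",
)
```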
|
||||
|
||||
## Code Quality Standards
|
||||
|
||||
### Type Hints
|
||||
```python
|
||||
from pyspark.sql import DataFrame
|
||||
|
||||
def process_data(df: DataFrame, table_name: str) -> DataFrame:
|
||||
return df.filter(col("active") == True)
|
||||
```
|
||||
|
||||
### Line Length
|
||||
**Maximum: 240 characters** (not standard 88/120)
|
||||
|
||||
### Blank Lines
|
||||
**No blank lines inside functions** - Keep functions compact
|
||||
|
||||
### Imports
|
||||
All imports at top of file, never inside functions
|
||||
```python
|
||||
from pyspark.sql import DataFrame
|
||||
from pyspark.sql.functions import col, lit, when
|
||||
from utilities.session_optimiser import TableUtilities, NotebookLogger
|
||||
```
|
||||
338
skills/schema-reference.md
Executable file
@@ -0,0 +1,338 @@
|
||||
---
|
||||
name: schema-reference
|
||||
description: Automatically reference and validate schemas from both legacy data sources and medallion layer data sources (bronze, silver, gold) before generating PySpark transformation code. This skill should be used proactively whenever PySpark ETL code generation is requested, ensuring accurate column names, data types, business logic, and cross-layer mappings are incorporated into the code.
|
||||
---
|
||||
|
||||
# Schema Reference
|
||||
|
||||
## Overview
|
||||
|
||||
This skill provides comprehensive schema reference capabilities for the medallion architecture data lake. It automatically queries DuckDB warehouse, parses data dictionary files, and extracts business logic before generating PySpark transformation code. This ensures all generated code uses correct column names, data types, relationships, and business rules.
|
||||
|
||||
**Use this skill proactively before generating any PySpark transformation code to avoid schema errors and ensure business logic compliance.**
|
||||
|
||||
## Workflow
|
||||
|
||||
When generating PySpark transformation code, follow this workflow:
|
||||
|
||||
### 1. Identify Source and Target Tables
|
||||
|
||||
Determine which tables are involved in the transformation:
|
||||
- **Bronze Layer**: Raw ingestion tables (e.g., `bronze_cms.b_cms_case`)
|
||||
- **Silver Layer**: Validated tables (e.g., `silver_cms.s_cms_case`)
|
||||
- **Gold Layer**: Analytical tables (e.g., `gold_data_model.g_x_mg_statsclasscount`)
|
||||
|
||||
### 2. Query Source Schema
|
||||
|
||||
Use `scripts/query_duckdb_schema.py` to get actual column names and data types from DuckDB:
|
||||
|
||||
```bash
|
||||
python scripts/query_duckdb_schema.py \
|
||||
--database bronze_cms \
|
||||
--table b_cms_case
|
||||
```
|
||||
|
||||
This returns:
|
||||
- Column names (exact spelling and case)
|
||||
- Data types (BIGINT, VARCHAR, TIMESTAMP, etc.)
|
||||
- Nullable constraints
|
||||
- Row count
|
||||
|
||||
**When to use**:
|
||||
- Before reading from any table
|
||||
- To verify column existence
|
||||
- To understand data types for casting operations
|
||||
- To check if table exists in warehouse
|
||||
|
||||
### 3. Extract Business Logic from Data Dictionary
|
||||
|
||||
Use `scripts/extract_data_dictionary.py` to read business rules and constraints:
|
||||
|
||||
```bash
|
||||
python scripts/extract_data_dictionary.py cms_case
|
||||
```
|
||||
|
||||
This returns:
|
||||
- Column descriptions with business context
|
||||
- Primary and foreign key relationships
|
||||
- Default values and common patterns
|
||||
- Data quality rules (e.g., "treat value 1 as NULL")
|
||||
- Validation constraints
|
||||
|
||||
**When to use**:
|
||||
- Before implementing transformations
|
||||
- To understand foreign key relationships for joins
|
||||
- To identify default values and data quality rules
|
||||
- To extract business logic that must be implemented
|
||||
|
||||
### 4. Compare Schemas Between Layers
|
||||
|
||||
Use `scripts/schema_comparison.py` to identify transformations needed:
|
||||
|
||||
```bash
|
||||
python scripts/schema_comparison.py \
|
||||
--source-db bronze_cms --source-table b_cms_case \
|
||||
--target-db silver_cms --target-table s_cms_case
|
||||
```
|
||||
|
||||
This returns:
|
||||
- Common columns between layers
|
||||
- Columns only in source (need to be dropped or transformed)
|
||||
- Columns only in target (need to be created)
|
||||
- Inferred column mappings (e.g., `cms_case_id` → `s_cms_case_id`)
|
||||
|
||||
**When to use**:
|
||||
- When transforming data between layers
|
||||
- To identify required column renaming
|
||||
- To discover missing columns that need to be added
|
||||
- To validate transformation completeness
|
||||
|
||||
### 5. Reference Schema Mapping Conventions
|
||||
|
||||
Read `references/schema_mapping_conventions.md` for layer-specific naming patterns:
|
||||
- How primary keys are renamed across layers
|
||||
- Foreign key consistency rules
|
||||
- Junction table naming conventions
|
||||
- Legacy warehouse mapping
|
||||
|
||||
**When to use**:
|
||||
- When uncertain about naming conventions
|
||||
- When working with cross-layer joins
|
||||
- When mapping to legacy warehouse schema
|
||||
|
||||
### 6. Reference Business Logic Patterns
|
||||
|
||||
Read `references/business_logic_patterns.md` for common transformation patterns:
|
||||
- Extracting business logic from data dictionaries
|
||||
- Choice list mapping (enum resolution)
|
||||
- Deduplication strategies
|
||||
- Cross-source joins
|
||||
- Conditional logic implementation
|
||||
- Aggregation with business rules
|
||||
|
||||
**When to use**:
|
||||
- When implementing business rules from data dictionaries
|
||||
- When applying standard transformations (deduplication, timestamp standardization)
|
||||
- When creating gold layer analytical tables
|
||||
- When uncertain how to implement a business rule
|
||||
|
||||
### 7. Generate PySpark Code
|
||||
|
||||
With schema and business logic information gathered, generate PySpark transformation code following the ETL class pattern:
|
||||
|
||||
```python
|
||||
class TableName:
|
||||
def __init__(self, bronze_table_name: str):
|
||||
self.bronze_table_name = bronze_table_name
|
||||
self.silver_database_name = f"silver_{self.bronze_table_name.split('.')[0].split('_')[-1]}"
|
||||
self.silver_table_name = self.bronze_table_name.split(".")[-1].replace("b_", "s_")
|
||||
self.extract_sdf = self.extract()
|
||||
self.transform_sdf = self.transform()
|
||||
self.load()
|
||||
|
||||
@synapse_error_print_handler
|
||||
def extract(self):
|
||||
logger.info(f"Extracting {self.bronze_table_name}")
|
||||
return spark.table(self.bronze_table_name)
|
||||
|
||||
@synapse_error_print_handler
|
||||
def transform(self):
|
||||
logger.info(f"Transforming {self.silver_table_name}")
|
||||
sdf = self.extract_sdf
|
||||
# Apply transformations based on schema and business logic
|
||||
# 1. Rename primary key (from schema comparison)
|
||||
# 2. Apply data quality rules (from data dictionary)
|
||||
# 3. Standardize timestamps (from schema)
|
||||
# 4. Deduplicate (based on business rules)
|
||||
# 5. Add row hash (standard practice)
|
||||
return sdf
|
||||
|
||||
@synapse_error_print_handler
|
||||
def load(self):
|
||||
logger.info(f"Loading {self.silver_database_name}.{self.silver_table_name}")
|
||||
TableUtilities.save_as_table(
|
||||
sdf=self.transform_sdf,
|
||||
table_name=self.silver_table_name,
|
||||
database_name=self.silver_database_name
|
||||
)
|
||||
logger.success(f"Successfully loaded {self.silver_database_name}.{self.silver_table_name}")
|
||||
```
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### List All Tables
|
||||
|
||||
See all available tables in DuckDB warehouse:
|
||||
|
||||
```bash
|
||||
python scripts/query_duckdb_schema.py --list
|
||||
python scripts/query_duckdb_schema.py --list --database silver_cms
|
||||
```
|
||||
|
||||
### Common Use Cases
|
||||
|
||||
**Use Case 1: Creating a Silver Table from Bronze**
|
||||
|
||||
```bash
|
||||
# 1. Check bronze schema
|
||||
python scripts/query_duckdb_schema.py --database bronze_cms --table b_cms_case
|
||||
|
||||
# 2. Get business logic
|
||||
python scripts/extract_data_dictionary.py cms_case
|
||||
|
||||
# 3. Compare with existing silver (if updating)
|
||||
python scripts/schema_comparison.py \
|
||||
--source-db bronze_cms --source-table b_cms_case \
|
||||
--target-db silver_cms --target-table s_cms_case
|
||||
|
||||
# 4. Generate PySpark code with correct schema and business logic
|
||||
```
|
||||
|
||||
**Use Case 2: Creating a Gold Table from Multiple Silver Tables**
|
||||
|
||||
```bash
|
||||
# 1. Check each silver table schema
|
||||
python scripts/query_duckdb_schema.py --database silver_cms --table s_cms_case
|
||||
python scripts/query_duckdb_schema.py --database silver_fvms --table s_fvms_incident
|
||||
|
||||
# 2. Get business logic for each source
|
||||
python scripts/extract_data_dictionary.py cms_case
|
||||
python scripts/extract_data_dictionary.py fvms_incident
|
||||
|
||||
# 3. Identify join keys from foreign key relationships in data dictionaries
|
||||
|
||||
# 4. Generate PySpark code with cross-source joins
|
||||
```
|
||||
|
||||
**Use Case 3: Updating an Existing Transformation**
|
||||
|
||||
```bash
|
||||
# 1. Compare current schemas
|
||||
python scripts/schema_comparison.py \
|
||||
--source-db bronze_cms --source-table b_cms_case \
|
||||
--target-db silver_cms --target-table s_cms_case
|
||||
|
||||
# 2. Identify new columns or changed business logic
|
||||
python scripts/extract_data_dictionary.py cms_case
|
||||
|
||||
# 3. Update PySpark code accordingly
|
||||
```
|
||||
|
||||
## Decision Tree
|
||||
|
||||
```
|
||||
User requests PySpark code generation
|
||||
|
|
||||
v
|
||||
[Skill Activated]
|
||||
|
|
||||
v
|
||||
What layer transformation?
|
||||
|
|
||||
+----+----+----+
|
||||
| | | |
|
||||
Bronze Silver Gold Other
|
||||
| | | |
|
||||
v v v v
|
||||
Query schema for all involved tables
|
||||
|
|
||||
v
|
||||
Extract business logic from data dictionaries
|
||||
|
|
||||
v
|
||||
Compare schemas if transforming between layers
|
||||
|
|
||||
v
|
||||
Reference mapping conventions and business logic patterns
|
||||
|
|
||||
v
|
||||
Generate PySpark code with:
|
||||
- Correct column names
|
||||
- Proper data types
|
||||
- Business logic implemented
|
||||
- Standard error handling
|
||||
- Proper logging
|
||||
```
|
||||
|
||||
## Key Principles
|
||||
|
||||
1. **Always verify schemas first**: Never assume column names or types without querying
|
||||
|
||||
2. **Extract business logic from data dictionaries**: Business rules must be implemented, not guessed
|
||||
|
||||
3. **Follow naming conventions**: Use schema mapping conventions for layer-specific prefixes
|
||||
|
||||
4. **Use TableUtilities**: Leverage existing utility methods for common operations
|
||||
|
||||
5. **Apply standard patterns**: Follow the ETL class pattern and use standard decorators
|
||||
|
||||
6. **Log comprehensively**: Include table/database names in all log messages
|
||||
|
||||
7. **Handle errors gracefully**: Use `@synapse_error_print_handler` decorator
|
||||
|
||||
## Environment Setup
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- DuckDB warehouse must exist at `/workspaces/data/warehouse.duckdb`
|
||||
- Data dictionary files must exist at `.claude/data_dictionary/`
|
||||
- Python packages: `duckdb` (for schema querying)
|
||||
|
||||
### Verify Setup
|
||||
|
||||
```bash
|
||||
# Check DuckDB warehouse exists
|
||||
ls -la /workspaces/data/warehouse.duckdb
|
||||
|
||||
# Check data dictionary exists
|
||||
ls -la .claude/data_dictionary/
|
||||
|
||||
# Build DuckDB warehouse if missing
|
||||
make build_duckdb
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
### scripts/
|
||||
|
||||
This skill includes three Python scripts for schema querying and analysis:
|
||||
|
||||
**`query_duckdb_schema.py`**
|
||||
- Query DuckDB warehouse for table schemas
|
||||
- List all tables in a database or across all databases
|
||||
- Get column names, data types, nullability, and row counts
|
||||
- Executable without loading into context
|
||||
|
||||
**`extract_data_dictionary.py`**
|
||||
- Parse data dictionary markdown files
|
||||
- Extract schema information, business logic, and constraints
|
||||
- Show primary key and foreign key relationships
|
||||
- Identify default values and data quality rules
|
||||
|
||||
**`schema_comparison.py`**
|
||||
- Compare schemas between layers (bronze → silver → gold)
|
||||
- Identify common columns, source-only columns, target-only columns
|
||||
- Infer column mappings based on naming conventions
|
||||
- Validate transformation completeness
|
||||
|
||||
### references/
|
||||
|
||||
This skill includes two reference documents for detailed guidance:
|
||||
|
||||
**`schema_mapping_conventions.md`**
|
||||
- Medallion architecture layer structure and conventions
|
||||
- Primary key and foreign key naming patterns
|
||||
- Table naming conventions across layers
|
||||
- Legacy warehouse mapping rules
|
||||
- Common transformation patterns between layers
|
||||
|
||||
**`business_logic_patterns.md`**
|
||||
- How to extract business logic from data dictionary descriptions
|
||||
- Common transformation patterns (deduplication, choice lists, timestamps)
|
||||
- ETL class pattern implementation with business logic
|
||||
- Testing business logic before deployment
|
||||
- Logging and error handling best practices
|
||||
|
||||
---
|
||||
|
||||
**Note**: This skill automatically activates when PySpark transformation code generation is requested. Scripts are used as needed to query schemas and extract business logic before code generation.
|
||||
209
skills/skill-creator.md
Executable file
@@ -0,0 +1,209 @@
|
||||
---
|
||||
name: skill-creator
|
||||
description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.
|
||||
license: Complete terms in LICENSE.txt
|
||||
---
|
||||
|
||||
# Skill Creator
|
||||
|
||||
This skill provides guidance for creating effective skills.
|
||||
|
||||
## About Skills
|
||||
|
||||
Skills are modular, self-contained packages that extend Claude's capabilities by providing
|
||||
specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific
|
||||
domains or tasks—they transform Claude from a general-purpose agent into a specialized agent
|
||||
equipped with procedural knowledge that no model can fully possess.
|
||||
|
||||
### What Skills Provide
|
||||
|
||||
1. Specialized workflows - Multi-step procedures for specific domains
|
||||
2. Tool integrations - Instructions for working with specific file formats or APIs
|
||||
3. Domain expertise - Company-specific knowledge, schemas, business logic
|
||||
4. Bundled resources - Scripts, references, and assets for complex and repetitive tasks
|
||||
|
||||
### Anatomy of a Skill
|
||||
|
||||
Every skill consists of a required SKILL.md file and optional bundled resources:
|
||||
|
||||
```
|
||||
skill-name/
|
||||
├── SKILL.md (required)
|
||||
│ ├── YAML frontmatter metadata (required)
|
||||
│ │ ├── name: (required)
|
||||
│ │ └── description: (required)
|
||||
│ └── Markdown instructions (required)
|
||||
└── Bundled Resources (optional)
|
||||
├── scripts/ - Executable code (Python/Bash/etc.)
|
||||
├── references/ - Documentation intended to be loaded into context as needed
|
||||
└── assets/ - Files used in output (templates, icons, fonts, etc.)
|
||||
```
|
||||
|
||||
#### SKILL.md (required)
|
||||
|
||||
**Metadata Quality:** The `name` and `description` in YAML frontmatter determine when Claude will use the skill. Be specific about what the skill does and when to use it. Use the third-person (e.g. "This skill should be used when..." instead of "Use this skill when...").
|
||||
|
||||
#### Bundled Resources (optional)
|
||||
|
||||
##### Scripts (`scripts/`)
|
||||
|
||||
Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.
|
||||
|
||||
- **When to include**: When the same code is being rewritten repeatedly or deterministic reliability is needed
|
||||
- **Example**: `scripts/rotate_pdf.py` for PDF rotation tasks
|
||||
- **Benefits**: Token efficient, deterministic, may be executed without loading into context
|
||||
- **Note**: Scripts may still need to be read by Claude for patching or environment-specific adjustments
|
||||
|
||||
##### References (`references/`)
|
||||
|
||||
Documentation and reference material intended to be loaded as needed into context to inform Claude's process and thinking.
|
||||
|
||||
- **When to include**: For documentation that Claude should reference while working
|
||||
- **Examples**: `references/finance.md` for financial schemas, `references/mnda.md` for company NDA template, `references/policies.md` for company policies, `references/api_docs.md` for API specifications
|
||||
- **Use cases**: Database schemas, API documentation, domain knowledge, company policies, detailed workflow guides
|
||||
- **Benefits**: Keeps SKILL.md lean, loaded only when Claude determines it's needed
|
||||
- **Best practice**: If files are large (>10k words), include grep search patterns in SKILL.md
|
||||
- **Avoid duplication**: Information should live in either SKILL.md or references files, not both. Prefer references files for detailed information unless it's truly core to the skill—this keeps SKILL.md lean while making information discoverable without hogging the context window. Keep only essential procedural instructions and workflow guidance in SKILL.md; move detailed reference material, schemas, and examples to references files.
|
||||
|
||||
##### Assets (`assets/`)
|
||||
|
||||
Files not intended to be loaded into context, but rather used within the output Claude produces.
|
||||
|
||||
- **When to include**: When the skill needs files that will be used in the final output
|
||||
- **Examples**: `assets/logo.png` for brand assets, `assets/slides.pptx` for PowerPoint templates, `assets/frontend-template/` for HTML/React boilerplate, `assets/font.ttf` for typography
|
||||
- **Use cases**: Templates, images, icons, boilerplate code, fonts, sample documents that get copied or modified
|
||||
- **Benefits**: Separates output resources from documentation, enables Claude to use files without loading them into context
|
||||
|
||||
### Progressive Disclosure Design Principle
|
||||
|
||||
Skills use a three-level loading system to manage context efficiently:
|
||||
|
||||
1. **Metadata (name + description)** - Always in context (~100 words)
|
||||
2. **SKILL.md body** - When skill triggers (<5k words)
|
||||
3. **Bundled resources** - As needed by Claude (Unlimited*)
|
||||
|
||||
*Unlimited because scripts can be executed without being loaded into the context window.
|
||||
|
||||
## Skill Creation Process
|
||||
|
||||
To create a skill, follow the "Skill Creation Process" in order, skipping steps only if there is a clear reason why they are not applicable.
|
||||
|
||||
### Step 1: Understanding the Skill with Concrete Examples
|
||||
|
||||
Skip this step only when the skill's usage patterns are already clearly understood. It remains valuable even when working with an existing skill.
|
||||
|
||||
To create an effective skill, clearly understand concrete examples of how the skill will be used. This understanding can come from either direct user examples or generated examples that are validated with user feedback.
|
||||
|
||||
For example, when building an image-editor skill, relevant questions include:
|
||||
|
||||
- "What functionality should the image-editor skill support? Editing, rotating, anything else?"
|
||||
- "Can you give some examples of how this skill would be used?"
|
||||
- "I can imagine users asking for things like 'Remove the red-eye from this image' or 'Rotate this image'. Are there other ways you imagine this skill being used?"
|
||||
- "What would a user say that should trigger this skill?"
|
||||
|
||||
To avoid overwhelming users, avoid asking too many questions in a single message. Start with the most important questions and follow up as needed.
|
||||
|
||||
Conclude this step when there is a clear sense of the functionality the skill should support.
|
||||
|
||||
### Step 2: Planning the Reusable Skill Contents
|
||||
|
||||
To turn concrete examples into an effective skill, analyze each example by:
|
||||
|
||||
1. Considering how to execute on the example from scratch
|
||||
2. Identifying what scripts, references, and assets would be helpful when executing these workflows repeatedly
|
||||
|
||||
Example: When building a `pdf-editor` skill to handle queries like "Help me rotate this PDF," the analysis shows:
|
||||
|
||||
1. Rotating a PDF requires re-writing the same code each time
|
||||
2. A `scripts/rotate_pdf.py` script would be helpful to store in the skill
|
||||
|
||||
Example: When designing a `frontend-webapp-builder` skill for queries like "Build me a todo app" or "Build me a dashboard to track my steps," the analysis shows:
|
||||
|
||||
1. Writing a frontend webapp requires the same boilerplate HTML/React each time
|
||||
2. An `assets/hello-world/` template containing the boilerplate HTML/React project files would be helpful to store in the skill
|
||||
|
||||
Example: When building a `big-query` skill to handle queries like "How many users have logged in today?" the analysis shows:
|
||||
|
||||
1. Querying BigQuery requires re-discovering the table schemas and relationships each time
|
||||
2. A `references/schema.md` file documenting the table schemas would be helpful to store in the skill
|
||||
|
||||
To establish the skill's contents, analyze each concrete example to create a list of the reusable resources to include: scripts, references, and assets.
|
||||
|
||||
### Step 3: Initializing the Skill
|
||||
|
||||
At this point, it is time to actually create the skill.
|
||||
|
||||
Skip this step only if the skill being developed already exists, and iteration or packaging is needed. In this case, continue to the next step.
|
||||
|
||||
When creating a new skill from scratch, always run the `init_skill.py` script. The script conveniently generates a new template skill directory that automatically includes everything a skill requires, making the skill creation process much more efficient and reliable.
|
||||
|
||||
Usage:
|
||||
|
||||
```bash
|
||||
scripts/init_skill.py <skill-name> --path <output-directory>
|
||||
```
|
||||
|
||||
The script:
|
||||
|
||||
- Creates the skill directory at the specified path
|
||||
- Generates a SKILL.md template with proper frontmatter and TODO placeholders
|
||||
- Creates example resource directories: `scripts/`, `references/`, and `assets/`
|
||||
- Adds example files in each directory that can be customized or deleted
|
||||
|
||||
After initialization, customize or remove the generated SKILL.md and example files as needed.
|
||||
|
||||
### Step 4: Edit the Skill
|
||||
|
||||
When editing the (newly-generated or existing) skill, remember that the skill is being created for another instance of Claude to use. Focus on including information that would be beneficial and non-obvious to Claude. Consider what procedural knowledge, domain-specific details, or reusable assets would help another Claude instance execute these tasks more effectively.
|
||||
|
||||
#### Start with Reusable Skill Contents
|
||||
|
||||
To begin implementation, start with the reusable resources identified above: `scripts/`, `references/`, and `assets/` files. Note that this step may require user input. For example, when implementing a `brand-guidelines` skill, the user may need to provide brand assets or templates to store in `assets/`, or documentation to store in `references/`.
|
||||
|
||||
Also, delete any example files and directories not needed for the skill. The initialization script creates example files in `scripts/`, `references/`, and `assets/` to demonstrate structure, but most skills won't need all of them.
|
||||
|
||||
#### Update SKILL.md
|
||||
|
||||
**Writing Style:** Write the entire skill using **imperative/infinitive form** (verb-first instructions), not second person. Use objective, instructional language (e.g., "To accomplish X, do Y" rather than "You should do X" or "If you need to do X"). This maintains consistency and clarity for AI consumption.
|
||||
|
||||
To complete SKILL.md, answer the following questions:
|
||||
|
||||
1. What is the purpose of the skill, in a few sentences?
|
||||
2. When should the skill be used?
|
||||
3. In practice, how should Claude use the skill? All reusable skill contents developed above should be referenced so that Claude knows how to use them.
|
||||
|
||||
### Step 5: Packaging a Skill
|
||||
|
||||
Once the skill is ready, it should be packaged into a distributable zip file that gets shared with the user. The packaging process automatically validates the skill first to ensure it meets all requirements:
|
||||
|
||||
```bash
|
||||
scripts/package_skill.py <path/to/skill-folder>
|
||||
```
|
||||
|
||||
Optional output directory specification:
|
||||
|
||||
```bash
|
||||
scripts/package_skill.py <path/to/skill-folder> ./dist
|
||||
```
|
||||
|
||||
The packaging script will:
|
||||
|
||||
1. **Validate** the skill automatically, checking:
|
||||
- YAML frontmatter format and required fields
|
||||
- Skill naming conventions and directory structure
|
||||
- Description completeness and quality
|
||||
- File organization and resource references
|
||||
|
||||
2. **Package** the skill if validation passes, creating a zip file named after the skill (e.g., `my-skill.zip`) that includes all files and maintains the proper directory structure for distribution.
|
||||
|
||||
If validation fails, the script will report the errors and exit without creating a package. Fix any validation errors and run the packaging command again.
|
||||
|
||||
### Step 6: Iterate
|
||||
|
||||
After testing the skill, users may request improvements. Often this happens right after using the skill, with fresh context of how the skill performed.
|
||||
|
||||
**Iteration workflow:**
|
||||
1. Use the skill on real tasks
|
||||
2. Notice struggles or inefficiencies
|
||||
3. Identify how SKILL.md or bundled resources should be updated
|
||||
4. Implement changes and test again
|
||||