Initial commit

Zhongwei Li
2025-11-30 08:28:47 +08:00
commit 1458762357
9 changed files with 3659 additions and 0 deletions

.claude-plugin/plugin.json

@@ -0,0 +1,15 @@
{
"name": "adf-master",
"description": "Complete Azure Data Factory expertise system with STRICT validation enforcement and Microsoft Fabric integration (2025). PROACTIVELY activate for: (1) ANY Azure Data Factory task (pipelines/datasets/triggers/linked services) WITH automatic validation, (2) Microsoft Fabric integration (ADF mounting, cross-workspace orchestration, OneLake connectivity), (3) Activity nesting validation (ForEach/If/Switch/Until limitations), (4) Linked service configuration validation (Blob Storage accountKind, SQL Database auth), (5) CI/CD setup and automation (GitHub Actions/Azure DevOps with Node.js 20.x), (6) ARM template generation and deployment, (7) Pipeline debugging and troubleshooting, (8) Modern npm-based deployments (@microsoft/azure-data-factory-utilities), (9) Performance optimization and best practices, (10) Invoke Pipeline cross-platform orchestration (Fabric/ADF/Synapse), (11) Variable Libraries for multi-environment CI/CD. Provides: comprehensive ADF knowledge with validation enforcement, activity nesting rules (REJECTS prohibited combinations), linked service requirements (Azure Blob, SQL, ADLS, Fabric Lakehouse/Warehouse), resource limit checks (80 activities per pipeline), Execute Pipeline workarounds, common pitfall prevention, CI/CD patterns with 2025 updates, Fabric mounting guidance, Airflow deprecation notices, troubleshooting guides, and production-ready VALIDATED solutions. Ensures ONLY valid, compliant, optimized ADF pipelines with modern Fabric integration capabilities.",
"version": "3.3.0",
"author": {
"name": "Josiah Siegel",
"email": "JosiahSiegel@users.noreply.github.com"
},
"skills": [
"./skills"
],
"agents": [
"./agents"
]
}

README.md

@@ -0,0 +1,3 @@
# adf-master
Complete Azure Data Factory expertise system with STRICT validation enforcement and Microsoft Fabric integration (2025). PROACTIVELY activate for: (1) ANY Azure Data Factory task (pipelines/datasets/triggers/linked services) WITH automatic validation, (2) Microsoft Fabric integration (ADF mounting, cross-workspace orchestration, OneLake connectivity), (3) Activity nesting validation (ForEach/If/Switch/Until limitations), (4) Linked service configuration validation (Blob Storage accountKind, SQL Database auth), (5) CI/CD setup and automation (GitHub Actions/Azure DevOps with Node.js 20.x), (6) ARM template generation and deployment, (7) Pipeline debugging and troubleshooting, (8) Modern npm-based deployments (@microsoft/azure-data-factory-utilities), (9) Performance optimization and best practices, (10) Invoke Pipeline cross-platform orchestration (Fabric/ADF/Synapse), (11) Variable Libraries for multi-environment CI/CD. Provides: comprehensive ADF knowledge with validation enforcement, activity nesting rules (REJECTS prohibited combinations), linked service requirements (Azure Blob, SQL, ADLS, Fabric Lakehouse/Warehouse), resource limit checks (80 activities per pipeline), Execute Pipeline workarounds, common pitfall prevention, CI/CD patterns with 2025 updates, Fabric mounting guidance, Airflow deprecation notices, troubleshooting guides, and production-ready VALIDATED solutions. Ensures ONLY valid, compliant, optimized ADF pipelines with modern Fabric integration capabilities.

agents/adf-expert.md

@@ -0,0 +1,282 @@
---
agent: true
description: Complete Azure Data Factory expertise system. PROACTIVELY activate for: (1) ANY Azure Data Factory task (pipelines/datasets/triggers/linked services), (2) Pipeline design and architecture, (3) Data transformation logic, (4) Performance troubleshooting, (5) Best practices guidance, (6) Resource configuration, (7) Integration runtime setup, (8) Data flow creation. Provides: comprehensive ADF knowledge, Microsoft best practices, design patterns, troubleshooting expertise, performance optimization, production-ready solutions, and STRICT validation enforcement for activity nesting rules and linked service configurations.
---
## 🚨 CRITICAL GUIDELINES
### Windows File Path Requirements
**MANDATORY: Always Use Backslashes on Windows for File Paths**
When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).
**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
### Documentation Guidelines
**NEVER create new documentation files unless explicitly requested by the user.**
- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep repository root clean - only README.md unless user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when user specifically asks for documentation
---
# Azure Data Factory Expert Agent
## CRITICAL: ALWAYS VALIDATE BEFORE CREATING
**BEFORE creating ANY Azure Data Factory pipeline, linked service, or activity:**
1. **Load the validation rules skill** to access comprehensive limitation knowledge
2. **VALIDATE** all activity nesting against permitted/prohibited combinations
3. **REJECT** any configuration that violates ADF limitations
4. **SUGGEST** Execute Pipeline workaround for prohibited nesting scenarios
5. **VERIFY** linked service properties match authentication method requirements
## Core Expertise Areas
### 1. Pipeline Design and Architecture with Validation
- **FIRST**: Validate activity nesting against ADF limitations
- Design efficient, scalable pipeline architectures
- Implement metadata-driven patterns for dynamic processing
- Create reusable pipeline templates
- Design error handling and retry strategies
- Implement logging and monitoring patterns
- **ENFORCE** Execute Pipeline pattern for prohibited nesting scenarios
### 2. Data Transformation
- Design complex transformation logic using Data Flows
- Optimize data flow performance with proper partitioning
- Implement SCD (Slowly Changing Dimension) patterns
- Create incremental load patterns
- Design aggregation and join strategies
### 3. Integration Patterns
- Source-to-sink data movement patterns
- Real-time vs batch processing decisions
- Event-driven architecture with triggers
- Hybrid cloud and on-premises integration
- Multi-cloud data integration
- Microsoft Fabric OneLake and Warehouse integration
### 4. Performance Optimization
- DIU (Data Integration Unit) sizing and optimization
- Partitioning strategies for large datasets
- Staging and compression techniques
- Query optimization at source and sink
- Parallel execution patterns
### 5. Security and Compliance
- Managed Identity implementation (system-assigned and user-assigned)
- Key Vault integration for secrets
- Network security with Private Endpoints
- Data encryption at rest and in transit
- RBAC and access control
## Approach to Problem Solving
### 1. Understand Requirements
- Ask clarifying questions about data sources, targets, and transformations
- Understand volume, velocity, and variety of data
- Identify SLAs and performance requirements
- Consider compliance and security needs
### 2. VALIDATE Before Design (CRITICAL STEP)
- **CHECK** if proposed architecture violates activity nesting rules
- **IDENTIFY** any ForEach/If/Switch/Until nesting conflicts
- **VERIFY** linked service authentication requirements
- **CONFIRM** resource limits won't be exceeded (80 activities per pipeline)
- **REJECT** invalid configurations immediately with clear explanation
### 3. Design Solution
- Propose architecture that meets requirements AND complies with ADF limitations
- Explain trade-offs of different approaches
- Recommend best practices and patterns
- **SUGGEST** Execute Pipeline pattern when nesting limitations encountered
- Consider cost and performance implications
### 4. Provide Implementation Guidance
- Give detailed, production-ready code examples
- Include parameterization and error handling
- Add monitoring and logging
- Document dependencies and prerequisites
- **VALIDATE** final implementation against all ADF rules
### 5. Optimization and Best Practices
- Identify optimization opportunities
- Suggest performance improvements
- Recommend cost-saving measures
- Ensure security best practices
- **ENFORCE** validation rules throughout optimization
## ADF Components You Specialize In
### Linked Services (WITH VALIDATION)
**Azure Blob Storage:**
- Account Key, SAS Token, Service Principal, Managed Identity authentication
- **CRITICAL**: accountKind REQUIRED for managed identity/service principal
- Common pitfalls: Missing accountKind, expired SAS tokens, soft-deleted blobs
**Azure SQL Database:**
- SQL Authentication, Service Principal, Managed Identity
- Connection string parameters: retry logic, pooling, encryption
- Serverless tier considerations
**Microsoft Fabric (2025 NEW):**
- Fabric Lakehouse connector (tables and files)
- Fabric Warehouse connector (T-SQL data warehousing)
- OneLake shortcuts for zero-copy integration
**Other Connectors:**
- ADLS Gen2, Azure Synapse, Cosmos DB
- REST APIs, HTTP endpoints
- On-premises via Self-Hosted IR
- ServiceNow V2 (V1 End of Support)
- Enhanced PostgreSQL and Snowflake
### Activities (WITH NESTING VALIDATION)
**Control Flow - Nesting Rules:**
- Permitted: ForEach → If Condition, ForEach → Switch, Until → If Condition, Until → Switch (permitted example sketched below)
- Prohibited: ForEach → ForEach, Until → Until, If → ForEach, Switch → ForEach, If → If, Switch → Switch
- Workaround: Execute Pipeline activity for all prohibited combinations
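A minimal sketch of a permitted combination (ForEach containing an If Condition); activity, parameter, and expression names are illustrative:
```json
{
  "name": "ForEach_ProcessFiles",
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.Files",
    "activities": [
      {
        "name": "If_IsCsv",
        "type": "IfCondition",
        "typeProperties": {
          "expression": "@endswith(item().name, '.csv')",
          "ifTrueActivities": [
            // CSV-specific copy/transform logic here
          ],
          "ifFalseActivities": [
            // Fallback handling here
          ]
        }
      }
    ]
  }
}
```
The reverse (an If Condition containing a ForEach) is prohibited and must be refactored with the Execute Pipeline pattern shown in the validation rules skill.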
**Data Movement and Transformation:**
- Copy Activity: DIUs (2-256), staging, partitioning
- Data Flow: Spark 3.3, column name length ≤ 128 characters
- Lookup: 5000 rows max, 4 MB size limit
- ForEach: 50 concurrent max, no Set Variable in parallel mode
- **Invoke Pipeline (NEW 2025)**: Cross-platform calls across ADF, Synapse, and Fabric
### Triggers
- Schedule (cron expressions), Tumbling window (backfill), Event-based (Blob created), Manual
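As an illustration, a minimal schedule trigger definition; the trigger name, recurrence values, and referenced pipeline are placeholders:
```json
{
  "name": "TR_DailyLoad",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_DailyLoad",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```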
### Integration Runtimes
- Azure IR: Cloud-to-cloud
- Self-Hosted IR: On-premises connectivity
- Azure-SSIS IR: SSIS packages in Azure
## Best Practices You Enforce
### CRITICAL Validation Rules (ALWAYS ENFORCED)
1. **Activity Nesting Validation**: REJECT prohibited combinations
2. **Linked Service Validation**: VERIFY required properties (accountKind, etc.)
3. **Resource Limits**: ENFORCE 80 activities per pipeline, ForEach batchCount ≤ 50
4. **Variable Scope**: PREVENT Set Variable in parallel ForEach
### Standard Best Practices
5. **Parameterization**: Everything configurable should be parameterized
6. **Error Handling**: Comprehensive retry and logging
7. **Logging**: Execution details for troubleshooting
8. **Monitoring**: Alerts for failures and performance
9. **Security**: Managed Identity and Key Vault (no hardcoded secrets)
10. **Testing**: Debug mode before production
11. **Incremental Loads**: Avoid full refreshes
12. **Modularity**: Reusable child pipelines via Execute Pipeline
13. **Fabric Integration**: Leverage OneLake shortcuts for zero-copy
## Validation Enforcement Protocol
**CRITICAL: You MUST actively validate and reject invalid configurations**
### Validation Workflow
1. **Analyze user request** for pipeline/activity structure
2. **Identify all control flow activities** (ForEach, If, Switch, Until)
3. **Check nesting hierarchy** against permitted/prohibited rules
4. **Validate linked service** properties match authentication type
5. **Verify resource limits** (80 activities, 50 parameters, etc.)
6. **REJECT immediately** if violations detected with clear explanation
7. **SUGGEST alternatives** (Execute Pipeline pattern for nesting issues)
### Validation Response Template
**When detecting prohibited nesting:**
- INVALID PIPELINE STRUCTURE DETECTED
- Issue: the specific nesting violation
- Location: pipeline name, parent activity, child activity
- ADF Limitation: explain the specific rule with a Microsoft Learn reference
- RECOMMENDED SOLUTION: provide the Execute Pipeline workaround with an example
**When detecting linked service configuration error:**
- INVALID LINKED SERVICE CONFIGURATION
- Issue: the missing or incorrect property
- Linked Service: name and type
- ADF Requirement: explain the requirement and why it is needed
- REQUIRED FIX: show the correct configuration
- Common Pitfall: explain why the error is common and how to avoid it
## Communication Style
- **VALIDATE FIRST**: Always check against ADF limitations before solutions
- **REJECT CLEARLY**: Immediately identify violations with rule references
- **PROVIDE ALTERNATIVES**: Suggest Execute Pipeline or other workarounds
- Explain concepts clearly with examples
- Provide production-ready code, not just snippets
- Highlight trade-offs and considerations
- Include performance and cost implications
- Reference Microsoft documentation when relevant
- **ENFORCE RULES**: Never allow invalid configurations
## Documentation Resources You Reference
- Microsoft Learn: https://learn.microsoft.com/en-us/azure/data-factory/
- Best Practices: https://learn.microsoft.com/en-us/azure/data-factory/concepts-best-practices
- Pricing: https://azure.microsoft.com/en-us/pricing/details/data-factory/
- Troubleshooting: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-troubleshoot-guide
- Fabric Integration: https://learn.microsoft.com/en-us/fabric/data-factory/
You are ready to help with any Azure Data Factory task, from simple copy activities to complex enterprise data integration architectures, including modern Fabric OneLake integration. Always provide production-ready, secure, and optimized solutions following Microsoft best practices with STRICT validation enforcement.

plugin.lock.json

@@ -0,0 +1,65 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:JosiahSiegel/claude-code-marketplace:plugins/adf-master",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "ae513a965466a06f09537230fa7f8027d8bff232",
"treeHash": "c7d9f983699da56cdf8b941d48e837f7487de2ae7fec7c3d7bb34ba09c36774d",
"generatedAt": "2025-11-28T10:11:51.386649Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "adf-master",
"description": "Complete Azure Data Factory expertise system with STRICT validation enforcement and Microsoft Fabric integration (2025). PROACTIVELY activate for: (1) ANY Azure Data Factory task (pipelines/datasets/triggers/linked services) WITH automatic validation, (2) Microsoft Fabric integration (ADF mounting, cross-workspace orchestration, OneLake connectivity), (3) Activity nesting validation (ForEach/If/Switch/Until limitations), (4) Linked service configuration validation (Blob Storage accountKind, SQL Database auth), (5) CI/CD setup and automation (GitHub Actions/Azure DevOps with Node.js 20.x), (6) ARM template generation and deployment, (7) Pipeline debugging and troubleshooting, (8) Modern npm-based deployments (@microsoft/azure-data-factory-utilities), (9) Performance optimization and best practices, (10) Invoke Pipeline cross-platform orchestration (Fabric/ADF/Synapse), (11) Variable Libraries for multi-environment CI/CD. Provides: comprehensive ADF knowledge with validation enforcement, activity nesting rules (REJECTS prohibited combinations), linked service requirements (Azure Blob, SQL, ADLS, Fabric Lakehouse/Warehouse), resource limit checks (80 activities per pipeline), Execute Pipeline workarounds, common pitfall prevention, CI/CD patterns with 2025 updates, Fabric mounting guidance, Airflow deprecation notices, troubleshooting guides, and production-ready VALIDATED solutions. Ensures ONLY valid, compliant, optimized ADF pipelines with modern Fabric integration capabilities.",
"version": "3.3.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "2d31fd646450fbbbebb90c9c47d7594fd9078c1094bb02bb7873e16c5a1f9921"
},
{
"path": "agents/adf-expert.md",
"sha256": "116cd222fe4a1fac146552b39e90ab71e0c66032f92e598fc4cee2580aa47919"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "c6b3dc43190bcf4dda8ea10a60608947143e80109758d28b1f63027ce8542bd2"
},
{
"path": "skills/databricks-2025.md",
"sha256": "fb0a8eeb8f17ea76987c5547b3aee9e131993ab8840a34f93febfc2f395e021f"
},
{
"path": "skills/fabric-onelake-2025.md",
"sha256": "59ca4734a7b0c5fa8a849e97c2591571a3d3abcace2088c03b5db394de79a7e7"
},
{
"path": "skills/adf-validation-rules/SKILL.md",
"sha256": "6b61ab143d954764783ef368b41d93ec3863b8657abc531805c14a030c30cf77"
},
{
"path": "skills/windows-git-bash-compatibility/SKILL.md",
"sha256": "47279b688ce6960ccefb4b56594d5b9176216a34da9b5c9be6ca8a02de7de1d1"
},
{
"path": "skills/adf-master/SKILL.md",
"sha256": "da041e23abf060c591d0ff1107af92ce2acb2092e1af0e489a5b74635df59b93"
}
],
"dirSha256": "c7d9f983699da56cdf8b941d48e837f7487de2ae7fec7c3d7bb34ba09c36774d"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

skills/adf-master/SKILL.md

@@ -0,0 +1,616 @@
---
name: adf-master
description: Comprehensive Azure Data Factory knowledge base with official documentation sources, CI/CD methods, deployment patterns, and troubleshooting resources
---
# Azure Data Factory Master Knowledge Base
## 🚨 CRITICAL GUIDELINES
### Windows File Path Requirements
**MANDATORY: Always Use Backslashes on Windows for File Paths**
When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).
**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
### Documentation Guidelines
**NEVER create new documentation files unless explicitly requested by the user.**
- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep repository root clean - only README.md unless user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when user specifically asks for documentation
---
This skill provides comprehensive reference information about Azure Data Factory, including official documentation sources, CI/CD deployment methods, and troubleshooting resources. Use this to access detailed ADF knowledge on-demand.
## 🚨 CRITICAL 2025 UPDATE: Deprecated Features
### Apache Airflow Workflow Orchestration Manager - DEPRECATED
**Status:** Available only for existing customers as of early 2025
**Retirement Date:** Not yet announced, but feature is officially deprecated
**Impact:** New customers cannot provision Apache Airflow in Azure Data Factory
**Official Deprecation Notice:**
- Apache Airflow Workflow Orchestration Manager is deprecated with no retirement date set
- Only existing deployments can continue using this feature
- No new Airflow integrations can be created in ADF
**Migration Path:**
- **Recommended:** Migrate to Fabric Data Factory with native Airflow support
- **Alternative:** Use standalone Apache Airflow deployments (Azure Container Instances, AKS, or VM-based)
- **Alternative:** Migrate orchestration logic to native ADF pipelines with control flow activities
**Why Deprecated:**
- Microsoft focus shifted to Fabric Data Factory as the unified data integration platform
- Fabric provides modern orchestration capabilities superseding Airflow integration
- Limited adoption and maintenance burden for standalone Airflow feature in ADF
**Action Required:**
- If using Airflow in ADF: Plan migration within 12-18 months
- For new projects: Do NOT use Airflow in ADF - use Fabric or native ADF patterns
- Monitor Microsoft announcements for official retirement timeline
**Reference:**
- Microsoft Roadmap: https://www.directionsonmicrosoft.com/roadmaps/ref/azure-data-factory-roadmap/
## 🆕 2025 Feature Updates
### Microsoft Fabric Integration (GA June 2025)
**ADF Mounting in Fabric:**
- Bring existing ADF pipelines into Fabric workspaces without rebuilding
- General Availability as of June 2025
- Seamless integration enables hybrid ADF + Fabric workflows
**Cross-Workspace Pipeline Orchestration:**
- New **Invoke Pipeline** activity supports cross-platform calls
- Invoke pipelines across Fabric, Azure Data Factory, and Synapse
- Managed VNet support for secure cross-workspace communication
**Variable Libraries:**
- Environment-specific variables for CI/CD automation
- Automatic value substitution during workspace promotion
- Eliminates separate parameter files per environment
**Connector Enhancements:**
- ServiceNow V2 (V1 End of Support)
- Enhanced PostgreSQL and Snowflake connectors
- Native OneLake connectivity for zero-copy integration
### Node.js 20.x Requirement for CI/CD
**CRITICAL:** As of 2025, npm package `@microsoft/azure-data-factory-utilities` requires Node.js 20.x
**Breaking Change:**
- Older Node.js versions (14.x, 16.x, 18.x) may cause package incompatibility errors
- Update CI/CD pipelines to use Node.js 20.x or compatible versions
**GitHub Actions:**
```yaml
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20.x'
```
**Azure DevOps:**
```yaml
- task: UseNode@1
inputs:
version: '20.x'
```
## Official Documentation Sources
### Primary Microsoft Learn Resources
**Main Documentation Hub:**
- URL: https://learn.microsoft.com/en-us/azure/data-factory/
- Last Updated: February 2025
- Coverage: Complete ADF documentation including tutorials, concepts, how-to guides, and reference materials
- Key Topics: Pipelines, datasets, triggers, linked services, data flows, integration runtimes, monitoring
**Introduction to Azure Data Factory:**
- URL: https://learn.microsoft.com/en-us/azure/data-factory/introduction
- Summary: Managed cloud service for complex hybrid ETL, ELT, and data integration projects
- Key Features: 90+ built-in connectors, serverless architecture, code-free UI, single-pane monitoring
### Context7 Library Documentation
**Library ID:** `/websites/learn_microsoft_en-us_azure_data-factory`
- Trust Score: 7.5
- Code Snippets: 10,839
- Topics: CI/CD, ARM templates, pipeline patterns, data flows, monitoring, troubleshooting
**How to Access:**
```
Use Context7 MCP tool to fetch latest documentation:
mcp__context7__get-library-docs:
- context7CompatibleLibraryID: /websites/learn_microsoft_en-us_azure_data-factory
- topic: "CI/CD continuous integration deployment pipelines ARM templates"
- tokens: 8000
```
## CI/CD Deployment Methods
### Modern Automated Approach (Recommended)
**npm Package:** `@microsoft/azure-data-factory-utilities`
- **Latest Version:** 1.0.3+ (check npm for current version)
- **npm URL:** https://www.npmjs.com/package/@microsoft/azure-data-factory-utilities
- **Node.js Requirement:** Version 20.x or compatible
**Key Features:**
- Validates ADF resources independently of service
- Generates ARM templates programmatically
- Enables true CI/CD without manual publish button
- Supports preview mode for selective trigger management
**package.json Configuration:**
```json
{
"scripts": {
"build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index",
"build-preview": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index --preview"
},
"dependencies": {
"@microsoft/azure-data-factory-utilities": "^1.0.3"
}
}
```
**Commands:**
```bash
# Validate resources
npm run build validate <rootFolder> <factoryId>
# Generate ARM templates
npm run build export <rootFolder> <factoryId> [outputFolder]
# Preview mode (only stop/start modified triggers)
npm run build-preview export <rootFolder> <factoryId> [outputFolder]
```
**Official Documentation:**
- URL: https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery-improvements
- Last Updated: January 2025
- Topics: Setup, configuration, build commands, CI/CD integration
### Traditional Manual Approach (Legacy)
**Method:** Git integration + Publish button
**Process:**
1. Configure Git integration in ADF UI (Dev environment only)
2. Make changes in ADF Studio
3. Click "Publish" button to generate ARM templates
4. Templates saved to `adf_publish` branch
5. Release pipelines deploy from `adf_publish` branch
**When to Use:**
- Migrating from existing setup
- No build pipeline infrastructure
- Simple deployments without validation
**Limitations:**
- Requires manual publish action
- No validation until publish
- Not true CI/CD (manual step required)
- Can't validate on pull requests
**Migration Path:** Modern approach recommended for new implementations
## ARM Template Deployment
### PowerShell Deployment
**Primary Command:** `New-AzResourceGroupDeployment`
**Syntax:**
```powershell
New-AzResourceGroupDeployment `
-ResourceGroupName "<resource-group-name>" `
-TemplateFile "ARMTemplateForFactory.json" `
-TemplateParameterFile "ARMTemplateParametersForFactory.<environment>.json" `
-factoryName "<factory-name>" `
-Mode Incremental `
-Verbose
```
**Validation:**
```powershell
Test-AzResourceGroupDeployment `
-ResourceGroupName "<resource-group-name>" `
-TemplateFile "ARMTemplateForFactory.json" `
-TemplateParameterFile "ARMTemplateParametersForFactory.<environment>.json" `
-factoryName "<factory-name>"
```
**What-If Analysis:**
```powershell
New-AzResourceGroupDeployment `
-ResourceGroupName "<resource-group-name>" `
-TemplateFile "ARMTemplateForFactory.json" `
-TemplateParameterFile "ARMTemplateParametersForFactory.<environment>.json" `
-factoryName "<factory-name>" `
-WhatIf
```
### Azure CLI Deployment
**Primary Command:** `az deployment group create`
**Syntax:**
```bash
az deployment group create \
--resource-group <resource-group-name> \
--template-file ARMTemplateForFactory.json \
--parameters ARMTemplateParametersForFactory.<environment>.json \
--parameters factoryName=<factory-name> \
--mode Incremental
```
**Validation:**
```bash
az deployment group validate \
--resource-group <resource-group-name> \
--template-file ARMTemplateForFactory.json \
--parameters ARMTemplateParametersForFactory.<environment>.json \
--parameters factoryName=<factory-name>
```
**What-If Analysis:**
```bash
az deployment group what-if \
--resource-group <resource-group-name> \
--template-file ARMTemplateForFactory.json \
--parameters ARMTemplateParametersForFactory.<environment>.json \
--parameters factoryName=<factory-name>
```
## PrePostDeploymentScript
### Current Version: Ver2
**Location:** https://github.com/Azure/Azure-DataFactory/blob/main/SamplesV2/ContinuousIntegrationAndDelivery/PrePostDeploymentScript.Ver2.ps1
**Key Improvement in Ver2:**
- Turns off/on ONLY triggers that have been modified
- Ver1 stopped/started ALL triggers (slower, more disruptive)
- Compares trigger payloads to determine changes
**Download Command:**
```bash
# Linux/macOS/Git Bash
curl -o PrePostDeploymentScript.Ver2.ps1 https://raw.githubusercontent.com/Azure/Azure-DataFactory/main/SamplesV2/ContinuousIntegrationAndDelivery/PrePostDeploymentScript.Ver2.ps1
# PowerShell
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/Azure/Azure-DataFactory/main/SamplesV2/ContinuousIntegrationAndDelivery/PrePostDeploymentScript.Ver2.ps1" -OutFile "PrePostDeploymentScript.Ver2.ps1"
```
### Parameters
**Pre-Deployment (Stop Triggers):**
```powershell
./PrePostDeploymentScript.Ver2.ps1 `
-armTemplate "<path-to-ARMTemplateForFactory.json>" `
-ResourceGroupName "<resource-group-name>" `
-DataFactoryName "<factory-name>" `
-predeployment $true `
-deleteDeployment $false
```
**Post-Deployment (Start Triggers & Cleanup):**
```powershell
./PrePostDeploymentScript.Ver2.ps1 `
-armTemplate "<path-to-ARMTemplateForFactory.json>" `
-ResourceGroupName "<resource-group-name>" `
-DataFactoryName "<factory-name>" `
-predeployment $false `
-deleteDeployment $true
```
### PowerShell Requirements
**Version:** PowerShell Core (7.0+) recommended
- Azure DevOps: Use `pwsh: true` in AzurePowerShell@5 task
- Locally: Use `pwsh` command, not `powershell`
**Modules Required:**
- Az.DataFactory
- Az.Resources
**Official Documentation:**
- URL: https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery-sample-script
- Last Updated: January 2025
## GitHub Actions CI/CD
### Official Resources
**Medium Article (Recent 2025):**
- URL: https://medium.com/microsoftazure/azure-data-factory-build-and-deploy-with-new-ci-cd-flow-using-github-actions-cd46c95054e0
- Author: Jared Zagelbaum (Microsoft Azure)
- Topics: Modern CI/CD flow, npm package usage, GitHub Actions setup
**Microsoft Community Hub:**
- URL: https://techcommunity.microsoft.com/blog/fasttrackforazureblog/azure-data-factory-cicd-with-github-actions/3768493
- Topics: End-to-end GitHub Actions setup, workload identity federation
**Community Blog (February 2025):**
- URL: https://linusdata.blog/2025/03/14/automating-azure-data-factory-deployments-with-github-actions/
- Topics: Practical implementation guide, troubleshooting tips
### Key GitHub Actions
**Essential Actions:**
- `actions/checkout@v4` - Checkout repository
- `actions/setup-node@v4` - Setup Node.js
- `actions/upload-artifact@v4` - Publish ARM templates
- `actions/download-artifact@v4` - Download ARM templates in deploy workflow
- `azure/login@v2` - Authenticate to Azure
- `azure/arm-deploy@v2` - Deploy ARM templates
- `azure/powershell@v2` - Run PrePostDeploymentScript
### Authentication Methods
**Service Principal (JSON credentials):**
```json
{
"clientId": "<GUID>",
"clientSecret": "<STRING>",
"subscriptionId": "<GUID>",
"tenantId": "<GUID>"
}
```
Store in GitHub secret: `AZURE_CREDENTIALS`
**Workload Identity Federation (More secure):**
- No secrets stored
- Uses OIDC (OpenID Connect)
- Recommended for production
- Setup: https://learn.microsoft.com/en-us/azure/developer/github/connect-from-azure
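A minimal build workflow sketch combining the actions and authentication described above; the workflow path, folder layout (`adf-resources`), output folder, and factory resource ID are assumptions to adapt to your repository:
```yaml
# .github/workflows/adf-build.yml (assumed path; package.json assumed at repo root)
name: adf-build
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20.x'
      - name: Install ADF utilities
        run: npm install
      - name: Validate and export ARM templates
        run: >
          npm run build export adf-resources
          /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DataFactory/factories/<factory-name>
          ArmTemplate
      - name: Publish ARM templates
        uses: actions/upload-artifact@v4
        with:
          name: adf-arm-templates
          path: ArmTemplate   # assumes export wrote templates to ./ArmTemplate
```
The published artifact is then consumed by a separate deploy workflow that runs `azure/login`, the PrePostDeploymentScript, and `azure/arm-deploy`.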
## Azure DevOps CI/CD
### Official Resources
**Microsoft Learn:**
- URL: https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery-automate-azure-pipelines
- Topics: Build pipeline, release pipeline, service connections, variable groups
**Community Guides:**
- Adam Marczak Blog: https://marczak.io/posts/2023/02/quick-cicd-for-data-factory/
- Topics: Quick setup, best practices, folder structure
**Towards Data Science:**
- URL: https://towardsdatascience.com/azure-data-factory-ci-cd-made-simple-building-and-deploying-your-arm-templates-with-azure-devops-30c30595afa5
- Topics: ARM template build and deployment workflow
### Key Azure DevOps Tasks
**Build Pipeline Tasks:**
- `UseNode@1` - Install Node.js
- `Npm@1` - Install packages, run build commands
- `PublishPipelineArtifact@1` - Publish ARM templates
**Release Pipeline Tasks:**
- `DownloadPipelineArtifact@2` - Download ARM templates
- `AzurePowerShell@5` - Run PrePostDeploymentScript
- `AzureResourceManagerTemplateDeployment@3` - Deploy ARM template
### Service Connection Requirements
**Permissions Needed:**
- Data Factory Contributor (on all Data Factories)
- Contributor (on Resource Groups)
- Key Vault access policies (if using secrets)
**Configuration:**
- Project Settings → Service connections → New service connection
- Type: Azure Resource Manager
- Authentication: Service Principal (recommended) or Managed Identity
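A minimal build stage sketch using the tasks above; paths, the pool image, the output folder, and the factory resource ID are placeholders:
```yaml
# azure-pipelines/build.yml (assumed path; package.json assumed at repo root)
trigger:
  - main
pool:
  vmImage: ubuntu-latest
steps:
  - task: UseNode@1
    inputs:
      version: '20.x'
  - task: Npm@1
    displayName: Install ADF utilities
    inputs:
      command: install
  - task: Npm@1
    displayName: Validate and export ARM templates
    inputs:
      command: custom
      customCommand: run build export adf-resources /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DataFactory/factories/<factory-name> ArmTemplate
  - task: PublishPipelineArtifact@1
    inputs:
      targetPath: ArmTemplate   # assumes export wrote templates to ./ArmTemplate
      artifact: adf-arm-templates
```
A release pipeline then downloads the artifact, runs the PrePostDeploymentScript, and deploys the ARM template per environment.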
## Troubleshooting Resources
### Official Troubleshooting Guide
**URL:** https://learn.microsoft.com/en-us/azure/data-factory/ci-cd-github-troubleshoot-guide
**Last Updated:** January 2025
**Common Issues Covered:**
1. Template parameter validation errors
2. Integration Runtime type cannot be changed
3. ARM template size exceeds 4MB limit
4. Git connection problems
5. Authentication failures
6. Deployment errors
### Diagnostic Logs
**Enable Diagnostic Settings:**
```
Azure Portal → Data Factory → Diagnostic settings → Add diagnostic setting
Send to: Log Analytics workspace
Logs to Enable:
- PipelineRuns
- TriggerRuns
- ActivityRuns
- SandboxPipelineRuns
- SandboxActivityRuns
```
**Kusto Queries for Troubleshooting:**
```kusto
// Failed pipeline runs in last 24 hours
ADFPipelineRun
| where Status == "Failed"
| where TimeGenerated > ago(24h)
| project TimeGenerated, PipelineName, RunId, Status, ErrorMessage, Parameters
| order by TimeGenerated desc
// Failed CI/CD deployments
ADFActivityRun
| where ActivityType == "ExecutePipeline"
| where Status == "Failed"
| where TimeGenerated > ago(7d)
| project TimeGenerated, PipelineName, ActivityName, ErrorCode, ErrorMessage
| order by TimeGenerated desc
// Performance analysis
ADFActivityRun
| where TimeGenerated > ago(7d)
| extend DurationMinutes = datetime_diff('minute', End, Start)
| summarize AvgDuration = avg(DurationMinutes) by ActivityType, ActivityName
| where AvgDuration > 10
| order by AvgDuration desc
```
### Common Error Patterns
**Error: "Template parameters are not valid"**
- Cause: Deleted triggers still referenced in parameters
- Solution: Regenerate ARM template or use PrePostDeploymentScript cleanup
**Error: "Updating property type is not supported"**
- Cause: Trying to change Integration Runtime type
- Solution: Delete and recreate IR (not in-place update)
**Error: "Operation timed out"**
- Cause: Network connectivity, large data volume, insufficient compute
- Solution: Increase timeout, optimize query, increase DIUs
**Error: "Authentication failed"**
- Cause: Service principal expired, missing permissions, wrong credentials
- Solution: Verify credentials, check role assignments, renew if expired
## Best Practices
### Repository Structure
**Recommended Folder Layout:**
```
repository-root/
├── adf-resources/ # ADF JSON files (if using npm approach)
│ ├── dataset/
│ ├── pipeline/
│ ├── trigger/
│ ├── linkedService/
│ └── integrationRuntime/
├── .github/
│ └── workflows/ # GitHub Actions workflows
│ ├── adf-build.yml
│ └── adf-deploy.yml
├── azure-pipelines/ # Azure DevOps pipelines
│ ├── build.yml
│ └── release.yml
├── parameters/ # Environment-specific parameters
│ ├── ARMTemplateParametersForFactory.dev.json
│ ├── ARMTemplateParametersForFactory.test.json
│ └── ARMTemplateParametersForFactory.prod.json
├── package.json # npm configuration
└── README.md
```
### Git Configuration
**Only Configure Git on Development ADF:**
- Development: Git-integrated for source control
- Test: CI/CD deployment only (no Git)
- Production: CI/CD deployment only (no Git)
**Rationale:** Prevents accidental manual changes in higher environments
### Multi-Environment Strategy
```
Environment Flow:
Dev (Git) → Build (ARM templates) → Test → Approval → Production
```
**Parameter Management:**
- Separate parameter file per environment
- Store secrets in Azure Key Vault
- Reference Key Vault in parameter files
- Never commit secrets to source control
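For example, a sketch of an environment parameter file that pulls a secret from Key Vault at deployment time; the parameter name, vault resource ID, and secret name are hypothetical and must match your generated ARM template:
```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": {
      "value": "adf-prod"
    },
    "AzureSqlDatabase_connectionString": {
      "reference": {
        "keyVault": {
          "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.KeyVault/vaults/<vault-name>"
        },
        "secretName": "sql-connection-string"
      }
    }
  }
}
```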
### Monitoring and Alerting
**Set up alerts for:**
- Build pipeline failures
- Deployment failures
- Pipeline run failures
- Performance degradation
- Cost anomalies
**Recommended Tools:**
- Azure Monitor (Metrics and Alerts)
- Log Analytics (Kusto queries)
- Application Insights (for custom logging)
- Azure Advisor (optimization recommendations)
## Additional Resources
### GitHub Repositories
**Official Azure Data Factory Samples:**
- URL: https://github.com/Azure/Azure-DataFactory
- Path: SamplesV2/ContinuousIntegrationAndDelivery/
- Contents: PrePostDeploymentScript.Ver2.ps1, example pipelines, documentation
**Community Examples:**
- Search GitHub for "azure-data-factory-cicd" for real-world examples
- Many organizations publish their CI/CD patterns as reference
### Community Support
**Microsoft Q&A:**
- URL: https://learn.microsoft.com/en-us/answers/tags/130/azure-data-factory
- Active community, Microsoft employees respond
**Stack Overflow:**
- Tag: `azure-data-factory`
- Large knowledge base of resolved issues
**Azure Status:**
- URL: https://status.azure.com
- Check for service outages and incidents
## When to Fetch Latest Information
**Situations requiring current documentation:**
1. npm package version updates
2. New ADF features or activities
3. Changes to ARM template schema
4. Updates to PrePostDeploymentScript
5. New GitHub Actions or Azure DevOps tasks
6. Breaking changes or deprecations
**How to Fetch:**
- Use WebFetch for Microsoft Learn articles
- Check npm for latest package version
- Use Context7 for comprehensive topic coverage
- Review Azure Data Factory GitHub for script updates
This knowledge base should be your starting point for all Azure Data Factory questions. Always verify critical information with the latest official documentation when making production decisions.

skills/adf-validation-rules/SKILL.md

@@ -0,0 +1,604 @@
---
name: adf-validation-rules
description: Comprehensive Azure Data Factory validation rules, activity nesting limitations, linked service requirements, and edge-case handling guidance
---
## 🚨 CRITICAL GUIDELINES
### Windows File Path Requirements
**MANDATORY: Always Use Backslashes on Windows for File Paths**
When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).
**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
### Documentation Guidelines
**NEVER create new documentation files unless explicitly requested by the user.**
- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep repository root clean - only README.md unless user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when user specifically asks for documentation
---
# Azure Data Factory Validation Rules and Limitations
## 🚨 CRITICAL: Activity Nesting Limitations
Azure Data Factory has **STRICT** nesting rules for control flow activities. Violating these rules will cause pipeline failures or prevent pipeline creation.
### Supported Control Flow Activities for Nesting
Four control flow activities support nested activities:
- **ForEach**: Iterates over collections and executes activities in a loop
- **If Condition**: Branches based on true/false evaluation
- **Until**: Implements do-until loops with timeout options
- **Switch**: Evaluates activities matching case conditions
### ✅ PERMITTED Nesting Combinations
| Parent Activity | Can Contain | Notes |
|----------------|-------------|-------|
| **ForEach** | If Condition | ✅ Allowed |
| **ForEach** | Switch | ✅ Allowed |
| **Until** | If Condition | ✅ Allowed |
| **Until** | Switch | ✅ Allowed |
### ❌ PROHIBITED Nesting Combinations
| Parent Activity | CANNOT Contain | Reason |
|----------------|----------------|---------|
| **If Condition** | ForEach | ❌ Not supported - use Execute Pipeline workaround |
| **If Condition** | Switch | ❌ Not supported - use Execute Pipeline workaround |
| **If Condition** | Until | ❌ Not supported - use Execute Pipeline workaround |
| **If Condition** | Another If | ❌ Cannot nest If within If |
| **Switch** | ForEach | ❌ Not supported - use Execute Pipeline workaround |
| **Switch** | If Condition | ❌ Not supported - use Execute Pipeline workaround |
| **Switch** | Until | ❌ Not supported - use Execute Pipeline workaround |
| **Switch** | Another Switch | ❌ Cannot nest Switch within Switch |
| **ForEach** | Another ForEach | ❌ Single level only - use Execute Pipeline workaround |
| **Until** | Another Until | ❌ Single level only - use Execute Pipeline workaround |
| **ForEach** | Until | ❌ Single level only - use Execute Pipeline workaround |
| **Until** | ForEach | ❌ Single level only - use Execute Pipeline workaround |
### 🚫 Special Activity Restrictions
**Validation Activity**:
- **CANNOT** be placed inside ANY nested activity
- **CANNOT** be used within ForEach, If, Switch, or Until activities
- ✅ Must be at pipeline root level only
### 🔧 Workaround: Execute Pipeline Pattern
**The ONLY supported workaround for prohibited nesting combinations:**
Instead of direct nesting, use the **Execute Pipeline Activity** to call a child pipeline:
```json
{
"name": "ParentPipeline_WithIfCondition",
"activities": [
{
"name": "IfCondition_Parent",
"type": "IfCondition",
"typeProperties": {
"expression": "@equals(pipeline().parameters.ProcessData, 'true')",
"ifTrueActivities": [
{
"name": "ExecuteChildPipeline_WithForEach",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "ChildPipeline_ForEachLoop",
"type": "PipelineReference"
},
"parameters": {
"ItemList": "@pipeline().parameters.Items"
}
}
}
]
}
}
]
}
```
**Child Pipeline Structure:**
```json
{
"name": "ChildPipeline_ForEachLoop",
"parameters": {
"ItemList": {"type": "array"}
},
"activities": [
{
"name": "ForEach_InChildPipeline",
"type": "ForEach",
"typeProperties": {
"items": "@pipeline().parameters.ItemList",
"activities": [
// Your ForEach logic here
]
}
}
]
}
```
**Why This Works:**
- Each pipeline can have ONE level of nesting
- Execute Pipeline creates a new pipeline context
- Child pipeline gets its own nesting level allowance
- Enables unlimited depth through pipeline chaining
## 🔢 Activity and Resource Limits
### Pipeline Limits
| Resource | Limit | Notes |
|----------|-------|-------|
| **Activities per pipeline** | 80 | Includes inner activities for containers |
| **Parameters per pipeline** | 50 | - |
| **ForEach concurrent iterations** | 50 (maximum) | Set via `batchCount` property |
| **ForEach items** | 100,000 | - |
| **Lookup activity rows** | 5,000 | Maximum rows returned |
| **Lookup activity size** | 4 MB | Maximum size of returned data |
| **Web activity timeout** | 1 hour | Default timeout for Web activities |
| **Copy activity timeout** | 7 days | Maximum execution time |
### ForEach Activity Configuration
```json
{
"name": "ForEachActivity",
"type": "ForEach",
"typeProperties": {
"items": "@pipeline().parameters.ItemList",
"isSequential": false, // false = parallel execution
"batchCount": 50, // Max 50 concurrent iterations
"activities": [
// Nested activities
]
}
}
```
**Critical Considerations:**
- `isSequential: true` → Executes one item at a time (slow but predictable)
- `isSequential: false` → Executes up to `batchCount` items in parallel
- Maximum `batchCount` is **50** regardless of setting
- **Cannot use Set Variable activity** inside parallel ForEach (variable scope is pipeline-level)
### Set Variable Activity Limitations
**CANNOT** use `Set Variable` inside ForEach with `isSequential: false`
- Reason: Variables are pipeline-scoped, not ForEach-scoped
- Multiple parallel iterations would cause race conditions
- **Alternative**: Use `Append Variable` with array type, or use sequential execution
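A minimal sketch of the Append Variable alternative inside a parallel ForEach; pipeline, variable, and parameter names are illustrative:
```json
{
  "name": "PL_CollectResults",
  "properties": {
    "parameters": {
      "ItemList": {"type": "array"}
    },
    "variables": {
      "Results": {"type": "Array"}
    },
    "activities": [
      {
        "name": "ForEach_Parallel",
        "type": "ForEach",
        "typeProperties": {
          "items": "@pipeline().parameters.ItemList",
          "isSequential": false,
          "batchCount": 50,
          "activities": [
            {
              "name": "AppendItemResult",
              "type": "AppendVariable",
              "typeProperties": {
                "variableName": "Results",
                "value": "@item()"
              }
            }
          ]
        }
      }
    ]
  }
}
```
Append Variable is safe here because each iteration only adds to the array; read the final value only after the ForEach completes.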
## 📊 Linked Services: Azure Blob Storage
### Authentication Methods
#### 1. Account Key (Basic)
```json
{
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
}
}
}
```
**⚠️ Limitations:**
- Secondary Blob service endpoints are **NOT supported**
- **Security Risk**: Account keys should be stored in Azure Key Vault
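A sketch of the same linked service with the connection string resolved from Key Vault instead of stored inline; the Key Vault linked service name and secret name are assumptions:
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "LS_AzureKeyVault",
        "type": "LinkedServiceReference"
      },
      "secretName": "blob-connection-string"
    }
  }
}
```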
#### 2. Shared Access Signature (SAS)
```json
{
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "https://<account>.blob.core.windows.net/<container>?<SAS-token>"
}
}
}
```
**Critical Requirements:**
- Dataset `folderPath` must be **absolute path from container level**
- SAS token expiry **must extend beyond pipeline execution**
- SAS URI path must align with dataset configuration
#### 3. Service Principal
```json
{
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<account>.blob.core.windows.net",
"accountKind": "StorageV2", // REQUIRED for service principal
"servicePrincipalId": "<client-id>",
"servicePrincipalCredential": {
"type": "SecureString",
"value": "<client-secret>"
},
"tenant": "<tenant-id>"
}
}
```
**Critical Requirements:**
- `accountKind` **MUST** be set (StorageV2, BlobStorage, or BlockBlobStorage)
- Service Principal requires **Storage Blob Data Reader** (source) or **Storage Blob Data Contributor** (sink) role
- **NOT compatible** with soft-deleted blob accounts in Data Flow
#### 4. Managed Identity (Recommended)
```json
{
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<account>.blob.core.windows.net",
"accountKind": "StorageV2" // REQUIRED for managed identity
},
"connectVia": {
"referenceName": "AutoResolveIntegrationRuntime",
"type": "IntegrationRuntimeReference"
}
}
```
**Critical Requirements:**
- `accountKind` **MUST** be specified (cannot be empty or "Storage")
- ❌ Empty or "Storage" account kind will cause Data Flow failures
- Managed identity must have **Storage Blob Data Reader/Contributor** role assigned
- For Storage firewall: **Must enable "Allow trusted Microsoft services"**
### Common Blob Storage Pitfalls
| Issue | Cause | Solution |
|-------|-------|----------|
| Data Flow fails with managed identity | `accountKind` empty or "Storage" | Set `accountKind` to StorageV2 |
| Secondary endpoint doesn't work | Using account key auth | Not supported - use different auth method |
| SAS token expired during run | Token expiry too short | Extend SAS token validity period |
| Cannot access $logs container | System container not visible in UI | Use direct path reference |
| Soft-deleted blobs inaccessible | Service principal/managed identity | Use account key or SAS instead |
| Private endpoint connection fails | Wrong endpoint for Data Flow | Ensure ADLS Gen2 private endpoint exists |
## 📊 Linked Services: Azure SQL Database
### Authentication Methods
#### 1. SQL Authentication
```json
{
"type": "AzureSqlDatabase",
"typeProperties": {
"server": "<server-name>.database.windows.net",
"database": "<database-name>",
"authenticationType": "SQL",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
```
**Best Practice:**
- Store password in Azure Key Vault
- Use connection string with Key Vault reference
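A sketch of the SQL authentication linked service with the password referenced from Key Vault; the Key Vault linked service name and secret name are assumptions:
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "SQL",
    "userName": "<username>",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "LS_AzureKeyVault",
        "type": "LinkedServiceReference"
      },
      "secretName": "sql-user-password"
    }
  }
}
```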
#### 2. Service Principal
```json
{
"type": "AzureSqlDatabase",
"typeProperties": {
"server": "<server-name>.database.windows.net",
"database": "<database-name>",
"authenticationType": "ServicePrincipal",
"servicePrincipalId": "<client-id>",
"servicePrincipalCredential": {
"type": "SecureString",
"value": "<client-secret>"
},
"tenant": "<tenant-id>"
}
}
```
**Requirements:**
- Microsoft Entra admin must be configured on SQL server
- Service principal must have contained database user created
- Grant appropriate role: `db_datareader`, `db_datawriter`, etc.
#### 3. Managed Identity
```json
{
"type": "AzureSqlDatabase",
"typeProperties": {
"server": "<server-name>.database.windows.net",
"database": "<database-name>",
"authenticationType": "SystemAssignedManagedIdentity"
}
}
```
**Requirements:**
- Create contained database user for managed identity
- Grant appropriate database roles
- Configure firewall to allow Azure services (or specific IP ranges)
### SQL Database Configuration Best Practices
#### Connection String Parameters
```
Server=tcp:<server>.database.windows.net,1433;
Database=<database>;
Encrypt=mandatory; // Options: mandatory, optional, strict
TrustServerCertificate=false;
ConnectTimeout=30;
CommandTimeout=120;
Pooling=true;
ConnectRetryCount=3;
ConnectRetryInterval=10;
```
**Critical Parameters:**
- `Encrypt`: Default is `mandatory` (recommended)
- `Pooling`: Set to `false` if experiencing idle connection issues
- `ConnectRetryCount`: Recommended for transient fault handling
- `ConnectRetryInterval`: Seconds between retries
### Common SQL Database Pitfalls
| Issue | Cause | Solution |
|-------|-------|----------|
| Serverless tier auto-paused | Pipeline doesn't wait for resume | Implement retry logic or keep-alive |
| Connection pool timeout | Idle connections closed | Add `Pooling=false` or configure retry |
| Firewall blocks connection | IP not whitelisted | Add Azure IR IPs or enable Azure services |
| Always Encrypted fails in Data Flow | Not supported for sink | Use service principal/managed identity in copy activity |
| Decimal precision loss | Copy supports up to 28 precision | Use string type for higher precision |
| Parallel copy not working | No partition configuration | Enable physical or dynamic range partitioning |
### Performance Optimization
#### Parallel Copy Configuration
```json
{
"source": {
"type": "AzureSqlSource",
"partitionOption": "PhysicalPartitionsOfTable" // or "DynamicRange"
},
"parallelCopies": 8, // Recommended: (DIU or IR nodes) × (2 to 4)
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
}
}
}
```
**Partition Options:**
- `PhysicalPartitionsOfTable`: Uses SQL Server physical partitions
- `DynamicRange`: Creates logical partitions based on column values
- `None`: No partitioning (default)
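For example, a sketch of a dynamic range source configuration; the partition column and bounds are placeholders that should match the source table:
```json
{
  "source": {
    "type": "AzureSqlSource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
      "partitionColumnName": "OrderId",
      "partitionLowerBound": "1",
      "partitionUpperBound": "1000000"
    }
  },
  "parallelCopies": 8
}
```
With dynamic range partitioning, the copy activity splits the bounded range into one query per parallel copy.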
**Staging Best Practices:**
- Always use staging for large data movements (> 1GB)
- Use PolyBase or COPY statement for best performance
- Parquet format recommended for staging files
## 🔍 Data Flow Limitations
### General Limits
- **Column name length**: 128 characters maximum
- **Row size**: 1 MB maximum (some sinks like SQL have lower limits)
- **String column size**: Varies by sink (SQL: 8000 for varchar, 4000 for nvarchar)
### Transformation-Specific Limits
| Transformation | Limitation |
|----------------|------------|
| **Lookup** | Cache size limited by cluster memory |
| **Join** | Large joins may cause memory errors |
| **Pivot** | Maximum 10,000 unique values |
| **Window** | Requires partitioning for large datasets |
### Performance Considerations
- **Partitioning**: Always partition large datasets before transformations
- **Broadcast**: Use broadcast hint for small dimension tables
- **Sink optimization**: Enable table option "Recreate" instead of "Truncate" for better performance
## 🛡️ Validation Checklist for Pipeline Creation
### Before Creating Pipeline
- [ ] Verify activity nesting follows permitted combinations
- [ ] Check ForEach activities don't contain other ForEach/Until
- [ ] Verify If/Switch activities don't contain ForEach/Until/If/Switch
- [ ] Ensure Validation activities are at pipeline root level only
- [ ] Confirm total activities < 80 per pipeline
- [ ] Verify no Set Variable activities in parallel ForEach
### Linked Service Validation
- [ ] **Blob Storage**: If using managed identity/service principal, `accountKind` is set
- [ ] **SQL Database**: Authentication method matches security requirements
- [ ] **All services**: Secrets stored in Key Vault, not hardcoded
- [ ] **All services**: Firewall rules configured for integration runtime IPs
- [ ] **Network**: Private endpoints configured if using VNet integration
### Activity Configuration Validation
- [ ] **ForEach**: `batchCount` ≤ 50 if parallel execution
- [ ] **Lookup**: Query returns < 5000 rows and < 4 MB data
- [ ] **Copy**: DIU configured appropriately (2-256 for Azure IR)
- [ ] **Copy**: Staging enabled for large data movements
- [ ] **All activities**: Timeout values appropriate for expected execution time
- [ ] **All activities**: Retry logic configured for transient failures
### Data Flow Validation
- [ ] Column names ≤ 128 characters
- [ ] Source query doesn't return > 1 MB per row
- [ ] Partitioning configured for large datasets
- [ ] Sink has appropriate schema and data type mappings
- [ ] Staging linked service configured for optimal performance
## 🔍 Automated Validation Script
**CRITICAL: Always run automated validation before committing or deploying ADF pipelines!**
The adf-master plugin includes a comprehensive PowerShell validation script that checks for ALL the rules and limitations documented above.
### Using the Validation Script
**Location:** `${CLAUDE_PLUGIN_ROOT}/scripts/validate-adf-pipelines.ps1`
**Basic usage:**
```powershell
# From the root of your ADF repository
pwsh -File validate-adf-pipelines.ps1
```
**With custom paths:**
```powershell
pwsh -File validate-adf-pipelines.ps1 `
-PipelinePath "path/to/pipeline" `
-DatasetPath "path/to/dataset"
```
**With strict mode (additional warnings):**
```powershell
pwsh -File validate-adf-pipelines.ps1 -Strict
```
### What the Script Validates
The automated validation script checks for issues that Microsoft's official `@microsoft/azure-data-factory-utilities` package does **NOT** validate:
1. **Activity Nesting Violations:**
- ForEach → ForEach, Until, Validation
- Until → Until, ForEach, Validation
- IfCondition → ForEach, If, IfCondition, Switch, Until, Validation
- Switch → ForEach, If, IfCondition, Switch, Until, Validation
2. **Resource Limits:**
- Pipeline activity count (max 120, warn at 100)
- Pipeline parameter count (max 50)
- Pipeline variable count (max 50)
- ForEach batchCount limit (max 50, warn at 30 in strict mode)
3. **Variable Scope Violations:**
- SetVariable in parallel ForEach (causes race conditions)
- Proper AppendVariable vs SetVariable usage
4. **Dataset Configuration Issues:**
- Missing fileName or wildcardFileName for file-based datasets
- AzureBlobFSLocation missing required fileSystem property
- Missing required properties for DelimitedText, Json, Parquet types
5. **Copy Activity Validations:**
- Source/sink type compatibility with dataset types
- Lookup activity firstRowOnly=false warnings (5000 row/4MB limits)
- Blob file dependencies (additionalColumns logging pattern)
### Integration with CI/CD
**GitHub Actions example:**
```yaml
- name: Validate ADF Pipelines
run: |
pwsh -File validate-adf-pipelines.ps1 -PipelinePath pipeline -DatasetPath dataset
shell: pwsh
```
**Azure DevOps example:**
```yaml
- task: PowerShell@2
displayName: 'Validate ADF Pipelines'
inputs:
filePath: 'validate-adf-pipelines.ps1'
arguments: '-PipelinePath pipeline -DatasetPath dataset'
pwsh: true
```
### Command Reference
Use the `/adf-validate` command to run the validation script with proper guidance:
```bash
/adf-validate
```
This command will:
1. Detect your ADF repository structure
2. Run the validation script with appropriate paths
3. Parse and explain any errors or warnings found
4. Provide specific solutions for each violation
5. Recommend next actions based on results
6. Suggest CI/CD integration patterns
### Exit Codes
- **0**: Validation passed (no errors)
- **1**: Validation failed (errors found - DO NOT DEPLOY)
### Best Practices
1. **Run validation before every commit** to catch issues early (see the pre-commit hook sketch below)
2. **Add validation to CI/CD pipeline** to prevent invalid deployments
3. **Use strict mode during development** for additional warnings
4. **Re-validate after bulk changes** or generated pipelines
5. **Document validation exceptions** if you must bypass a warning
6. **Share validation results with team** to prevent repeated mistakes
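Best practice 1 can be automated with a Git pre-commit hook. A minimal sketch, assuming the validation script sits at the repository root and the default `pipeline`/`dataset` folders are used:
```bash
#!/usr/bin/env bash
# Save as .git/hooks/pre-commit and mark it executable (chmod +x .git/hooks/pre-commit).
# The validator exits 1 when errors are found, which aborts the commit.
set -e
pwsh -File validate-adf-pipelines.ps1 -PipelinePath pipeline -DatasetPath dataset
echo "✅ ADF validation passed"
```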
## 🚨 CRITICAL: Enforcement Protocol
**When creating or modifying ADF pipelines:**
1. **ALWAYS validate activity nesting** against the permitted/prohibited table
2. **REJECT** any attempt to create prohibited nesting combinations
3. **SUGGEST** Execute Pipeline workaround for complex nesting needs
4. **VALIDATE** linked service authentication matches the connector type
5. **CHECK** all limits (activities, parameters, ForEach iterations, etc.)
6. **VERIFY** required properties are set (e.g., `accountKind` for managed identity)
7. **WARN** about common pitfalls specific to the connector being used
**Example Validation Response:**
```
❌ INVALID PIPELINE STRUCTURE DETECTED:
Issue: ForEach activity contains another ForEach activity
Location: Pipeline "PL_DataProcessing" → ForEach "OuterLoop" → ForEach "InnerLoop"
This violates Azure Data Factory nesting rules:
- ForEach activities support only a SINGLE level of nesting
- You CANNOT nest ForEach within ForEach
✅ RECOMMENDED SOLUTION:
Use the Execute Pipeline pattern:
1. Create a child pipeline with the inner ForEach logic
2. Replace the inner ForEach with an Execute Pipeline activity
3. Pass required parameters to the child pipeline
Would you like me to generate the refactored pipeline structure?
```
## 📚 Reference Documentation
**Official Microsoft Learn Resources:**
- Activity nesting: https://learn.microsoft.com/en-us/azure/data-factory/concepts-nested-activities
- Blob Storage connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage
- SQL Database connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database
- Pipeline limits: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#data-factory-limits
**Last Updated:** 2025-01-24 (Based on official Microsoft documentation)
This validation rules skill MUST be consulted before creating or modifying ANY Azure Data Factory pipeline to ensure compliance with platform limitations and best practices.

812
skills/databricks-2025.md Normal file
View File

@@ -0,0 +1,812 @@
---
name: databricks-2025
description: Databricks Job activity and 2025 Azure Data Factory connectors
---
## 🚨 CRITICAL GUIDELINES
### Windows File Path Requirements
**MANDATORY: Always Use Backslashes on Windows for File Paths**
When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).
**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
### Documentation Guidelines
**NEVER create new documentation files unless explicitly requested by the user.**
- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep repository root clean - only README.md unless user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when user specifically asks for documentation
---
# Azure Data Factory Databricks Integration 2025
## Databricks Job Activity (Recommended 2025)
**🚨 CRITICAL UPDATE (2025):** The Databricks Job activity is now the **ONLY recommended method** for orchestrating Databricks in ADF. Microsoft strongly recommends migrating from legacy Notebook, Python, and JAR activities.
### Why Databricks Job Activity?
**Old Pattern (Notebook Activity - ❌ LEGACY):**
```json
{
"name": "RunNotebook",
"type": "DatabricksNotebook", // ❌ DEPRECATED - Migrate to DatabricksJob
"linkedServiceName": { "referenceName": "DatabricksLinkedService" },
"typeProperties": {
"notebookPath": "/Users/user@example.com/MyNotebook",
"baseParameters": { "param1": "value1" }
}
}
```
**New Pattern (Databricks Job Activity - ✅ CURRENT 2025):**
```json
{
"name": "RunDatabricksWorkflow",
"type": "DatabricksJob", // ✅ CORRECT activity type (NOT DatabricksSparkJob)
"linkedServiceName": { "referenceName": "DatabricksLinkedService" },
"typeProperties": {
"jobId": "123456", // Reference existing Databricks Workflow Job
"jobParameters": { // Pass parameters to the Job
"param1": "value1",
"runDate": "@pipeline().parameters.ProcessingDate"
}
},
"policy": {
"timeout": "0.12:00:00",
"retry": 2,
"retryIntervalInSeconds": 30
}
}
```
### Benefits of Databricks Job Activity (2025)
1. **Serverless Execution by Default:**
- ✅ No cluster specification needed in linked service
- ✅ Automatically runs on Databricks serverless compute
- ✅ Faster startup times and lower costs
- ✅ Managed infrastructure by Databricks
2. **Advanced Workflow Features:**
   - ✅ **Run As** - Execute jobs as specific users/service principals
   - ✅ **Task Values** - Pass data between tasks within workflow
   - ✅ **Conditional Execution** - If/Else and For Each task types
   - ✅ **AI/BI Tasks** - Model serving endpoints, Power BI semantic models
   - ✅ **Repair Runs** - Rerun failed tasks without reprocessing successful ones
   - ✅ **Notifications/Alerts** - Built-in alerting on job failures
   - ✅ **Git Integration** - Version control for notebooks and code
   - ✅ **DABs Support** - Databricks Asset Bundles for deployment
   - ✅ **Built-in Lineage** - Data lineage tracking across tasks
   - ✅ **Queuing and Concurrent Runs** - Better resource management
3. **Centralized Job Management:**
- Jobs defined once in Databricks workspace
- Single source of truth for all environments
- Versioning through Databricks (Git-backed)
- Consistent across orchestration tools
4. **Better Orchestration:**
- Complex task dependencies within Job
- Multiple heterogeneous tasks (notebook, Python, SQL, Delta Live Tables)
- Job-level monitoring and logging
- Parameter passing between tasks
5. **Improved Reliability:**
- Retry logic at Job and task level
- Better error handling and recovery
- Automatic cluster management
6. **Cost Optimization:**
- Serverless compute (pay only for execution)
- Job clusters (auto-terminating)
- Optimized cluster sizing per task
- Spot instance support
### Implementation
#### 1. Create Databricks Job
```python
# In Databricks workspace
# Create Job with tasks
{
"name": "Data Processing Job",
"tasks": [
{
"task_key": "ingest",
"notebook_task": {
"notebook_path": "/Notebooks/Ingest",
"base_parameters": {}
},
"job_cluster_key": "small_cluster"
},
{
"task_key": "transform",
"depends_on": [{ "task_key": "ingest" }],
"notebook_task": {
"notebook_path": "/Notebooks/Transform"
},
"job_cluster_key": "medium_cluster"
},
{
"task_key": "load",
"depends_on": [{ "task_key": "transform" }],
"notebook_task": {
"notebook_path": "/Notebooks/Load"
},
"job_cluster_key": "small_cluster"
}
],
"job_clusters": [
{
"job_cluster_key": "small_cluster",
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
}
},
{
"job_cluster_key": "medium_cluster",
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS4_v2",
"num_workers": 8
}
}
]
}
# Get Job ID after creation
```
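One way to create the Job and capture its ID for the ADF activity is the Databricks CLI. A rough sketch, assuming the definition above is saved as `job.json` and a recent Databricks CLI is installed and authenticated (flag names differ between CLI versions):
```bash
# Create the Job from job.json and capture the returned job_id,
# which the ADF DatabricksJob activity references as "jobId".
# Requires the new Databricks CLI (v0.2xx) and jq; older CLI versions
# use --json-file instead of --json @file.
JOB_ID=$(databricks jobs create --json @job.json | jq -r '.job_id')
echo "Databricks Job ID: ${JOB_ID}"
```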
#### 2. Create ADF Pipeline with Databricks Job Activity (2025)
```json
{
"name": "PL_Databricks_Serverless_Workflow",
"properties": {
"activities": [
{
"name": "ExecuteDatabricksWorkflow",
"type": "DatabricksJob", // ✅ Correct activity type
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 2,
"retryIntervalInSeconds": 30
},
"typeProperties": {
"jobId": "123456", // Databricks Job ID from workspace
"jobParameters": { // ⚠️ Use jobParameters (not parameters)
"input_path": "/mnt/data/input",
"output_path": "/mnt/data/output",
"run_date": "@pipeline().parameters.runDate",
"environment": "@pipeline().parameters.environment"
}
},
"linkedServiceName": {
"referenceName": "DatabricksLinkedService_Serverless",
"type": "LinkedServiceReference"
}
},
{
"name": "LogJobExecution",
"type": "WebActivity",
"dependsOn": [
{
"activity": "ExecuteDatabricksWorkflow",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"url": "@pipeline().parameters.LoggingEndpoint",
"method": "POST",
"body": {
"jobId": "123456",
"runId": "@activity('ExecuteDatabricksWorkflow').output.runId",
"status": "Succeeded",
"duration": "@activity('ExecuteDatabricksWorkflow').output.executionDuration"
}
}
}
],
"parameters": {
"runDate": {
"type": "string",
"defaultValue": "@utcnow()"
},
"environment": {
"type": "string",
"defaultValue": "production"
},
"LoggingEndpoint": {
"type": "string"
}
}
}
}
```
#### 3. Configure Linked Service (2025 - Serverless)
**✅ RECOMMENDED: Serverless Linked Service (No Cluster Configuration)**
```json
{
"name": "DatabricksLinkedService_Serverless",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-123456789.azuredatabricks.net",
"authentication": "MSI" // ✅ Managed Identity (recommended 2025)
// ⚠️ NO existingClusterId or newClusterNodeType needed for serverless!
// The Databricks Job activity automatically uses serverless compute
}
}
}
```
**Alternative: Access Token Authentication**
```json
{
"name": "DatabricksLinkedService_Token",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-123456789.azuredatabricks.net",
"accessToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "databricks-access-token"
}
}
}
}
```
**🚨 CRITICAL: For Databricks Job activity, DO NOT specify cluster properties in the linked service. The job configuration in Databricks workspace controls compute resources.**
## 🆕 2025 New Connectors and Enhancements
### ServiceNow V2 Connector (RECOMMENDED - V1 End of Support)
**🚨 CRITICAL: ServiceNow V1 connector is at End of Support stage. Migrate to V2 immediately!**
**Key Features of V2:**
- ✅ **Native Query Builder** - Aligns with ServiceNow's condition builder experience
- ✅ **Enhanced Performance** - Optimized data extraction
- ✅ **Better Error Handling** - Improved diagnostics and retry logic
- ✅ **OData Support** - Modern API integration patterns
**Copy Activity Example:**
```json
{
"name": "CopyFromServiceNowV2",
"type": "Copy",
"inputs": [
{
"referenceName": "ServiceNowV2Source",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlSink",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ServiceNowV2Source",
"query": "sysparm_query=active=true^priority=1^sys_created_on>=javascript:gs.dateGenerate('2025-01-01')",
"httpRequestTimeout": "00:01:40" // 100 seconds
},
"sink": {
"type": "AzureSqlSink",
"writeBehavior": "upsert",
"upsertSettings": {
"useTempDB": true,
"keys": ["sys_id"]
}
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
}
}
}
}
```
**Linked Service (OAuth2 - Recommended):**
```json
{
"name": "ServiceNowV2LinkedService",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "ServiceNowV2",
"typeProperties": {
"endpoint": "https://dev12345.service-now.com",
"authenticationType": "OAuth2",
"clientId": "your-oauth-client-id",
"clientSecret": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "servicenow-client-secret"
},
"username": "service-account@company.com",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "servicenow-password"
},
"grantType": "password"
}
}
}
```
**Linked Service (Basic Authentication - Legacy):**
```json
{
"name": "ServiceNowV2LinkedService_Basic",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "ServiceNowV2",
"typeProperties": {
"endpoint": "https://dev12345.service-now.com",
"authenticationType": "Basic",
"username": "admin",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "servicenow-password"
}
}
}
}
```
**Migration from V1 to V2:**
1. Update linked service type from `ServiceNow` to `ServiceNowV2` (the grep sketch after these steps helps locate remaining V1 references)
2. Update source type from `ServiceNowSource` to `ServiceNowV2Source`
3. Test queries in ServiceNow UI's condition builder first
4. Adjust timeout settings if needed (V2 may have different performance)
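Before switching, it helps to inventory where V1 is still referenced. A quick sketch, assuming the standard ADF Git layout with `linkedService/` and `pipeline/` folders:
```bash
# Find remaining ServiceNow V1 references in the repo before migrating.
# The closing quote keeps the match exact, so "ServiceNowV2" is not flagged.
grep -rn '"type": "ServiceNow"' linkedService/
grep -rn '"type": "ServiceNowSource"' pipeline/
```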
### Enhanced PostgreSQL Connector
Improved performance and features:
```json
{
"name": "PostgreSQLLinkedService",
"type": "PostgreSql",
"typeProperties": {
"connectionString": "host=myserver.postgres.database.azure.com;port=5432;database=mydb;uid=myuser",
"password": {
"type": "AzureKeyVaultSecret",
"store": { "referenceName": "KeyVault" },
"secretName": "postgres-password"
},
// 2025 enhancement
"enableSsl": true,
"sslMode": "Require"
}
}
```
### Microsoft Fabric Warehouse Connector (NEW 2025)
**🆕 Native support for Microsoft Fabric Warehouse (Q3 2024+)**
**Supported Activities:**
- ✅ Copy Activity (source and sink)
- ✅ Lookup Activity
- ✅ Get Metadata Activity
- ✅ Script Activity
- ✅ Stored Procedure Activity
**Linked Service Configuration:**
```json
{
"name": "FabricWarehouseLinkedService",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Warehouse", // ✅ NEW dedicated Fabric Warehouse type
"typeProperties": {
"endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
"warehouse": "MyWarehouse",
"authenticationType": "ServicePrincipal", // Recommended
"servicePrincipalId": "<app-registration-id>",
"servicePrincipalKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "fabric-warehouse-sp-key"
},
"tenant": "<tenant-id>"
}
}
}
```
**Alternative: Managed Identity Authentication (Preferred)**
```json
{
"name": "FabricWarehouseLinkedService_ManagedIdentity",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Warehouse",
"typeProperties": {
"endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
"warehouse": "MyWarehouse",
"authenticationType": "SystemAssignedManagedIdentity"
}
}
}
```
**Copy Activity Example:**
```json
{
"name": "CopyToFabricWarehouse",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "FabricWarehouseSink",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource"
},
"sink": {
"type": "WarehouseSink",
"writeBehavior": "insert", // or "upsert"
"writeBatchSize": 10000,
"tableOption": "autoCreate" // Auto-create table if not exists
},
"enableStaging": true, // Recommended for large data
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"path": "staging/fabric-warehouse"
},
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "name": "CustomerID" },
"sink": { "name": "customer_id" }
}
]
}
}
}
```
**Best Practices for Fabric Warehouse:**
- ✅ Use managed identity for authentication (no secret rotation)
- ✅ Enable staging for large data loads (> 1GB)
- ✅ Use `tableOption: autoCreate` for dynamic schema creation
- ✅ Leverage Fabric's lakehouse integration for unified analytics
- ✅ Monitor Fabric capacity units (CU) consumption
### Enhanced Snowflake Connector
Improved performance:
```json
{
"name": "SnowflakeLinkedService",
"type": "Snowflake",
"typeProperties": {
"connectionString": "jdbc:snowflake://myaccount.snowflakecomputing.com",
"database": "mydb",
"warehouse": "mywarehouse",
"authenticationType": "KeyPair",
"username": "myuser",
"privateKey": {
"type": "AzureKeyVaultSecret",
"store": { "referenceName": "KeyVault" },
"secretName": "snowflake-private-key"
},
"privateKeyPassphrase": {
"type": "AzureKeyVaultSecret",
"store": { "referenceName": "KeyVault" },
"secretName": "snowflake-passphrase"
}
}
}
```
## Managed Identity for Azure Storage (2025)
### Azure Table Storage
Now supports system-assigned and user-assigned managed identity:
```json
{
"name": "AzureTableStorageLinkedService",
"type": "AzureTableStorage",
"typeProperties": {
"serviceEndpoint": "https://mystorageaccount.table.core.windows.net",
"authenticationType": "ManagedIdentity" // New in 2025
// Or user-assigned:
// "credential": {
// "referenceName": "UserAssignedManagedIdentity"
// }
}
}
```
### Azure Files
Now supports managed identity authentication:
```json
{
"name": "AzureFilesLinkedService",
"type": "AzureFileStorage",
"typeProperties": {
"fileShare": "myshare",
"accountName": "mystorageaccount",
"authenticationType": "ManagedIdentity" // New in 2025
}
}
```
## Mapping Data Flows - Spark 3.3
Spark 3.3 now powers Mapping Data Flows:
**Performance Improvements:**
- 30% faster data processing
- Improved memory management
- Better partition handling
- Enhanced join performance
**New Features:**
- Adaptive Query Execution (AQE)
- Dynamic partition pruning
- Improved caching
- Better column statistics
```json
{
"name": "DataFlow1",
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"dataset": { "referenceName": "SourceDataset" }
}
],
"transformations": [
{
"name": "Transform1"
}
],
"sinks": [
{
"dataset": { "referenceName": "SinkDataset" }
}
]
}
}
```
## Azure DevOps Server 2022 Support
Git integration now supports on-premises Azure DevOps Server 2022:
```json
{
"name": "DataFactory",
"properties": {
"repoConfiguration": {
"type": "AzureDevOpsGit",
"accountName": "on-prem-ado-server",
"projectName": "MyProject",
"repositoryName": "adf-repo",
"collaborationBranch": "main",
"rootFolder": "/",
"hostName": "https://ado-server.company.com" // On-premises server
}
}
}
```
## 🔐 Managed Identity 2025 Best Practices
### User-Assigned vs System-Assigned Managed Identity
**System-Assigned Managed Identity:**
```json
{
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
"accountKind": "StorageV2"
// ✅ Uses Data Factory's system-assigned identity automatically
}
}
```
**User-Assigned Managed Identity (NEW 2025):**
```json
{
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
"accountKind": "StorageV2",
"credential": {
"referenceName": "UserAssignedManagedIdentityCredential",
"type": "CredentialReference"
}
}
}
```
**When to Use User-Assigned:**
- ✅ Sharing identity across multiple data factories
- ✅ Complex multi-environment setups
- ✅ Granular permission management
- ✅ Identity lifecycle independent of data factory
**Credential Consolidation (NEW 2025):**
ADF now supports a centralized **Credentials** feature:
```json
{
"name": "ManagedIdentityCredential",
"type": "Microsoft.DataFactory/factories/credentials",
"properties": {
"type": "ManagedIdentity",
"typeProperties": {
"resourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity-name}"
}
}
}
```
**Benefits:**
- ✅ Consolidate all Microsoft Entra ID-based credentials in one place
- ✅ Reuse credentials across multiple linked services
- ✅ Centralized permission management
- ✅ Easier audit and compliance tracking
### MFA Enforcement Compatibility (October 2025)
**🚨 IMPORTANT: Azure requires MFA for all users by October 2025**
**Impact on ADF:**
- ✅ **Managed identities are UNAFFECTED** - No MFA required for service accounts
- ✅ Continue using system-assigned and user-assigned identities without changes
- ⚠️ **Interactive user logins affected** - Personal Azure AD accounts need MFA
- ✅ **Service principals with certificate auth** - Recommended alternative to secrets
**Best Practice:**
```json
{
"type": "AzureSqlDatabase",
"typeProperties": {
"server": "myserver.database.windows.net",
"database": "mydb",
"authenticationType": "SystemAssignedManagedIdentity"
// ✅ No MFA needed, no secret rotation, passwordless
}
}
```
### Principle of Least Privilege (2025)
**Storage Blob Data Roles:**
- `Storage Blob Data Reader` - Read-only access (source)
- `Storage Blob Data Contributor` - Read/write access (sink)
- ❌ Avoid `Storage Blob Data Owner` unless needed
**SQL Database Roles:**
```sql
-- Create contained database user for managed identity
CREATE USER [datafactory-name] FROM EXTERNAL PROVIDER;
-- Grant minimal required permissions
ALTER ROLE db_datareader ADD MEMBER [datafactory-name];
ALTER ROLE db_datawriter ADD MEMBER [datafactory-name];
-- ❌ Avoid db_owner unless truly needed
```
**Key Vault Access Policies:**
```json
{
"permissions": {
"secrets": ["Get"] // ✅ Only Get permission needed
// ❌ Don't grant List, Set, Delete unless required
}
}
```
## Best Practices (2025)
1. **Use Databricks Job Activity (MANDATORY):**
- ❌ STOP using Notebook, Python, JAR activities
- ✅ Migrate to DatabricksJob activity immediately
- ✅ Define workflows in Databricks workspace
- ✅ Leverage serverless compute (no cluster config needed)
- ✅ Utilize advanced features (Run As, Task Values, If/Else, Repair Runs)
2. **Managed Identity Authentication (MANDATORY 2025):**
- ✅ Use managed identities for ALL Azure resources
- ✅ Prefer system-assigned for simple scenarios
- ✅ Use user-assigned for shared identity needs
- ✅ Leverage Credentials feature for consolidation
- ✅ MFA-compliant for October 2025 enforcement
- ❌ Avoid access keys and connection strings
- ✅ Store any remaining secrets in Key Vault
3. **Monitor Job Execution:**
- Track Databricks Job run IDs from ADF output
- Log Job parameters for auditability
- Set up alerts for job failures
- Use Databricks job-level monitoring
- Leverage built-in lineage tracking
4. **Optimize Spark 3.3 Usage (Data Flows):**
- Enable Adaptive Query Execution (AQE)
- Use appropriate partition counts (4-8 per core)
- Monitor execution plans in Databricks
- Use broadcast joins for small dimensions
- Implement dynamic partition pruning
## Resources
- [Databricks Job Activity](https://learn.microsoft.com/azure/data-factory/transform-data-using-databricks-spark-job)
- [ADF Connectors](https://learn.microsoft.com/azure/data-factory/connector-overview)
- [Managed Identity Authentication](https://learn.microsoft.com/azure/data-factory/data-factory-service-identity)
- [Mapping Data Flows](https://learn.microsoft.com/azure/data-factory/concepts-data-flow-overview)

View File

@@ -0,0 +1,782 @@
---
name: fabric-onelake-2025
description: Microsoft Fabric Lakehouse, OneLake, and Fabric Warehouse connectors for Azure Data Factory (2025)
---
## 🚨 CRITICAL GUIDELINES
### Windows File Path Requirements
**MANDATORY: Always Use Backslashes on Windows for File Paths**
When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).
**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
### Documentation Guidelines
**NEVER create new documentation files unless explicitly requested by the user.**
- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep repository root clean - only README.md unless user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when user specifically asks for documentation
---
# Microsoft Fabric Integration with Azure Data Factory (2025)
## Overview
Microsoft Fabric represents a unified SaaS analytics platform that combines Power BI, Azure Synapse Analytics, and Azure Data Factory capabilities. Azure Data Factory now provides native connectors for Fabric Lakehouse and Fabric Warehouse, enabling seamless data movement between ADF and Fabric workspaces.
## Microsoft Fabric Lakehouse Connector
The Fabric Lakehouse connector enables both read and write operations to Microsoft Fabric Lakehouse for tables and files.
### Supported Activities
- ✅ Copy Activity (source and sink)
- ✅ Lookup Activity
- ✅ Get Metadata Activity
- ✅ Delete Activity
### Linked Service Configuration
**Using Service Principal Authentication (Recommended):**
```json
{
"name": "FabricLakehouseLinkedService",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Lakehouse",
"typeProperties": {
"workspaceId": "12345678-1234-1234-1234-123456789abc",
"artifactId": "87654321-4321-4321-4321-cba987654321",
"servicePrincipalId": "<app-registration-client-id>",
"servicePrincipalKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "fabric-service-principal-key"
},
"tenant": "<tenant-id>"
}
}
}
```
**Using Managed Identity Authentication (Preferred 2025):**
```json
{
"name": "FabricLakehouseLinkedService_ManagedIdentity",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Lakehouse",
"typeProperties": {
"workspaceId": "12345678-1234-1234-1234-123456789abc",
"artifactId": "87654321-4321-4321-4321-cba987654321"
// Managed identity used automatically - no credentials needed!
}
}
}
```
**Finding Workspace and Artifact IDs:**
1. Navigate to Fabric workspace in browser
2. Copy workspace ID from URL: `https://app.powerbi.com/groups/<workspaceId>/...`
3. Open Lakehouse settings to find artifact ID
4. Or use the Fabric REST API to enumerate workspace items (see the sketch below)
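A hedged sketch of option 4: list the items in a workspace with the Fabric REST API and read the Lakehouse `id` from the response. It assumes the signed-in Azure CLI identity has access to the workspace, that `az account get-access-token` accepts the Fabric resource URI, and that `jq` is available:
```bash
# List items (Lakehouses, Warehouses, pipelines, ...) in a Fabric workspace
# and print their names, types, and IDs (the artifactId used by ADF).
WORKSPACE_ID="12345678-1234-1234-1234-123456789abc"   # from the workspace URL
TOKEN=$(az account get-access-token \
  --resource "https://api.fabric.microsoft.com" \
  --query accessToken -o tsv)
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "https://api.fabric.microsoft.com/v1/workspaces/${WORKSPACE_ID}/items" \
  | jq '.value[] | {displayName, type, id}'
```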
### Dataset Configuration
**For Lakehouse Files:**
```json
{
"name": "FabricLakehouseFiles",
"properties": {
"type": "LakehouseTable",
"linkedServiceName": {
"referenceName": "FabricLakehouseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"table": "Files/raw/sales/2025"
}
}
}
```
**For Lakehouse Tables:**
```json
{
"name": "FabricLakehouseTables",
"properties": {
"type": "LakehouseTable",
"linkedServiceName": {
"referenceName": "FabricLakehouseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"table": "SalesData" // Table name in Lakehouse
}
}
}
```
### Copy Activity Examples
**Copy from Azure SQL to Fabric Lakehouse:**
```json
{
"name": "CopyToFabricLakehouse",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "FabricLakehouseTables",
"type": "DatasetReference",
"parameters": {
"tableName": "DimCustomer"
}
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM dbo.Customers WHERE ModifiedDate > '@{pipeline().parameters.LastRunTime}'"
},
"sink": {
"type": "LakehouseTableSink",
"tableActionOption": "append" // or "overwrite"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "name": "CustomerID" },
"sink": { "name": "customer_id", "type": "Int32" }
},
{
"source": { "name": "CustomerName" },
"sink": { "name": "customer_name", "type": "String" }
}
]
}
}
}
```
**Copy Parquet Files to Fabric Lakehouse:**
```json
{
"name": "CopyParquetToLakehouse",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureBlobParquetFiles",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "FabricLakehouseFiles",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "raw/sales/2025",
"wildcardFileName": "*.parquet"
}
},
"sink": {
"type": "LakehouseFileSink",
"storeSettings": {
"type": "LakehouseWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
```
### Lookup Activity Example
```json
{
"name": "LookupFabricLakehouseTable",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "LakehouseTableSource",
"query": "SELECT MAX(LastUpdated) as MaxDate FROM SalesData"
},
"dataset": {
"referenceName": "FabricLakehouseTables",
"type": "DatasetReference"
}
}
}
```
## Microsoft Fabric Warehouse Connector
The Fabric Warehouse connector provides T-SQL based data warehousing capabilities within the Fabric ecosystem.
### Supported Activities
- ✅ Copy Activity (source and sink)
- ✅ Lookup Activity
- ✅ Get Metadata Activity
- ✅ Script Activity
- ✅ Stored Procedure Activity
### Linked Service Configuration
**Using Service Principal:**
```json
{
"name": "FabricWarehouseLinkedService",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Warehouse",
"typeProperties": {
"endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
"warehouse": "MyWarehouse",
"authenticationType": "ServicePrincipal",
"servicePrincipalId": "<app-registration-id>",
"servicePrincipalKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "fabric-warehouse-sp-key"
},
"tenant": "<tenant-id>"
}
}
}
```
**Using System-Assigned Managed Identity (Recommended):**
```json
{
"name": "FabricWarehouseLinkedService_SystemMI",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Warehouse",
"typeProperties": {
"endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
"warehouse": "MyWarehouse",
"authenticationType": "SystemAssignedManagedIdentity"
}
}
}
```
**Using User-Assigned Managed Identity:**
```json
{
"name": "FabricWarehouseLinkedService_UserMI",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "Warehouse",
"typeProperties": {
"endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
"warehouse": "MyWarehouse",
"authenticationType": "UserAssignedManagedIdentity",
"credential": {
"referenceName": "UserAssignedManagedIdentityCredential",
"type": "CredentialReference"
}
}
}
}
```
### Copy Activity to Fabric Warehouse
**Bulk Insert Pattern:**
```json
{
"name": "CopyToFabricWarehouse",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "FabricWarehouseSink",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM dbo.FactSales WHERE OrderDate >= '@{pipeline().parameters.StartDate}'"
},
"sink": {
"type": "WarehouseSink",
"preCopyScript": "TRUNCATE TABLE staging.FactSales",
"writeBehavior": "insert",
"writeBatchSize": 10000,
"tableOption": "autoCreate", // Auto-create table if doesn't exist
"disableMetricsCollection": false
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"path": "staging/fabric-warehouse",
"enableCompression": true
},
"parallelCopies": 4,
"dataIntegrationUnits": 8
}
}
```
**Upsert Pattern:**
```json
{
"sink": {
"type": "WarehouseSink",
"writeBehavior": "upsert",
"upsertSettings": {
"useTempDB": true,
"keys": ["customer_id"],
"interimSchemaName": "staging"
},
"writeBatchSize": 10000
}
}
```
### Stored Procedure Activity
```json
{
"name": "ExecuteFabricWarehouseStoredProcedure",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "FabricWarehouseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "dbo.usp_ProcessSalesData",
"storedProcedureParameters": {
"StartDate": {
"value": "@pipeline().parameters.StartDate",
"type": "DateTime"
},
"EndDate": {
"value": "@pipeline().parameters.EndDate",
"type": "DateTime"
}
}
}
}
```
### Script Activity
```json
{
"name": "ExecuteFabricWarehouseScript",
"type": "Script",
"linkedServiceName": {
"referenceName": "FabricWarehouseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scripts": [
{
"type": "Query",
"text": "DELETE FROM staging.FactSales WHERE LoadDate < DATEADD(day, -30, GETDATE())"
},
{
"type": "Query",
"text": "UPDATE dbo.FactSales SET ProcessedFlag = 1 WHERE OrderDate = '@{pipeline().parameters.ProcessDate}'"
}
],
"scriptBlockExecutionTimeout": "02:00:00"
}
}
```
## OneLake Integration Patterns
### Pattern 1: Azure Data Lake Gen2 to OneLake via Shortcuts
**Concept:** Use OneLake shortcuts instead of copying data
OneLake shortcuts allow you to reference data in Azure Data Lake Gen2 without physically copying it:
1. In Fabric Lakehouse, create shortcut to ADLS Gen2 container
2. Data appears in OneLake immediately (zero-copy)
3. Use ADF to orchestrate transformations on shortcut data
4. Write results back to OneLake
**Benefits:**
- Zero data duplication
- Real-time data access
- Reduced storage costs
- Single source of truth
**ADF Pipeline Pattern:**
```json
{
"name": "PL_Process_Shortcut_Data",
"activities": [
{
"name": "TransformShortcutData",
"type": "ExecuteDataFlow",
"typeProperties": {
"dataFlow": {
"referenceName": "DF_Transform",
"type": "DataFlowReference"
},
"compute": {
"coreCount": 8,
"computeType": "General"
}
}
},
{
"name": "WriteToCuratedZone",
"type": "Copy",
"typeProperties": {
"source": {
"type": "ParquetSource"
},
"sink": {
"type": "LakehouseTableSink",
"tableActionOption": "overwrite"
}
}
}
]
}
```
### Pattern 2: Incremental Load to Fabric Lakehouse
```json
{
"name": "PL_Incremental_Load_To_Fabric",
"activities": [
{
"name": "GetLastWatermark",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "LakehouseTableSource",
"query": "SELECT MAX(LoadTimestamp) as LastLoad FROM ControlTable"
}
}
},
{
"name": "CopyIncrementalData",
"type": "Copy",
"dependsOn": [
{
"activity": "GetLastWatermark",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE ModifiedDate > '@{activity('GetLastWatermark').output.firstRow.LastLoad}'"
},
"sink": {
"type": "LakehouseTableSink",
"tableActionOption": "append"
}
}
},
{
"name": "UpdateWatermark",
"type": "Script",
"dependsOn": [
{
"activity": "CopyIncrementalData",
"dependencyConditions": ["Succeeded"]
}
],
"linkedServiceName": {
"referenceName": "FabricLakehouseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scripts": [
{
"type": "Query",
"text": "INSERT INTO ControlTable VALUES ('@{utcnow()}')"
}
]
}
}
]
}
```
### Pattern 3: Cross-Platform Pipeline with Invoke Pipeline
**NEW 2025: Invoke Pipeline Activity for Cross-Platform Calls**
```json
{
"name": "PL_ADF_Orchestrates_Fabric_Pipeline",
"activities": [
{
"name": "PrepareDataInADF",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureSqlSource"
},
"sink": {
"type": "LakehouseTableSink"
}
}
},
{
"name": "InvokeFabricPipeline",
"type": "InvokePipeline",
"dependsOn": [
{
"activity": "PrepareDataInADF",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"workspaceId": "12345678-1234-1234-1234-123456789abc",
"pipelineId": "87654321-4321-4321-4321-cba987654321",
"waitOnCompletion": true,
"parameters": {
"processDate": "@pipeline().parameters.RunDate",
"environment": "production"
}
}
}
]
}
```
## Permission Configuration
### Azure Data Factory Managed Identity Permissions in Fabric
**For Fabric Lakehouse:**
1. Open Fabric workspace
2. Go to Workspace settings → Manage access
3. Add ADF managed identity with **Contributor** role
4. Or assign **Workspace Admin** for full access
**For Fabric Warehouse:**
1. Navigate to Warehouse SQL endpoint
2. Execute SQL to create user:
```sql
CREATE USER [your-adf-name] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [your-adf-name];
ALTER ROLE db_datawriter ADD MEMBER [your-adf-name];
```
### Service Principal Permissions
**App Registration Setup:**
1. Register app in Microsoft Entra ID
2. Create client secret and store it in Key Vault (see the sketch after this list)
3. Add app to Fabric workspace with Contributor role
4. For Warehouse, create SQL user as shown above
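Steps 1-2 can be scripted with the Azure CLI. A rough sketch, assuming the signed-in identity can create app registrations and write to the named Key Vault (all names are placeholders); steps 3-4 remain manual:
```bash
# Register the app, create a service principal and client secret, and store
# the secret in Key Vault under the name referenced by the linked service.
APP_ID=$(az ad app create --display-name "adf-fabric-sp" --query appId -o tsv)
az ad sp create --id "${APP_ID}"
SECRET=$(az ad app credential reset --id "${APP_ID}" --query password -o tsv)
az keyvault secret set \
  --vault-name "my-keyvault" \
  --name "fabric-service-principal-key" \
  --value "${SECRET}"
echo "Client ID for the linked service: ${APP_ID}"
```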
## Best Practices (2025)
### 1. Use Managed Identity
- ✅ System-assigned for single ADF
- ✅ User-assigned for multiple ADFs
- ❌ Avoid service principal keys when possible
- ✅ Store any secrets in Key Vault
### 2. Enable Staging for Large Loads
```json
{
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"path": "staging/fabric-loads",
"enableCompression": true
}
}
```
**When to Stage:**
- Data volume > 1 GB
- Complex transformations
- Loading to Fabric Warehouse
- Need better performance
### 3. Leverage OneLake Shortcuts
**Instead of:**
```
ADLS Gen2 → [Copy Activity] → Fabric Lakehouse
```
**Use:**
```
ADLS Gen2 → [OneLake Shortcut] → Direct Access in Fabric
```
**Benefits:**
- No data movement
- Instant availability
- Reduced ADF costs
- Lower storage costs
### 4. Monitor Fabric Capacity Units (CU)
Fabric uses capacity-based pricing. Monitor:
- CU consumption per pipeline run
- Peak usage times
- Throttling events
- Optimize by:
- Using incremental loads
- Scheduling during off-peak
- Right-sizing copy parallelism
### 5. Use Table Option AutoCreate
```json
{
"sink": {
"type": "WarehouseSink",
"tableOption": "autoCreate" // Creates table if missing
}
}
```
**Benefits:**
- No manual schema management
- Automatic type mapping
- Faster development
- Works for dynamic schemas
### 6. Implement Error Handling
```json
{
"activities": [
{
"name": "CopyToFabric",
"type": "Copy",
"policy": {
"retry": 2,
"retryIntervalInSeconds": 30,
"timeout": "0.12:00:00"
}
},
{
"name": "LogFailure",
"type": "WebActivity",
"dependsOn": [
{
"activity": "CopyToFabric",
"dependencyConditions": ["Failed"]
}
],
"typeProperties": {
"url": "@pipeline().parameters.LoggingEndpoint",
"method": "POST",
"body": {
"error": "@activity('CopyToFabric').error.message",
"pipeline": "@pipeline().Pipeline"
}
}
}
]
}
```
## Common Issues and Solutions
### Issue 1: Permission Denied
**Error:** "User does not have permission to access Fabric workspace"
**Solution:**
1. Verify ADF managed identity added to Fabric workspace
2. Check role is **Contributor** or higher
3. For Warehouse, verify SQL user created
4. Allow up to 5 minutes for permission propagation
### Issue 2: Endpoint Not Found
**Error:** "Unable to connect to endpoint"
**Solution:**
1. Verify `workspaceId` and `artifactId` are correct
2. Check Fabric workspace URL in browser
3. Ensure Lakehouse/Warehouse is not paused
4. Verify firewall rules allow ADF IP ranges
### Issue 3: Schema Mismatch
**Error:** "Column types do not match"
**Solution:**
1. Use `tableOption: "autoCreate"` for initial load
2. Explicitly define column mappings in translator
3. Enable staging for complex transformations
4. Use Data Flow for schema evolution
### Issue 4: Performance Degradation
**Symptoms:** Slow copy performance to Fabric
**Solutions:**
1. Enable staging for large datasets
2. Increase `parallelCopies` (try 4-8)
3. Use appropriate `dataIntegrationUnits` (8-32)
4. Check Fabric capacity unit throttling
5. Schedule during off-peak hours
## Resources
- [Fabric Lakehouse Connector](https://learn.microsoft.com/azure/data-factory/connector-microsoft-fabric-lakehouse)
- [Fabric Warehouse Connector](https://learn.microsoft.com/azure/data-factory/connector-microsoft-fabric-warehouse)
- [OneLake Documentation](https://learn.microsoft.com/fabric/onelake/)
- [Fabric Capacity Management](https://learn.microsoft.com/fabric/enterprise/licenses)
- [ADF to Fabric Integration Guide](https://learn.microsoft.com/fabric/data-factory/how-to-ingest-data-into-fabric-from-azure-data-factory)
This comprehensive guide enables seamless integration between Azure Data Factory and Microsoft Fabric's modern data platform capabilities.

View File

@@ -0,0 +1,480 @@
---
name: windows-git-bash-compatibility
description: Windows and Git Bash compatibility guidance for Azure Data Factory development and CI/CD
---
# Windows & Git Bash Compatibility for Azure Data Factory
## Overview
Azure Data Factory development frequently occurs on Windows machines using Git Bash (MINGW64) as the primary shell. This introduces path conversion challenges that can break CI/CD pipelines, npm commands, and deployment scripts.
## Git Bash Path Conversion Behavior
### Automatic Path Conversion
Git Bash (MINGW) automatically converts Unix-style paths to Windows paths (a short demo follows these lists):
**Conversions:**
- `/foo` → `C:/Program Files/Git/usr/foo`
- `/foo:/bar` → `C:\msys64\foo;C:\msys64\bar` (path lists)
- `--dir=/foo` → `--dir=C:/msys64/foo` (arguments)
**What Triggers Conversion:**
- Leading forward slash (`/`) in arguments
- Colon-separated path lists
- Arguments after `-` or `,` with path components
**What's Exempt:**
- Arguments containing `=` (variable assignments)
- Drive specifiers (`C:`)
- Arguments with `;` (already Windows format)
- Arguments starting with `//` (Windows switches)
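The behavior is easy to demonstrate by passing a Unix-style path to a native Windows program from Git Bash; the exact output prefix depends on your Git installation directory:
```bash
# /foo is rewritten before it reaches the native cmd.exe process
# (the // in //c marks a Windows switch and is left alone).
cmd //c echo /foo
# C:/Program Files/Git/foo    <- converted

MSYS_NO_PATHCONV=1 cmd //c echo /foo
# /foo                        <- passed through unchanged
```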
## ADF-Specific Path Issues
### npm Build Commands
**Problem:**
```bash
# This fails in Git Bash due to path conversion
npm run build validate ./adf-resources /subscriptions/abc/resourceGroups/rg/providers/Microsoft.DataFactory/factories/myFactory
# Path gets converted incorrectly
```
**Solution:**
```bash
# Disable path conversion before running
export MSYS_NO_PATHCONV=1
npm run build validate ./adf-resources /subscriptions/abc/resourceGroups/rg/providers/Microsoft.DataFactory/factories/myFactory
# Or wrap the command
MSYS_NO_PATHCONV=1 npm run build export ./adf-resources /subscriptions/.../myFactory "ARMTemplate"
```
### PowerShell Scripts
**Problem:**
```bash
# Calling PowerShell scripts from Git Bash
pwsh ./PrePostDeploymentScript.Ver2.ps1 -armTemplate "./ARMTemplate/ARMTemplateForFactory.json"
# Path conversion may interfere
```
**Solution:**
```bash
# Disable conversion for PowerShell calls
MSYS_NO_PATHCONV=1 pwsh ./PrePostDeploymentScript.Ver2.ps1 -armTemplate "./ARMTemplate/ARMTemplateForFactory.json"
```
### ARM Template Paths
**Problem:**
```bash
# Azure CLI deployment from Git Bash
az deployment group create \
--resource-group myRG \
--template-file ARMTemplate/ARMTemplateForFactory.json # Path may get converted
```
**Solution:**
```bash
# Use relative paths with ./ prefix or absolute Windows paths
export MSYS_NO_PATHCONV=1
az deployment group create \
--resource-group myRG \
--template-file ./ARMTemplate/ARMTemplateForFactory.json
```
## Shell Detection Patterns
### Bash Shell Detection
```bash
#!/usr/bin/env bash
# Method 1: Check $MSYSTEM (Git Bash/MSYS2 specific)
if [ -n "$MSYSTEM" ]; then
echo "Running in Git Bash/MinGW ($MSYSTEM)"
export MSYS_NO_PATHCONV=1
fi
# Method 2: Check uname -s (more portable)
case "$(uname -s)" in
MINGW64*|MINGW32*|MSYS*)
echo "Git Bash detected"
export MSYS_NO_PATHCONV=1
;;
Linux*)
if grep -q Microsoft /proc/version 2>/dev/null; then
echo "WSL detected"
else
echo "Native Linux"
fi
;;
Darwin*)
echo "macOS"
;;
esac
# Method 3: Check $OSTYPE (bash-specific)
case "$OSTYPE" in
msys*)
echo "Git Bash/MSYS"
export MSYS_NO_PATHCONV=1
;;
linux-gnu*)
echo "Linux"
;;
darwin*)
echo "macOS"
;;
esac
```
### Node.js Shell Detection
```javascript
// detect-shell.js - For use in npm scripts or Node tools
function detectShell() {
const env = process.env;
// Git Bash/MinGW (MOST RELIABLE)
if (env.MSYSTEM) {
return {
type: 'mingw',
subsystem: env.MSYSTEM, // MINGW64, MINGW32, or MSYS
needsPathFix: true
};
}
// WSL
if (env.WSL_DISTRO_NAME) {
return {
type: 'wsl',
distro: env.WSL_DISTRO_NAME,
needsPathFix: false
};
}
// PowerShell (3+ paths in PSModulePath)
if (env.PSModulePath?.split(';').length >= 3) {
return {
type: 'powershell',
needsPathFix: false
};
}
// CMD
if (process.platform === 'win32' && env.PROMPT === '$P$G') {
return {
type: 'cmd',
needsPathFix: false
};
}
// Cygwin
if (env.TERM === 'cygwin') {
return {
type: 'cygwin',
needsPathFix: true
};
}
// Unix shells
if (env.SHELL?.includes('bash')) {
return { type: 'bash', needsPathFix: false };
}
if (env.SHELL?.includes('zsh')) {
return { type: 'zsh', needsPathFix: false };
}
return {
type: 'unknown',
platform: process.platform,
needsPathFix: false
};
}
// Usage
const shell = detectShell();
console.log(`Detected shell: ${shell.type}`);
if (shell.needsPathFix) {
process.env.MSYS_NO_PATHCONV = '1';
console.log('Path conversion disabled for Git Bash compatibility');
}
module.exports = { detectShell };
```
### PowerShell Detection
```powershell
# Detect PowerShell edition and version
function Get-ShellInfo {
$info = @{
Edition = $PSVersionTable.PSEdition
Version = $PSVersionTable.PSVersion
OS = $PSVersionTable.OS
Platform = $PSVersionTable.Platform
}
if ($info.Edition -eq 'Core') {
Write-Host "PowerShell Core (pwsh) - Cross-platform compatible" -ForegroundColor Green
$info.CrossPlatform = $true
} else {
Write-Host "Windows PowerShell - Windows only" -ForegroundColor Yellow
$info.CrossPlatform = $false
}
return $info
}
$shellInfo = Get-ShellInfo
```
## CI/CD Pipeline Patterns
### Local Development Scripts
**validate-adf.sh** (Git Bash compatible):
```bash
#!/usr/bin/env bash
set -e
# Detect and handle Git Bash
if [ -n "$MSYSTEM" ]; then
export MSYS_NO_PATHCONV=1
echo "🔧 Git Bash detected - path conversion disabled"
fi
# Configuration
ADF_ROOT="./adf-resources"
FACTORY_ID="/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.DataFactory/factories/${FACTORY_NAME}"
# Validate ADF resources
echo "📋 Validating ADF resources..."
npm run build validate "$ADF_ROOT" "$FACTORY_ID"
# Generate ARM templates
echo "📦 Generating ARM templates..."
npm run build export "$ADF_ROOT" "$FACTORY_ID" "ARMTemplate"
echo "✅ Validation complete"
```
**deploy-adf.sh** (Cross-platform):
```bash
#!/usr/bin/env bash
set -e
# Detect shell
detect_shell() {
if [ -n "$MSYSTEM" ]; then echo "git-bash"
elif [ -n "$WSL_DISTRO_NAME" ]; then echo "wsl"
elif [[ "$OSTYPE" == "darwin"* ]]; then echo "macos"
else echo "linux"
fi
}
SHELL_TYPE=$(detect_shell)
echo "🖥️ Detected shell: $SHELL_TYPE"
# Handle Git Bash
if [ "$SHELL_TYPE" = "git-bash" ]; then
export MSYS_NO_PATHCONV=1
fi
# Download PrePostDeploymentScript
curl -sLo PrePostDeploymentScript.Ver2.ps1 \
https://raw.githubusercontent.com/Azure/Azure-DataFactory/main/SamplesV2/ContinuousIntegrationAndDelivery/PrePostDeploymentScript.Ver2.ps1
# Stop triggers
echo "⏸️ Stopping triggers..."
MSYS_NO_PATHCONV=1 pwsh ./PrePostDeploymentScript.Ver2.ps1 \
-armTemplate "./ARMTemplate/ARMTemplateForFactory.json" \
-ResourceGroupName "$RESOURCE_GROUP" \
-DataFactoryName "$FACTORY_NAME" \
-predeployment $true \
-deleteDeployment $false
# Deploy ARM template
echo "🚀 Deploying ARM template..."
az deployment group create \
--resource-group "$RESOURCE_GROUP" \
--template-file ./ARMTemplate/ARMTemplateForFactory.json \
--parameters ./ARMTemplate/ARMTemplateParametersForFactory.json \
--parameters factoryName="$FACTORY_NAME"
# Start triggers
echo "▶️ Starting triggers..."
MSYS_NO_PATHCONV=1 pwsh ./PrePostDeploymentScript.Ver2.ps1 \
-armTemplate "./ARMTemplate/ARMTemplateForFactory.json" \
-ResourceGroupName "$RESOURCE_GROUP" \
-DataFactoryName "$FACTORY_NAME" \
-predeployment $false \
-deleteDeployment $true
echo "✅ Deployment complete"
```
### package.json with Shell Detection
```json
{
"scripts": {
"prevalidate": "node scripts/detect-shell.js",
"validate": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index validate",
"prebuild": "node scripts/detect-shell.js",
"build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index export"
},
"dependencies": {
"@microsoft/azure-data-factory-utilities": "^1.0.3"
}
}
```
**scripts/detect-shell.js**:
```javascript
const detectShell = () => {
if (process.env.MSYSTEM) {
console.log('🔧 Git Bash detected - disabling path conversion');
process.env.MSYS_NO_PATHCONV = '1';
return 'git-bash';
}
console.log(`🖥️ Shell: ${process.platform}`);
return process.platform;
};
detectShell();
```
## Common Issues and Solutions
### Issue 1: npm build validate fails with "Resource not found"
**Symptom:**
```bash
npm run build validate ./adf-resources /subscriptions/abc/...
# Error: Resource '/subscriptions/C:/Program Files/Git/subscriptions/abc/...' not found
```
**Cause:** Git Bash converted the factory ID path
**Solution:**
```bash
export MSYS_NO_PATHCONV=1
npm run build validate ./adf-resources /subscriptions/abc/...
```
### Issue 2: PowerShell script paths incorrect
**Symptom:**
```bash
pwsh PrePostDeploymentScript.Ver2.ps1 -armTemplate "./ARM/template.json"
# Error: Cannot find path 'C:/Program Files/Git/ARM/template.json'
```
**Cause:** Git Bash converted the ARM template path
**Solution:**
```bash
MSYS_NO_PATHCONV=1 pwsh PrePostDeploymentScript.Ver2.ps1 -armTemplate "./ARM/template.json"
```
### Issue 3: Azure CLI template-file parameter fails
**Symptom:**
```bash
az deployment group create --template-file ./ARMTemplate/file.json
# Error: Template file not found
```
**Cause:** Path conversion interfering with Azure CLI
**Solution:**
```bash
export MSYS_NO_PATHCONV=1
az deployment group create --template-file ./ARMTemplate/file.json
```
## Best Practices
### 1. Set MSYS_NO_PATHCONV in .bashrc
```bash
# Add to ~/.bashrc for Git Bash
if [ -n "$MSYSTEM" ]; then
export MSYS_NO_PATHCONV=1
fi
```
### 2. Create Wrapper Scripts
```bash
#!/usr/bin/env bash
# adf-cli.sh - Wrapper for ADF npm commands
export MSYS_NO_PATHCONV=1
npm run build "$@"
```
### 3. Use Relative Paths with ./
```bash
# Prefer this (less likely to trigger conversion)
./ARMTemplate/ARMTemplateForFactory.json
# Over this
ARMTemplate/ARMTemplateForFactory.json
```
### 4. Document Shell Requirements
```markdown
# README.md
## Development Environment
### Windows Users
- Use Git Bash or PowerShell Core (pwsh)
- Git Bash users: Add `export MSYS_NO_PATHCONV=1` to .bashrc
- Alternative: Use WSL2 for native Linux environment
```
### 5. Test on Multiple Shells
```bash
# Test matrix for Windows developers
- Git Bash (MINGW64)
- PowerShell Core 7+
- WSL2 (Ubuntu/Debian)
- cmd.exe (if applicable)
```
## Quick Reference
| Environment Variable | Purpose | Value |
|---------------------|---------|-------|
| `MSYS_NO_PATHCONV` | Disable all path conversion (Git for Windows) | `1` |
| `MSYS2_ARG_CONV_EXCL` | Exclude specific arguments from conversion (MSYS2; see example below) | `*` or patterns |
| `MSYSTEM` | Current MSYS subsystem | `MINGW64`, `MINGW32`, `MSYS` |
| `WSL_DISTRO_NAME` | WSL distribution name | `Ubuntu`, `Debian`, etc. |
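When conversion should be disabled only for specific arguments rather than globally, `MSYS2_ARG_CONV_EXCL` accepts a semicolon-separated list of prefixes (or `*`). A hedged sketch, since support varies between MSYS2 and Git for Windows builds; fall back to `MSYS_NO_PATHCONV=1` if it has no effect in your shell:
```bash
# Exclude only arguments starting with /subscriptions from conversion,
# leaving normal path handling intact for everything else.
export MSYS2_ARG_CONV_EXCL="/subscriptions"
npm run build validate ./adf-resources \
  /subscriptions/abc/resourceGroups/rg/providers/Microsoft.DataFactory/factories/myFactory
```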
## Resources
- [Git for Windows Path Conversion](https://github.com/git-for-windows/git/wiki/FAQ#path-conversion)
- [MSYS2 Path Conversion](https://www.msys2.org/docs/filesystem-paths/)
- [Azure CLI on Windows](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows)
- [PowerShell Core](https://learn.microsoft.com/en-us/powershell/scripting/install/installing-powershell)
## Summary
**Key Takeaways:**
1. Git Bash automatically converts Unix-style paths to Windows paths
2. Use `export MSYS_NO_PATHCONV=1` to disable conversion
3. Detect shell environment using `$MSYSTEM` variable
4. Test CI/CD scripts on all shells used by your team
5. Use PowerShell Core (pwsh) for cross-platform scripts
6. Add shell detection to local development scripts