---
name: databricks-2025
description: Databricks Job activity and 2025 Azure Data Factory connectors
---

## 🚨 CRITICAL GUIDELINES

### Windows File Path Requirements

**MANDATORY: Always Use Backslashes on Windows for File Paths**

When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).

**Examples:**

- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`

This applies to:

- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems

### Documentation Guidelines

**NEVER create new documentation files unless explicitly requested by the user.**

- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep the repository root clean - only README.md unless the user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid an AI-generated tone
- **User preference**: Only create additional .md files when the user specifically asks for documentation
---

# Azure Data Factory Databricks Integration 2025

## Databricks Job Activity (Recommended 2025)

**🚨 CRITICAL UPDATE (2025):** The Databricks Job activity is now the **only method Microsoft recommends** for orchestrating Databricks workloads in ADF. Migrate away from the legacy Notebook, Python, and JAR activities.

### Why Databricks Job Activity?

**Old Pattern (Notebook Activity - ❌ LEGACY):**

```json
{
  "name": "RunNotebook",
  "type": "DatabricksNotebook", // ❌ DEPRECATED - Migrate to DatabricksJob
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "notebookPath": "/Users/user@example.com/MyNotebook",
    "baseParameters": { "param1": "value1" }
  }
}
```

**New Pattern (Databricks Job Activity - ✅ CURRENT 2025):**

```json
{
  "name": "RunDatabricksWorkflow",
  "type": "DatabricksJob", // ✅ CORRECT activity type (NOT DatabricksSparkJob)
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "jobId": "123456", // Reference an existing Databricks Workflow Job
    "jobParameters": { // Pass parameters to the Job
      "param1": "value1",
      "runDate": "@pipeline().parameters.ProcessingDate"
    }
  },
  "policy": {
    "timeout": "0.12:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 30
  }
}
```
### Benefits of Databricks Job Activity (2025)

1. **Serverless Execution by Default:**
   - ✅ No cluster specification needed in the linked service
   - ✅ Automatically runs on Databricks serverless compute
   - ✅ Faster startup times and lower costs
   - ✅ Infrastructure managed by Databricks

2. **Advanced Workflow Features:**
   - ✅ **Run As** - Execute jobs as specific users/service principals
   - ✅ **Task Values** - Pass data between tasks within a workflow (see the sketch after this list)
   - ✅ **Conditional Execution** - If/Else and For Each task types
   - ✅ **AI/BI Tasks** - Model serving endpoints, Power BI semantic models
   - ✅ **Repair Runs** - Rerun failed tasks without reprocessing successful ones
   - ✅ **Notifications/Alerts** - Built-in alerting on job failures
   - ✅ **Git Integration** - Version control for notebooks and code
   - ✅ **DABs Support** - Databricks Asset Bundles for deployment
   - ✅ **Built-in Lineage** - Data lineage tracking across tasks
   - ✅ **Queuing and Concurrent Runs** - Better resource management

3. **Centralized Job Management:**
   - Jobs defined once in the Databricks workspace
   - Single source of truth for all environments
   - Versioning through Databricks (Git-backed)
   - Consistent across orchestration tools

4. **Better Orchestration:**
   - Complex task dependencies within a Job
   - Multiple heterogeneous tasks (notebook, Python, SQL, Delta Live Tables)
   - Job-level monitoring and logging
   - Parameter passing between tasks

5. **Improved Reliability:**
   - Retry logic at Job and task level
   - Better error handling and recovery
   - Automatic cluster management

6. **Cost Optimization:**
   - Serverless compute (pay only for execution)
   - Job clusters (auto-terminating)
   - Optimized cluster sizing per task
   - Spot instance support
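Task values are set and read with `dbutils.jobs.taskValues` inside the notebooks that make up the job. A minimal sketch (task and key names are illustrative):

```python
# In the upstream task (task_key = "ingest"): publish a value for later tasks
row_count = 1250  # e.g. computed from the ingested DataFrame
dbutils.jobs.taskValues.set(key="row_count", value=row_count)

# In a downstream task: read the value set by "ingest"
# debugValue is returned when the notebook runs outside a job (e.g. interactively)
row_count = dbutils.jobs.taskValues.get(
    taskKey="ingest", key="row_count", default=0, debugValue=0
)
print(f"Upstream ingested {row_count} rows")
```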
### Implementation

#### 1. Create Databricks Job

```json
// Job definition (Databricks Jobs API 2.1 format), created in the Databricks workspace
{
  "name": "Data Processing Job",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "/Notebooks/Ingest",
        "base_parameters": {}
      },
      "job_cluster_key": "small_cluster"
    },
    {
      "task_key": "transform",
      "depends_on": [{ "task_key": "ingest" }],
      "notebook_task": {
        "notebook_path": "/Notebooks/Transform"
      },
      "job_cluster_key": "medium_cluster"
    },
    {
      "task_key": "load",
      "depends_on": [{ "task_key": "transform" }],
      "notebook_task": {
        "notebook_path": "/Notebooks/Load"
      },
      "job_cluster_key": "small_cluster"
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "small_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    },
    {
      "job_cluster_key": "medium_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "num_workers": 8
      }
    }
  ]
}
```
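To get the Job ID after creation, you can create the job through the Jobs REST API and read `job_id` from the response. A sketch (assumes `job_definition` holds the JSON above and `token` is a Databricks personal access token):

```python
import requests

DATABRICKS_HOST = "https://adb-123456789.azuredatabricks.net"

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_definition,
)
resp.raise_for_status()

# The returned job_id is what the ADF Databricks Job activity references as "jobId"
print(resp.json()["job_id"])
```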
#### 2. Create ADF Pipeline with Databricks Job Activity (2025)

```json
{
  "name": "PL_Databricks_Serverless_Workflow",
  "properties": {
    "activities": [
      {
        "name": "ExecuteDatabricksWorkflow",
        "type": "DatabricksJob", // ✅ Correct activity type
        "dependsOn": [],
        "policy": {
          "timeout": "0.12:00:00",
          "retry": 2,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "jobId": "123456", // Databricks Job ID from the workspace
          "jobParameters": { // ⚠️ Use jobParameters (not parameters)
            "input_path": "/mnt/data/input",
            "output_path": "/mnt/data/output",
            "run_date": "@pipeline().parameters.runDate",
            "environment": "@pipeline().parameters.environment"
          }
        },
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService_Serverless",
          "type": "LinkedServiceReference"
        }
      },
      {
        "name": "LogJobExecution",
        "type": "WebActivity",
        "dependsOn": [
          {
            "activity": "ExecuteDatabricksWorkflow",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "typeProperties": {
          "url": "@pipeline().parameters.LoggingEndpoint",
          "method": "POST",
          "body": {
            "jobId": "123456",
            "runId": "@activity('ExecuteDatabricksWorkflow').output.runId",
            "status": "Succeeded",
            "duration": "@activity('ExecuteDatabricksWorkflow').output.executionDuration"
          }
        }
      }
    ],
    "parameters": {
      "runDate": {
        "type": "string",
        "defaultValue": "" // ⚠️ Parameter defaults are literal strings - expressions such as @utcnow() are NOT evaluated here; pass the run date from the trigger instead
      },
      "environment": {
        "type": "string",
        "defaultValue": "production"
      },
      "LoggingEndpoint": {
        "type": "string"
      }
    }
  }
}
```

#### 3. Configure Linked Service (2025 - Serverless)

**✅ RECOMMENDED: Serverless Linked Service (No Cluster Configuration)**

```json
{
  "name": "DatabricksLinkedService_Serverless",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "authentication": "MSI", // ✅ Managed Identity (recommended 2025)
      "workspaceResourceId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace>" // Required with MSI authentication
      // ⚠️ NO existingClusterId or newClusterNodeType needed for serverless!
      // The Databricks Job activity automatically uses serverless compute
    }
  }
}
```

**Alternative: Access Token Authentication**

```json
{
  "name": "DatabricksLinkedService_Token",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "databricks-access-token"
      }
    }
  }
}
```

**🚨 CRITICAL: For the Databricks Job activity, DO NOT specify cluster properties in the linked service. The job configuration in the Databricks workspace controls compute resources.**
## 🆕 2025 New Connectors and Enhancements

### ServiceNow V2 Connector (RECOMMENDED - V1 End of Support)

**🚨 CRITICAL: The ServiceNow V1 connector is at the End of Support stage. Migrate to V2 immediately!**

**Key Features of V2:**
- ✅ **Native Query Builder** - Aligns with ServiceNow's condition builder experience
- ✅ **Enhanced Performance** - Optimized data extraction
- ✅ **Better Error Handling** - Improved diagnostics and retry logic
- ✅ **OData Support** - Modern API integration patterns

**Copy Activity Example:**

```json
{
  "name": "CopyFromServiceNowV2",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "ServiceNowV2Source",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "AzureSqlSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "ServiceNowV2Source",
      "query": "sysparm_query=active=true^priority=1^sys_created_on>=javascript:gs.dateGenerate('2025-01-01')",
      "httpRequestTimeout": "00:01:40" // 100 seconds
    },
    "sink": {
      "type": "AzureSqlSink",
      "writeBehavior": "upsert",
      "upsertSettings": {
        "useTempDB": true,
        "keys": ["sys_id"]
      }
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
```

**Linked Service (OAuth2 - Recommended):**

```json
{
  "name": "ServiceNowV2LinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "OAuth2",
      "clientId": "your-oauth-client-id",
      "clientSecret": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-client-secret"
      },
      "username": "service-account@company.com",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      },
      "grantType": "password"
    }
  }
}
```

**Linked Service (Basic Authentication - Legacy):**

```json
{
  "name": "ServiceNowV2LinkedService_Basic",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "Basic",
      "username": "admin",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      }
    }
  }
}
```

**Migration from V1 to V2:**
1. Update the linked service type from `ServiceNow` to `ServiceNowV2` (see the sketch below)
2. Update the source type from `ServiceNowSource` to `ServiceNowV2Source`
3. Test queries in ServiceNow UI's condition builder first
4. Adjust timeout settings if needed (V2 may perform differently)
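A minimal before/after sketch of the two type renames (all other properties shown above stay the same):

```json
// ❌ Before (V1)
{ "properties": { "type": "ServiceNow" } }
{ "source": { "type": "ServiceNowSource" } }

// ✅ After (V2)
{ "properties": { "type": "ServiceNowV2" } }
{ "source": { "type": "ServiceNowV2Source" } }
```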
### Enhanced PostgreSQL Connector

Improved performance and features:

```json
{
  "name": "PostgreSQLLinkedService",
  "type": "PostgreSql",
  "typeProperties": {
    "connectionString": "host=myserver.postgres.database.azure.com;port=5432;database=mydb;uid=myuser",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "postgres-password"
    },
    // 2025 enhancement
    "enableSsl": true,
    "sslMode": "Require"
  }
}
```
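In a copy activity, the matching source type is `PostgreSqlSource`; a small sketch (the table and columns are hypothetical):

```json
"source": {
  "type": "PostgreSqlSource",
  "query": "SELECT id, name, updated_at FROM public.customers WHERE updated_at >= '2025-01-01'"
}
```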
### Microsoft Fabric Warehouse Connector (NEW 2025)

**🆕 Native support for Microsoft Fabric Warehouse (Q3 2024+)**

**Supported Activities:**
- ✅ Copy Activity (source and sink)
- ✅ Lookup Activity
- ✅ Get Metadata Activity
- ✅ Script Activity (sketch after the copy example below)
- ✅ Stored Procedure Activity

**Linked Service Configuration:**

```json
{
  "name": "FabricWarehouseLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse", // ✅ NEW dedicated Fabric Warehouse type
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "ServicePrincipal", // Recommended
      "servicePrincipalId": "<app-registration-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "fabric-warehouse-sp-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
```

**Alternative: Managed Identity Authentication (Preferred)**

```json
{
  "name": "FabricWarehouseLinkedService_ManagedIdentity",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "SystemAssignedManagedIdentity"
    }
  }
}
```

**Copy Activity Example:**

```json
{
  "name": "CopyToFabricWarehouse",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "AzureSqlSource",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "FabricWarehouseSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource"
    },
    "sink": {
      "type": "WarehouseSink",
      "writeBehavior": "insert", // or "upsert"
      "writeBatchSize": 10000,
      "tableOption": "autoCreate" // Auto-create the table if it does not exist
    },
    "enableStaging": true, // Recommended for large data
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      },
      "path": "staging/fabric-warehouse"
    },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        {
          "source": { "name": "CustomerID" },
          "sink": { "name": "customer_id" }
        }
      ]
    }
  }
}
```
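The Script activity can run T-SQL directly against the warehouse through the same linked service. A minimal sketch (the table name is hypothetical):

```json
{
  "name": "TruncateStagingTable",
  "type": "Script",
  "linkedServiceName": {
    "referenceName": "FabricWarehouseLinkedService_ManagedIdentity",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "scripts": [
      {
        "type": "NonQuery", // Use "Query" for statements that return rows
        "text": "TRUNCATE TABLE dbo.staging_orders;"
      }
    ]
  }
}
```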
**Best Practices for Fabric Warehouse:**
- ✅ Use managed identity for authentication (no secret rotation)
- ✅ Enable staging for large data loads (> 1 GB)
- ✅ Use `tableOption: autoCreate` for dynamic schema creation
- ✅ Leverage Fabric's lakehouse integration for unified analytics
- ✅ Monitor Fabric capacity unit (CU) consumption

### Enhanced Snowflake Connector

Improved performance:

```json
{
  "name": "SnowflakeLinkedService",
  "type": "Snowflake",
  "typeProperties": {
    "connectionString": "jdbc:snowflake://myaccount.snowflakecomputing.com",
    "database": "mydb",
    "warehouse": "mywarehouse",
    "authenticationType": "KeyPair",
    "username": "myuser",
    "privateKey": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-private-key"
    },
    "privateKeyPassphrase": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-passphrase"
    }
  }
}
```
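Key-pair authentication assumes the public key has been registered on the Snowflake user the linked service connects as. A sketch of the Snowflake side (generate the RSA key pair locally, e.g. with OpenSSL, store the private key in Key Vault, then register the public key):

```sql
-- Register the public key on the user referenced by the linked service
-- (paste the base64 public key body without the PEM header/footer lines)
ALTER USER myuser SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';

-- Verify the registered key
DESCRIBE USER myuser;
```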
## Managed Identity for Azure Storage (2025)

### Azure Table Storage

Now supports system-assigned and user-assigned managed identity:

```json
{
  "name": "AzureTableStorageLinkedService",
  "type": "AzureTableStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.table.core.windows.net",
    "authenticationType": "ManagedIdentity" // New in 2025
    // Or user-assigned:
    // "credential": {
    //   "referenceName": "UserAssignedManagedIdentity",
    //   "type": "CredentialReference"
    // }
  }
}
```

### Azure Files

Now supports managed identity authentication:

```json
{
  "name": "AzureFilesLinkedService",
  "type": "AzureFileStorage",
  "typeProperties": {
    "fileShare": "myshare",
    "accountName": "mystorageaccount",
    "authenticationType": "ManagedIdentity" // New in 2025
  }
}
```
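In both cases, the data factory's managed identity must also be granted an appropriate data-plane RBAC role on the storage account (for example, `Storage Table Data Contributor` for Table Storage) before the linked service can authenticate.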

## Mapping Data Flows - Spark 3.3

Spark 3.3 now powers Mapping Data Flows:

**Performance Improvements:**
- Up to 30% faster data processing
- Improved memory management
- Better partition handling
- Enhanced join performance

**New Features:**
- Adaptive Query Execution (AQE)
- Dynamic partition pruning
- Improved caching
- Better column statistics

```json
{
  "name": "DataFlow1",
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": { "referenceName": "SourceDataset" }
      }
    ],
    "transformations": [
      {
        "name": "Transform1"
      }
    ],
    "sinks": [
      {
        "dataset": { "referenceName": "SinkDataset" }
      }
    ]
  }
}
```

## Azure DevOps Server 2022 Support

Git integration now supports on-premises Azure DevOps Server 2022:

```json
{
  "name": "DataFactory",
  "properties": {
    "repoConfiguration": {
      "type": "AzureDevOpsGit",
      "accountName": "on-prem-ado-server",
      "projectName": "MyProject",
      "repositoryName": "adf-repo",
      "collaborationBranch": "main",
      "rootFolder": "/",
      "hostName": "https://ado-server.company.com" // On-premises server
    }
  }
}
```
## 🔐 Managed Identity 2025 Best Practices

### User-Assigned vs System-Assigned Managed Identity

**System-Assigned Managed Identity:**

```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2"
    // ✅ Uses Data Factory's system-assigned identity automatically
  }
}
```

**User-Assigned Managed Identity (NEW 2025):**

```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2",
    "credential": {
      "referenceName": "UserAssignedManagedIdentityCredential",
      "type": "CredentialReference"
    }
  }
}
```

**When to Use User-Assigned:**
- ✅ Sharing an identity across multiple data factories
- ✅ Complex multi-environment setups
- ✅ Granular permission management
- ✅ Identity lifecycle independent of the data factory

**Credential Consolidation (NEW 2025):**

ADF now supports a centralized **Credentials** feature:

```json
{
  "name": "ManagedIdentityCredential",
  "type": "Microsoft.DataFactory/factories/credentials",
  "properties": {
    "type": "ManagedIdentity",
    "typeProperties": {
      "resourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity-name}"
    }
  }
}
```

**Benefits:**
- ✅ Consolidate all Microsoft Entra ID-based credentials in one place
- ✅ Reuse credentials across multiple linked services
- ✅ Centralized permission management
- ✅ Easier audit and compliance tracking

### MFA Enforcement Compatibility (October 2025)

**🚨 IMPORTANT: Azure requires MFA for all users by October 2025**

**Impact on ADF:**
- ✅ **Managed identities are UNAFFECTED** - No MFA required for service accounts
- ✅ Continue using system-assigned and user-assigned identities without changes
- ❌ **Interactive user logins affected** - Personal Azure AD accounts need MFA
- ✅ **Service principals with certificate auth** - Recommended alternative to secrets

**Best Practice:**

```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "myserver.database.windows.net",
    "database": "mydb",
    "authenticationType": "SystemAssignedManagedIdentity"
    // ✅ No MFA needed, no secret rotation, passwordless
  }
}
```

### Principle of Least Privilege (2025)

**Storage Blob Data Roles:**
- `Storage Blob Data Reader` - Read-only access (source)
- `Storage Blob Data Contributor` - Read/write access (sink)
- ❌ Avoid `Storage Blob Data Owner` unless needed

**SQL Database Roles:**

```sql
-- Create a contained database user for the managed identity
CREATE USER [datafactory-name] FROM EXTERNAL PROVIDER;

-- Grant minimal required permissions
ALTER ROLE db_datareader ADD MEMBER [datafactory-name];
ALTER ROLE db_datawriter ADD MEMBER [datafactory-name];

-- ❌ Avoid db_owner unless truly needed
```

**Key Vault Access Policies:**

```json
{
  "permissions": {
    "secrets": ["Get"] // ✅ Only the Get permission is needed
    // ❌ Don't grant List, Set, Delete unless required
  }
}
```
## Best Practices (2025)

1. **Use the Databricks Job Activity (MANDATORY):**
   - ❌ STOP using the Notebook, Python, and JAR activities
   - ✅ Migrate to the DatabricksJob activity immediately
   - ✅ Define workflows in the Databricks workspace
   - ✅ Leverage serverless compute (no cluster config needed)
   - ✅ Utilize advanced features (Run As, Task Values, If/Else, Repair Runs)

2. **Managed Identity Authentication (MANDATORY 2025):**
   - ✅ Use managed identities for ALL Azure resources
   - ✅ Prefer system-assigned for simple scenarios
   - ✅ Use user-assigned for shared identity needs
   - ✅ Leverage the Credentials feature for consolidation
   - ✅ MFA-compliant for the October 2025 enforcement
   - ❌ Avoid access keys and connection strings
   - ✅ Store any remaining secrets in Key Vault

3. **Monitor Job Execution:**
   - Track Databricks Job run IDs from the ADF activity output
   - Log Job parameters for auditability
   - Set up alerts for job failures
   - Use Databricks job-level monitoring
   - Leverage built-in lineage tracking

4. **Optimize Spark 3.3 Usage (Data Flows):**
   - Enable Adaptive Query Execution (AQE)
   - Use appropriate partition counts (4-8 per core)
   - Monitor execution plans in Databricks
   - Use broadcast joins for small dimensions
   - Implement dynamic partition pruning

## Resources

- [Databricks Job Activity](https://learn.microsoft.com/azure/data-factory/transform-data-using-databricks-spark-job)
- [ADF Connectors](https://learn.microsoft.com/azure/data-factory/connector-overview)
- [Managed Identity Authentication](https://learn.microsoft.com/azure/data-factory/data-factory-service-identity)
- [Mapping Data Flows](https://learn.microsoft.com/azure/data-factory/concepts-data-flow-overview)