---
name: databricks-2025
description: Databricks Job activity and 2025 Azure Data Factory connectors
---

## 🚨 CRITICAL GUIDELINES

### Windows File Path Requirements

**MANDATORY: Always Use Backslashes on Windows for File Paths**

When using the Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).

**Examples:**

- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`

This applies to:

- the Edit tool `file_path` parameter
- the Write tool `file_path` parameter
- all file operations on Windows systems

### Documentation Guidelines

**NEVER create new documentation files unless explicitly requested by the user.**

- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep the repository root clean - only README.md unless the user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid an AI-generated tone
- **User preference**: Only create additional .md files when the user specifically asks for documentation

---

# Azure Data Factory Databricks Integration 2025

## Databricks Job Activity (Recommended 2025)

**🚨 CRITICAL UPDATE (2025):** The Databricks Job activity is now the **only recommended method** for orchestrating Databricks from ADF. Microsoft strongly recommends migrating off the legacy Notebook, Python, and JAR activities.

### Why Databricks Job Activity?

**Old Pattern (Notebook Activity - ❌ LEGACY):**

```json
{
  "name": "RunNotebook",
  "type": "DatabricksNotebook", // ❌ LEGACY - migrate to DatabricksJob
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "notebookPath": "/Users/user@example.com/MyNotebook",
    "baseParameters": { "param1": "value1" }
  }
}
```

**New Pattern (Databricks Job Activity - ✅ CURRENT 2025):**

```json
{
  "name": "RunDatabricksWorkflow",
  "type": "DatabricksJob", // ✅ Correct activity type (NOT DatabricksSparkJob)
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "jobId": "123456", // Reference an existing Databricks Workflow Job
    "jobParameters": { // Pass parameters to the Job
      "param1": "value1",
      "runDate": "@pipeline().parameters.ProcessingDate"
    }
  },
  "policy": {
    "timeout": "0.12:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 30
  }
}
```

### Benefits of Databricks Job Activity (2025)

1. **Serverless Execution by Default:**
   - ✅ No cluster specification needed in the linked service
   - ✅ Automatically runs on Databricks serverless compute
   - ✅ Faster startup times and lower costs
   - ✅ Infrastructure managed by Databricks

2. **Advanced Workflow Features:**
   - ✅ **Run As** - Execute jobs as specific users/service principals
   - ✅ **Task Values** - Pass data between tasks within a workflow (see the sketch after this list)
   - ✅ **Conditional Execution** - If/Else and For Each task types
   - ✅ **AI/BI Tasks** - Model serving endpoints, Power BI semantic models
   - ✅ **Repair Runs** - Rerun failed tasks without reprocessing successful ones
   - ✅ **Notifications/Alerts** - Built-in alerting on job failures
   - ✅ **Git Integration** - Version control for notebooks and code
   - ✅ **DABs Support** - Databricks Asset Bundles for deployment
   - ✅ **Built-in Lineage** - Data lineage tracking across tasks
   - ✅ **Queuing and Concurrent Runs** - Better resource management

3. **Centralized Job Management:**
   - Jobs defined once in the Databricks workspace
   - Single source of truth for all environments
   - Versioning through Databricks (Git-backed)
   - Consistent across orchestration tools

4. **Better Orchestration:**
   - Complex task dependencies within a Job
   - Multiple heterogeneous tasks (notebook, Python, SQL, Delta Live Tables)
   - Job-level monitoring and logging
   - Parameter passing between tasks

5. **Improved Reliability:**
   - Retry logic at the Job and task level
   - Better error handling and recovery
   - Automatic cluster management

6. **Cost Optimization:**
   - Serverless compute (pay only for execution)
   - Job clusters (auto-terminating)
   - Optimized cluster sizing per task
   - Spot instance support

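Task Values are set in one task and read in a downstream task of the same job run. A minimal sketch of the pattern, assuming two notebook tasks with task keys `ingest` and `transform` (matching the example job defined later in this document):

```python
# In the "ingest" task's notebook: publish a value for downstream tasks.
# dbutils is provided by the Databricks notebook runtime.
dbutils.jobs.taskValues.set(key="row_count", value=42)

# In the "transform" task's notebook: read the value published by "ingest".
# debugValue is returned only when the notebook runs interactively.
row_count = dbutils.jobs.taskValues.get(
    taskKey="ingest", key="row_count", default=0, debugValue=0
)
print(f"Rows ingested upstream: {row_count}")
```
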
### Implementation

#### 1. Create Databricks Job

Define the Job (tasks plus job clusters) in the Databricks workspace:

```json
{
  "name": "Data Processing Job",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "/Notebooks/Ingest",
        "base_parameters": {}
      },
      "job_cluster_key": "small_cluster"
    },
    {
      "task_key": "transform",
      "depends_on": [{ "task_key": "ingest" }],
      "notebook_task": {
        "notebook_path": "/Notebooks/Transform"
      },
      "job_cluster_key": "medium_cluster"
    },
    {
      "task_key": "load",
      "depends_on": [{ "task_key": "transform" }],
      "notebook_task": {
        "notebook_path": "/Notebooks/Load"
      },
      "job_cluster_key": "small_cluster"
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "small_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    },
    {
      "job_cluster_key": "medium_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "num_workers": 8
      }
    }
  ]
}
```

Note the Job ID after creation; the ADF pipeline references it.

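If you script the setup, the Job ID can be captured at creation time. A minimal sketch using the Databricks Jobs API 2.1 (`POST /api/2.1/jobs/create`); the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables are illustrative assumptions:

```python
import os

import requests

# The job specification from step 1, as a Python dict (abbreviated here).
job_spec = {
    "name": "Data Processing Job",
    # ... "tasks" and "job_clusters" as defined above ...
}

# Jobs API 2.1: create the job and return its numeric job_id.
response = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
    timeout=60,
)
response.raise_for_status()
job_id = response.json()["job_id"]
print(job_id)  # Use this value as "jobId" in the ADF DatabricksJob activity
```
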
#### 2. Create ADF Pipeline with Databricks Job Activity (2025)

```json
{
  "name": "PL_Databricks_Serverless_Workflow",
  "properties": {
    "activities": [
      {
        "name": "ExecuteDatabricksWorkflow",
        "type": "DatabricksJob", // ✅ Correct activity type
        "dependsOn": [],
        "policy": {
          "timeout": "0.12:00:00",
          "retry": 2,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "jobId": "123456", // Databricks Job ID from the workspace
          "jobParameters": { // ⚠️ Use jobParameters (not parameters)
            "input_path": "/mnt/data/input",
            "output_path": "/mnt/data/output",
            "run_date": "@pipeline().parameters.runDate",
            "environment": "@pipeline().parameters.environment"
          }
        },
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService_Serverless",
          "type": "LinkedServiceReference"
        }
      },
      {
        "name": "LogJobExecution",
        "type": "WebActivity",
        "dependsOn": [
          {
            "activity": "ExecuteDatabricksWorkflow",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "typeProperties": {
          "url": "@pipeline().parameters.LoggingEndpoint",
          "method": "POST",
          "body": {
            "jobId": "123456",
            "runId": "@activity('ExecuteDatabricksWorkflow').output.runId",
            "status": "Succeeded",
            "duration": "@activity('ExecuteDatabricksWorkflow').output.executionDuration"
          }
        }
      }
    ],
    "parameters": {
      "runDate": {
        "type": "string",
        "defaultValue": "@utcnow()" // ⚠️ Parameter defaults are literal strings, not evaluated; pass the real date from the trigger
      },
      "environment": {
        "type": "string",
        "defaultValue": "production"
      },
      "LoggingEndpoint": {
        "type": "string"
      }
    }
  }
}
```

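Inside the Databricks job, job-level parameters supplied by ADF via `jobParameters` are surfaced to notebook tasks as widgets. A minimal sketch of reading them in a notebook behind one of the job's tasks (the parameter names match the `jobParameters` above; the read/write paths are illustrative):

```python
# In a notebook task of the referenced Databricks Job: job parameters
# (including those passed from ADF) arrive as notebook widgets.
input_path = dbutils.widgets.get("input_path")
output_path = dbutils.widgets.get("output_path")
run_date = dbutils.widgets.get("run_date")

# spark and dbutils are provided by the Databricks notebook runtime.
df = spark.read.parquet(input_path)
df.write.mode("overwrite").parquet(f"{output_path}/run_date={run_date}")
```
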
#### 3. Configure Linked Service (2025 - Serverless)

**✅ RECOMMENDED: Serverless Linked Service (No Cluster Configuration)**

```json
{
  "name": "DatabricksLinkedService_Serverless",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "authentication": "MSI" // ✅ Managed Identity (recommended 2025)
      // ⚠️ NO existingClusterId or newClusterNodeType needed for serverless!
      // The Databricks Job activity automatically uses serverless compute
    }
  }
}
```

**Alternative: Access Token Authentication**

```json
{
  "name": "DatabricksLinkedService_Token",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "databricks-access-token"
      }
    }
  }
}
```

**🚨 CRITICAL: For the Databricks Job activity, DO NOT specify cluster properties in the linked service. The job configuration in the Databricks workspace controls compute resources.**

## 🆕 2025 New Connectors and Enhancements

### ServiceNow V2 Connector (RECOMMENDED - V1 End of Support)

**🚨 CRITICAL: The ServiceNow V1 connector is at the End of Support stage. Migrate to V2 immediately!**

**Key Features of V2:**

- ✅ **Native Query Builder** - Aligns with ServiceNow's condition builder experience
- ✅ **Enhanced Performance** - Optimized data extraction
- ✅ **Better Error Handling** - Improved diagnostics and retry logic
- ✅ **OData Support** - Modern API integration patterns

**Copy Activity Example:**

```json
{
  "name": "CopyFromServiceNowV2",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "ServiceNowV2Source",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "AzureSqlSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "ServiceNowV2Source",
      "query": "sysparm_query=active=true^priority=1^sys_created_on>=javascript:gs.dateGenerate('2025-01-01')",
      "httpRequestTimeout": "00:01:40" // 100 seconds
    },
    "sink": {
      "type": "AzureSqlSink",
      "writeBehavior": "upsert",
      "upsertSettings": {
        "useTempDB": true,
        "keys": ["sys_id"]
      }
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
```

**Linked Service (OAuth2 - Recommended):**

```json
{
  "name": "ServiceNowV2LinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "OAuth2",
      "clientId": "your-oauth-client-id",
      "clientSecret": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-client-secret"
      },
      "username": "service-account@company.com",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      },
      "grantType": "password"
    }
  }
}
```

**Linked Service (Basic Authentication - Legacy):**

```json
{
  "name": "ServiceNowV2LinkedService_Basic",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "Basic",
      "username": "admin",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      }
    }
  }
}
```

**Migration from V1 to V2:**

1. Update the linked service type from `ServiceNow` to `ServiceNowV2`
2. Update the source type from `ServiceNowSource` to `ServiceNowV2Source`
3. Test queries in the ServiceNow UI's condition builder first (or against the REST API, as sketched below)
4. Adjust timeout settings if needed (V2 may perform differently)

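A quick way to validate a `sysparm_query` before wiring it into the Copy activity is to run it against the ServiceNow Table API directly. A minimal sketch; the instance URL, credentials, and `incident` table are illustrative assumptions:

```python
import requests

INSTANCE = "https://dev12345.service-now.com"
QUERY = "active=true^priority=1"

# Table API: returns matching rows as JSON under "result".
response = requests.get(
    f"{INSTANCE}/api/now/table/incident",
    params={"sysparm_query": QUERY, "sysparm_limit": 5},
    auth=("admin", "your-password"),  # placeholder; use a service account
    headers={"Accept": "application/json"},
    timeout=100,
)
response.raise_for_status()
for record in response.json()["result"]:
    print(record["sys_id"], record.get("short_description"))
```
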
### Enhanced PostgreSQL Connector

Improved performance and features:

```json
{
  "name": "PostgreSQLLinkedService",
  "type": "PostgreSql",
  "typeProperties": {
    "connectionString": "host=myserver.postgres.database.azure.com;port=5432;database=mydb;uid=myuser",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "postgres-password"
    },
    // 2025 enhancement
    "enableSsl": true,
    "sslMode": "Require"
  }
}
```

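Before pointing ADF at the server, it can help to confirm that SSL-enforced connectivity works at all. A minimal sketch using `psycopg2`; the server, database, user, and password are the placeholders from the linked service above:

```python
import psycopg2

# sslmode="require" mirrors the linked service's SSL settings.
conn = psycopg2.connect(
    host="myserver.postgres.database.azure.com",
    port=5432,
    dbname="mydb",
    user="myuser",
    password="your-password",  # placeholder
    sslmode="require",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # PostgreSQL server version string
conn.close()
```
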
### Microsoft Fabric Warehouse Connector (NEW 2025)

**🆕 Native support for Microsoft Fabric Warehouse (Q3 2024+)**

**Supported Activities:**

- ✅ Copy Activity (source and sink)
- ✅ Lookup Activity
- ✅ Get Metadata Activity
- ✅ Script Activity
- ✅ Stored Procedure Activity

**Linked Service Configuration:**

```json
{
  "name": "FabricWarehouseLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse", // ✅ NEW dedicated Fabric Warehouse type
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "ServicePrincipal", // Recommended
      "servicePrincipalId": "<app-registration-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "fabric-warehouse-sp-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
```

**Alternative: Managed Identity Authentication (Preferred)**

```json
{
  "name": "FabricWarehouseLinkedService_ManagedIdentity",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "SystemAssignedManagedIdentity"
    }
  }
}
```

**Copy Activity Example:**

```json
{
  "name": "CopyToFabricWarehouse",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "AzureSqlSource",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "FabricWarehouseSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource"
    },
    "sink": {
      "type": "WarehouseSink",
      "writeBehavior": "insert", // or "upsert"
      "writeBatchSize": 10000,
      "tableOption": "autoCreate" // Auto-create the table if it does not exist
    },
    "enableStaging": true, // Recommended for large data
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      },
      "path": "staging/fabric-warehouse"
    },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        {
          "source": { "name": "CustomerID" },
          "sink": { "name": "customer_id" }
        }
      ]
    }
  }
}
```

**Best Practices for Fabric Warehouse:**

- ✅ Use managed identity for authentication (no secret rotation)
- ✅ Enable staging for large data loads (> 1 GB)
- ✅ Use `tableOption: autoCreate` for dynamic schema creation
- ✅ Leverage Fabric's lakehouse integration for unified analytics
- ✅ Monitor Fabric capacity unit (CU) consumption

### Enhanced Snowflake Connector

Improved performance:

```json
{
  "name": "SnowflakeLinkedService",
  "type": "Snowflake",
  "typeProperties": {
    "connectionString": "jdbc:snowflake://myaccount.snowflakecomputing.com",
    "database": "mydb",
    "warehouse": "mywarehouse",
    "authenticationType": "KeyPair",
    "username": "myuser",
    "privateKey": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-private-key"
    },
    "privateKeyPassphrase": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-passphrase"
    }
  }
}
```

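Key-pair authentication needs an RSA key pair: an encrypted private key (stored in Key Vault above) and a public key registered on the Snowflake user. A minimal generation sketch using the `cryptography` package; the passphrase is an illustrative placeholder:

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Encrypted PKCS#8 private key PEM: store in Key Vault as "snowflake-private-key".
private_pem = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.BestAvailableEncryption(b"my-passphrase"),
)

# Public key PEM: register the base64 body (without the header/footer lines)
# on the Snowflake user, e.g. ALTER USER myuser SET RSA_PUBLIC_KEY='<base64 body>';
public_pem = key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)
print(public_pem.decode())
```
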
## Managed Identity for Azure Storage (2025)

### Azure Table Storage

Now supports system-assigned and user-assigned managed identity:

```json
{
  "name": "AzureTableStorageLinkedService",
  "type": "AzureTableStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.table.core.windows.net",
    "authenticationType": "ManagedIdentity" // New in 2025
    // Or user-assigned:
    // "credential": {
    //   "referenceName": "UserAssignedManagedIdentity"
    // }
  }
}
```

### Azure Files

Now supports managed identity authentication:

```json
{
  "name": "AzureFilesLinkedService",
  "type": "AzureFileStorage",
  "typeProperties": {
    "fileShare": "myshare",
    "accountName": "mystorageaccount",
    "authenticationType": "ManagedIdentity" // New in 2025
  }
}
```

## Mapping Data Flows - Spark 3.3

Mapping Data Flows now run on Spark 3.3:

**Performance Improvements:**

- 30% faster data processing
- Improved memory management
- Better partition handling
- Enhanced join performance

**New Features:**

- Adaptive Query Execution (AQE) - see the sketch below
- Dynamic partition pruning
- Improved caching
- Better column statistics

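AQE and dynamic partition pruning are managed by the ADF data flow runtime; there is nothing to configure in the data flow itself. For comparison, in a Spark environment you control (for example, a Databricks notebook behind a Job), the equivalent toggles look like this sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution: re-optimizes plans at runtime using shuffle statistics.
# Enabled by default in Spark 3.3; shown here for explicitness.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Dynamic partition pruning: skips partitions based on join-side filter values.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```
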
A minimal data flow definition is unchanged by the runtime upgrade:

```json
{
  "name": "DataFlow1",
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": { "referenceName": "SourceDataset" }
      }
    ],
    "transformations": [
      {
        "name": "Transform1"
      }
    ],
    "sinks": [
      {
        "dataset": { "referenceName": "SinkDataset" }
      }
    ]
  }
}
```

## Azure DevOps Server 2022 Support

Git integration now supports on-premises Azure DevOps Server 2022:

```json
{
  "name": "DataFactory",
  "properties": {
    "repoConfiguration": {
      "type": "AzureDevOpsGit",
      "accountName": "on-prem-ado-server",
      "projectName": "MyProject",
      "repositoryName": "adf-repo",
      "collaborationBranch": "main",
      "rootFolder": "/",
      "hostName": "https://ado-server.company.com" // On-premises server
    }
  }
}
```

## 🔐 Managed Identity 2025 Best Practices

### User-Assigned vs System-Assigned Managed Identity

**System-Assigned Managed Identity:**

```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2"
    // ✅ Uses the Data Factory's system-assigned identity automatically
  }
}
```

**User-Assigned Managed Identity (NEW 2025):**

```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2",
    "credential": {
      "referenceName": "UserAssignedManagedIdentityCredential",
      "type": "CredentialReference"
    }
  }
}
```

**When to Use User-Assigned:**

- ✅ Sharing an identity across multiple data factories
- ✅ Complex multi-environment setups
- ✅ Granular permission management
- ✅ Identity lifecycle independent of the data factory

**Credential Consolidation (NEW 2025):**

ADF now supports a centralized **Credentials** feature:

```json
{
  "name": "ManagedIdentityCredential",
  "type": "Microsoft.DataFactory/factories/credentials",
  "properties": {
    "type": "ManagedIdentity",
    "typeProperties": {
      "resourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity-name}"
    }
  }
}
```

**Benefits:**

- ✅ Consolidate all Microsoft Entra ID-based credentials in one place
- ✅ Reuse credentials across multiple linked services
- ✅ Centralized permission management
- ✅ Easier audit and compliance tracking

### MFA Enforcement Compatibility (October 2025)

**🚨 IMPORTANT: Azure requires MFA for all users by October 2025**

**Impact on ADF:**

- ✅ **Managed identities are UNAFFECTED** - No MFA required for service accounts
- ✅ Continue using system-assigned and user-assigned identities without changes
- ❌ **Interactive user logins affected** - Personal Azure AD accounts need MFA
- ✅ **Service principals with certificate auth** - Recommended alternative to secrets

**Best Practice:**

```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "myserver.database.windows.net",
    "database": "mydb",
    "authenticationType": "SystemAssignedManagedIdentity"
    // ✅ No MFA needed, no secret rotation, passwordless
  }
}
```

### Principle of Least Privilege (2025)

**Storage Blob Data Roles:**

- `Storage Blob Data Reader` - Read-only access (source)
- `Storage Blob Data Contributor` - Read/write access (sink)
- ❌ Avoid `Storage Blob Data Owner` unless needed

**SQL Database Roles:**

```sql
-- Create a contained database user for the managed identity
CREATE USER [datafactory-name] FROM EXTERNAL PROVIDER;

-- Grant minimal required permissions
ALTER ROLE db_datareader ADD MEMBER [datafactory-name];
ALTER ROLE db_datawriter ADD MEMBER [datafactory-name];

-- ❌ Avoid db_owner unless truly needed
```

**Key Vault Access Policies:**

```json
{
  "permissions": {
    "secrets": ["Get"] // ✅ Only the Get permission is needed
    // ❌ Don't grant List, Set, Delete unless required
  }
}
```

## Best Practices (2025)

1. **Use Databricks Job Activity (MANDATORY):**
   - ❌ STOP using the Notebook, Python, and JAR activities
   - ✅ Migrate to the DatabricksJob activity immediately
   - ✅ Define workflows in the Databricks workspace
   - ✅ Leverage serverless compute (no cluster config needed)
   - ✅ Utilize advanced features (Run As, Task Values, If/Else, Repair Runs)

2. **Managed Identity Authentication (MANDATORY 2025):**
   - ✅ Use managed identities for ALL Azure resources
   - ✅ Prefer system-assigned for simple scenarios
   - ✅ Use user-assigned for shared identity needs
   - ✅ Leverage the Credentials feature for consolidation
   - ✅ MFA-compliant for the October 2025 enforcement
   - ❌ Avoid access keys and connection strings
   - ✅ Store any remaining secrets in Key Vault

3. **Monitor Job Execution:**
   - Track Databricks Job run IDs from the ADF activity output (see the sketch after this list)
   - Log Job parameters for auditability
   - Set up alerts for job failures
   - Use Databricks job-level monitoring
   - Leverage built-in lineage tracking

4. **Optimize Spark 3.3 Usage (Data Flows):**
   - Enable Adaptive Query Execution (AQE)
   - Use appropriate partition counts (4-8 per core)
   - Monitor execution plans in Databricks
   - Use broadcast joins for small dimensions
   - Implement dynamic partition pruning

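Given a run ID captured from the activity output (for example, forwarded to a logging service as in the pipeline above), the run's state can be looked up via the Databricks Jobs API 2.1 (`GET /api/2.1/jobs/runs/get`). A minimal sketch; `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are illustrative assumptions:

```python
import os

import requests

def get_run_state(run_id: int) -> dict:
    """Fetch the state block (life_cycle_state, result_state) for a job run."""
    response = requests.get(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/runs/get",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        params={"run_id": run_id},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["state"]

state = get_run_state(987654)  # run ID from the ADF activity output
print(state.get("life_cycle_state"), state.get("result_state"))
```
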
## Resources

- [Databricks Job Activity](https://learn.microsoft.com/azure/data-factory/transform-data-using-databricks-spark-job)
- [ADF Connectors](https://learn.microsoft.com/azure/data-factory/connector-overview)
- [Managed Identity Authentication](https://learn.microsoft.com/azure/data-factory/data-factory-service-identity)
- [Mapping Data Flows](https://learn.microsoft.com/azure/data-factory/concepts-data-flow-overview)