---
name: adf-validation-rules
description: Comprehensive Azure Data Factory validation rules, activity nesting limitations, linked service requirements, and edge-case handling guidance
---

## 🚨 CRITICAL GUIDELINES

### Windows File Path Requirements

**MANDATORY: Always Use Backslashes on Windows for File Paths**

When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).

**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`

This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems

### Documentation Guidelines

**NEVER create new documentation files unless explicitly requested by the user.**

- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep the repository root clean - only README.md unless the user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when the user specifically asks for documentation

---

# Azure Data Factory Validation Rules and Limitations

## 🚨 CRITICAL: Activity Nesting Limitations

Azure Data Factory has **STRICT** nesting rules for control flow activities. Violating these rules will cause pipeline failures or prevent pipeline creation.

### Supported Control Flow Activities for Nesting

Four control flow activities support nested activities:
- **ForEach**: Iterates over a collection and executes the contained activities in a loop
- **If Condition**: Branches based on a true/false expression evaluation
- **Until**: Implements do-until loops with a configurable timeout (see the sketch below)
- **Switch**: Executes the branch of activities whose case matches an evaluated expression
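
For orientation, a minimal Until sketch with an explicit timeout (activity and variable names are hypothetical):

```json
{
  "name": "Until_WaitForFlag",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@equals(variables('Done'), true)",
      "type": "Expression"
    },
    "timeout": "0.01:00:00",
    "activities": [
      // Activities repeated until the expression evaluates to true
    ]
  }
}
```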

### ✅ PERMITTED Nesting Combinations

| Parent Activity | Can Contain | Notes |
|----------------|-------------|-------|
| **ForEach** | If Condition | ✅ Allowed |
| **ForEach** | Switch | ✅ Allowed |
| **Until** | If Condition | ✅ Allowed |
| **Until** | Switch | ✅ Allowed |

### ❌ PROHIBITED Nesting Combinations

| Parent Activity | CANNOT Contain | Reason |
|----------------|----------------|--------|
| **If Condition** | ForEach | ❌ Not supported - use Execute Pipeline workaround |
| **If Condition** | Switch | ❌ Not supported - use Execute Pipeline workaround |
| **If Condition** | Until | ❌ Not supported - use Execute Pipeline workaround |
| **If Condition** | Another If | ❌ Cannot nest If within If |
| **Switch** | ForEach | ❌ Not supported - use Execute Pipeline workaround |
| **Switch** | If Condition | ❌ Not supported - use Execute Pipeline workaround |
| **Switch** | Until | ❌ Not supported - use Execute Pipeline workaround |
| **Switch** | Another Switch | ❌ Cannot nest Switch within Switch |
| **ForEach** | Another ForEach | ❌ Single level only - use Execute Pipeline workaround |
| **Until** | Another Until | ❌ Single level only - use Execute Pipeline workaround |
| **ForEach** | Until | ❌ Single level only - use Execute Pipeline workaround |
| **Until** | ForEach | ❌ Single level only - use Execute Pipeline workaround |

### 🚫 Special Activity Restrictions

**Validation Activity**:
- ❌ **CANNOT** be placed inside ANY nested activity
- ❌ **CANNOT** be used within ForEach, If, Switch, or Until activities
- ✅ Must be at pipeline root level only
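
For reference, a minimal root-level Validation activity sketch (activity and dataset names are hypothetical):

```json
{
  "name": "Validate_SourceFileExists",
  "type": "Validation",
  "typeProperties": {
    "dataset": {
      "referenceName": "DS_InputBlob",
      "type": "DatasetReference"
    },
    "timeout": "0.00:10:00",
    "sleep": 10,        // seconds between existence checks
    "minimumSize": 1    // bytes; require a non-empty file
  }
}
```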

### 🔧 Workaround: Execute Pipeline Pattern

**The ONLY supported workaround for prohibited nesting combinations:**

Instead of direct nesting, use the **Execute Pipeline Activity** to call a child pipeline:

```json
{
  "name": "ParentPipeline_WithIfCondition",
  "activities": [
    {
      "name": "IfCondition_Parent",
      "type": "IfCondition",
      "typeProperties": {
        "expression": "@equals(pipeline().parameters.ProcessData, 'true')",
        "ifTrueActivities": [
          {
            "name": "ExecuteChildPipeline_WithForEach",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": {
                "referenceName": "ChildPipeline_ForEachLoop",
                "type": "PipelineReference"
              },
              "parameters": {
                "ItemList": "@pipeline().parameters.Items"
              }
            }
          }
        ]
      }
    }
  ]
}
```

**Child Pipeline Structure:**
```json
{
  "name": "ChildPipeline_ForEachLoop",
  "parameters": {
    "ItemList": {"type": "array"}
  },
  "activities": [
    {
      "name": "ForEach_InChildPipeline",
      "type": "ForEach",
      "typeProperties": {
        "items": "@pipeline().parameters.ItemList",
        "activities": [
          // Your ForEach logic here
        ]
      }
    }
  ]
}
```

**Why This Works:**
- Each pipeline can have ONE level of nesting
- Execute Pipeline creates a new pipeline context
- Child pipeline gets its own nesting level allowance
- Enables unlimited depth through pipeline chaining

## 🔢 Activity and Resource Limits

### Pipeline Limits

| Resource | Limit | Notes |
|----------|-------|-------|
| **Activities per pipeline** | 80 | Includes inner activities for containers |
| **Parameters per pipeline** | 50 | - |
| **ForEach concurrent iterations** | 50 (maximum) | Set via `batchCount` property |
| **ForEach items** | 100,000 | - |
| **Lookup activity rows** | 5,000 | Maximum rows returned |
| **Lookup activity size** | 4 MB | Maximum size of returned data |
| **Web activity timeout** | 1 hour | Default timeout for Web activities |
| **Copy activity timeout** | 7 days | Maximum execution time |
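
To make the Lookup limits concrete, here is a minimal sketch of a multi-row Lookup (dataset name and query are hypothetical); with `firstRowOnly: false` the result set must stay under 5,000 rows and 4 MB:

```json
{
  "name": "Lookup_ConfigRows",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT TOP (5000) ConfigKey, ConfigValue FROM dbo.PipelineConfig"
    },
    "dataset": {
      "referenceName": "DS_AzureSql_Config",
      "type": "DatasetReference"
    },
    "firstRowOnly": false
  }
}
```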

### ForEach Activity Configuration
```json
{
  "name": "ForEachActivity",
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.ItemList",
    "isSequential": false,  // false = parallel execution
    "batchCount": 50,       // Max 50 concurrent iterations
    "activities": [
      // Nested activities
    ]
  }
}
```

**Critical Considerations:**
- `isSequential: true` → Executes one item at a time (slow but predictable)
- `isSequential: false` → Executes up to `batchCount` items in parallel
- Maximum `batchCount` is **50**, regardless of the configured value
- **Cannot use Set Variable activity** inside a parallel ForEach (variable scope is pipeline-level)

### Set Variable Activity Limitations

❌ **CANNOT** use `Set Variable` inside ForEach with `isSequential: false`
- Reason: Variables are pipeline-scoped, not ForEach-scoped
- Multiple parallel iterations would cause race conditions
- ✅ **Alternative**: Use `Append Variable` with an array-type variable, or use sequential execution
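
A minimal sketch of the Append Variable alternative inside a parallel ForEach (names are hypothetical; `Results` must be declared as an Array-type pipeline variable):

```json
{
  "name": "ForEach_Parallel",
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.ItemList",
    "isSequential": false,
    "activities": [
      {
        "name": "AppendItemName",
        "type": "AppendVariable",
        "typeProperties": {
          "variableName": "Results",  // Array-type pipeline variable
          "value": "@item().name"
        }
      }
    ]
  }
}
```

Append operations are additive, so concurrent iterations do not overwrite each other; note that the resulting array order is not guaranteed under parallel execution.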

## 📊 Linked Services: Azure Blob Storage

### Authentication Methods

#### 1. Account Key (Basic)
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": {
      "type": "SecureString",
      "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```
**⚠️ Limitations:**
- Secondary Blob service endpoints are **NOT supported**
- **Security Risk**: Account keys should be stored in Azure Key Vault
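
One way to keep the key out of the pipeline definition — a sketch of the same linked service resolving its connection string from Key Vault (linked service and secret names are hypothetical):

```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "LS_AzureKeyVault",
        "type": "LinkedServiceReference"
      },
      "secretName": "blob-connection-string"
    }
  }
}
```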

#### 2. Shared Access Signature (SAS)
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "sasUri": {
      "type": "SecureString",
      "value": "https://<account>.blob.core.windows.net/<container>?<SAS-token>"
    }
  }
}
```
**Critical Requirements:**
- Dataset `folderPath` must be **absolute path from container level**
- SAS token expiry **must extend beyond pipeline execution**
- SAS URI path must align with dataset configuration

#### 3. Service Principal
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://<account>.blob.core.windows.net",
    "accountKind": "StorageV2",  // REQUIRED for service principal
    "servicePrincipalId": "<client-id>",
    "servicePrincipalCredential": {
      "type": "SecureString",
      "value": "<client-secret>"
    },
    "tenant": "<tenant-id>"
  }
}
```
**Critical Requirements:**
- `accountKind` **MUST** be set (StorageV2, BlobStorage, or BlockBlobStorage)
- Service Principal requires **Storage Blob Data Reader** (source) or **Storage Blob Data Contributor** (sink) role
- ❌ **NOT compatible** with soft-deleted blob accounts in Data Flow

#### 4. Managed Identity (Recommended)
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://<account>.blob.core.windows.net",
    "accountKind": "StorageV2"  // REQUIRED for managed identity
  },
  "connectVia": {
    "referenceName": "AutoResolveIntegrationRuntime",
    "type": "IntegrationRuntimeReference"
  }
}
```
**Critical Requirements:**
- `accountKind` **MUST** be specified (cannot be empty or "Storage")
- ❌ Empty or "Storage" account kind will cause Data Flow failures
- Managed identity must have **Storage Blob Data Reader/Contributor** role assigned
- For Storage firewall: **Must enable "Allow trusted Microsoft services"**

### Common Blob Storage Pitfalls

| Issue | Cause | Solution |
|-------|-------|----------|
| Data Flow fails with managed identity | `accountKind` empty or "Storage" | Set `accountKind` to StorageV2 |
| Secondary endpoint doesn't work | Using account key auth | Not supported - use different auth method |
| SAS token expired during run | Token expiry too short | Extend SAS token validity period |
| Cannot access $logs container | System container not visible in UI | Use direct path reference |
| Soft-deleted blobs inaccessible | Service principal/managed identity | Use account key or SAS instead |
| Private endpoint connection fails | Wrong endpoint for Data Flow | Ensure ADLS Gen2 private endpoint exists |

## 📊 Linked Services: Azure SQL Database

### Authentication Methods

#### 1. SQL Authentication
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "SQL",
    "userName": "<username>",
    "password": {
      "type": "SecureString",
      "value": "<password>"
    }
  }
}
```
**Best Practice:**
- Store password in Azure Key Vault
- Use connection string with Key Vault reference
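
For example, the `password` property can reference a Key Vault secret instead of an inline SecureString (linked service and secret names are hypothetical):

```json
"password": {
  "type": "AzureKeyVaultSecret",
  "store": {
    "referenceName": "LS_AzureKeyVault",
    "type": "LinkedServiceReference"
  },
  "secretName": "sql-user-password"
}
```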

#### 2. Service Principal
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "ServicePrincipal",
    "servicePrincipalId": "<client-id>",
    "servicePrincipalCredential": {
      "type": "SecureString",
      "value": "<client-secret>"
    },
    "tenant": "<tenant-id>"
  }
}
```
**Requirements:**
- Microsoft Entra admin must be configured on the SQL server
- Service principal must have a contained database user created
- Grant appropriate roles: `db_datareader`, `db_datawriter`, etc.

#### 3. Managed Identity
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "SystemAssignedManagedIdentity"
  }
}
```
**Requirements:**
- Create a contained database user for the managed identity
- Grant appropriate database roles
- Configure firewall to allow Azure services (or specific IP ranges)

### SQL Database Configuration Best Practices

#### Connection String Parameters
```
Server=tcp:<server>.database.windows.net,1433;
Database=<database>;
Encrypt=mandatory;  // Options: mandatory, optional, strict
TrustServerCertificate=false;
ConnectTimeout=30;
CommandTimeout=120;
Pooling=true;
ConnectRetryCount=3;
ConnectRetryInterval=10;
```

**Critical Parameters:**
- `Encrypt`: Default is `mandatory` (recommended)
- `Pooling`: Set to `false` if experiencing idle connection issues
- `ConnectRetryCount`: Recommended for transient fault handling
- `ConnectRetryInterval`: Seconds between retries

### Common SQL Database Pitfalls

| Issue | Cause | Solution |
|-------|-------|----------|
| Serverless tier auto-paused | Pipeline doesn't wait for resume | Implement retry logic or keep-alive |
| Connection pool timeout | Idle connections closed | Add `Pooling=false` or configure retry |
| Firewall blocks connection | IP not whitelisted | Add Azure IR IPs or enable Azure services |
| Always Encrypted fails in Data Flow | Not supported for sink | Use service principal/managed identity in copy activity |
| Decimal precision loss | Copy supports up to 28 precision | Use string type for higher precision |
| Parallel copy not working | No partition configuration | Enable physical or dynamic range partitioning |

### Performance Optimization

#### Parallel Copy Configuration
```json
{
  "source": {
    "type": "AzureSqlSource",
    "partitionOption": "PhysicalPartitionsOfTable"  // or "DynamicRange"
  },
  "parallelCopies": 8,  // Recommended: (DIU or IR nodes) × (2 to 4)
  "enableStaging": true,
  "stagingSettings": {
    "linkedServiceName": {
      "referenceName": "AzureBlobStorage",
      "type": "LinkedServiceReference"
    }
  }
}
```

**Partition Options:**
- `PhysicalPartitionsOfTable`: Uses SQL Server physical partitions
- `DynamicRange`: Creates logical partitions based on column values (see the sketch below)
- `None`: No partitioning (default)
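
A sketch of `DynamicRange` with explicit partition settings (column name and bound values are hypothetical):

```json
{
  "type": "AzureSqlSource",
  "partitionOption": "DynamicRange",
  "partitionSettings": {
    "partitionColumnName": "OrderId",
    "partitionLowerBound": "1",
    "partitionUpperBound": "1000000"
  }
}
```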

**Staging Best Practices:**
- Always use staging for large data movements (> 1GB)
- Use PolyBase or COPY statement for best performance
- Parquet format recommended for staging files

## 🔍 Data Flow Limitations

### General Limits
- **Column name length**: 128 characters maximum
- **Row size**: 1 MB maximum (some sinks like SQL have lower limits)
- **String column size**: Varies by sink (SQL: 8000 for varchar, 4000 for nvarchar)

### Transformation-Specific Limits

| Transformation | Limitation |
|----------------|------------|
| **Lookup** | Cache size limited by cluster memory |
| **Join** | Large joins may cause memory errors |
| **Pivot** | Maximum 10,000 unique values |
| **Window** | Requires partitioning for large datasets |

### Performance Considerations
- **Partitioning**: Always partition large datasets before transformations
- **Broadcast**: Use broadcast hint for small dimension tables
- **Sink optimization**: Enable table option "Recreate" instead of "Truncate" for better performance

## 🛡️ Validation Checklist for Pipeline Creation

### Before Creating Pipeline
- [ ] Verify activity nesting follows permitted combinations
- [ ] Check ForEach activities don't contain other ForEach/Until
- [ ] Verify If/Switch activities don't contain ForEach/Until/If/Switch
- [ ] Ensure Validation activities are at pipeline root level only
- [ ] Confirm total activities < 80 per pipeline
- [ ] Verify no Set Variable activities in parallel ForEach

### Linked Service Validation
- [ ] **Blob Storage**: If using managed identity/service principal, `accountKind` is set
- [ ] **SQL Database**: Authentication method matches security requirements
- [ ] **All services**: Secrets stored in Key Vault, not hardcoded
- [ ] **All services**: Firewall rules configured for integration runtime IPs
- [ ] **Network**: Private endpoints configured if using VNet integration

### Activity Configuration Validation
- [ ] **ForEach**: `batchCount` ≤ 50 if parallel execution
- [ ] **Lookup**: Query returns < 5,000 rows and < 4 MB data
- [ ] **Copy**: DIU configured appropriately (2-256 for Azure IR)
- [ ] **Copy**: Staging enabled for large data movements
- [ ] **All activities**: Timeout values appropriate for expected execution time
- [ ] **All activities**: Retry logic configured for transient failures (see the policy sketch below)
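
The timeout and retry items map to each activity's `policy` block; a typical sketch (values are illustrative):

```json
"policy": {
  "timeout": "0.02:00:00",
  "retry": 3,
  "retryIntervalInSeconds": 30,
  "secureInput": false,
  "secureOutput": false
}
```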

### Data Flow Validation
- [ ] Column names ≤ 128 characters
- [ ] Source query doesn't return > 1 MB per row
- [ ] Partitioning configured for large datasets
- [ ] Sink has appropriate schema and data type mappings
- [ ] Staging linked service configured for optimal performance

## 🔍 Automated Validation Script

**CRITICAL: Always run automated validation before committing or deploying ADF pipelines!**

The adf-master plugin includes a comprehensive PowerShell validation script that checks for ALL the rules and limitations documented above.

### Using the Validation Script

**Location:** `${CLAUDE_PLUGIN_ROOT}/scripts/validate-adf-pipelines.ps1`

**Basic usage:**
```powershell
# From the root of your ADF repository
pwsh -File validate-adf-pipelines.ps1
```

**With custom paths:**
```powershell
pwsh -File validate-adf-pipelines.ps1 `
  -PipelinePath "path/to/pipeline" `
  -DatasetPath "path/to/dataset"
```

**With strict mode (additional warnings):**
```powershell
pwsh -File validate-adf-pipelines.ps1 -Strict
```

### What the Script Validates

The automated validation script checks for issues that Microsoft's official `@microsoft/azure-data-factory-utilities` package does **NOT** validate:

1. **Activity Nesting Violations:**
   - ForEach → ForEach, Until, Validation
   - Until → Until, ForEach, Validation
   - IfCondition → IfCondition, Switch, ForEach, Until, Validation
   - Switch → Switch, IfCondition, ForEach, Until, Validation

2. **Resource Limits:**
   - Pipeline activity count (max 120, warn at 100)
   - Pipeline parameter count (max 50)
   - Pipeline variable count (max 50)
   - ForEach batchCount limit (max 50, warn at 30 in strict mode)

3. **Variable Scope Violations:**
   - SetVariable in parallel ForEach (causes race conditions)
   - Proper AppendVariable vs SetVariable usage

4. **Dataset Configuration Issues:**
   - Missing fileName or wildcardFileName for file-based datasets
   - AzureBlobFSLocation missing the required fileSystem property
   - Missing required properties for DelimitedText, Json, Parquet types

5. **Copy Activity Validations:**
   - Source/sink type compatibility with dataset types
   - Lookup activity firstRowOnly=false warnings (5,000-row/4 MB limits)
   - Blob file dependencies (additionalColumns logging pattern)

### Integration with CI/CD

**GitHub Actions example:**
```yaml
- name: Validate ADF Pipelines
  run: |
    pwsh -File validate-adf-pipelines.ps1 -PipelinePath pipeline -DatasetPath dataset
  shell: pwsh
```

**Azure DevOps example:**
```yaml
- task: PowerShell@2
  displayName: 'Validate ADF Pipelines'
  inputs:
    filePath: 'validate-adf-pipelines.ps1'
    arguments: '-PipelinePath pipeline -DatasetPath dataset'
    pwsh: true
```

### Command Reference

Use the `/adf-validate` command to run the validation script with proper guidance:

```bash
/adf-validate
```

This command will:
1. Detect your ADF repository structure
2. Run the validation script with appropriate paths
3. Parse and explain any errors or warnings found
4. Provide specific solutions for each violation
5. Recommend next actions based on results
6. Suggest CI/CD integration patterns

### Exit Codes

- **0**: Validation passed (no errors)
- **1**: Validation failed (errors found - DO NOT DEPLOY)

### Best Practices

1. **Run validation before every commit** to catch issues early
2. **Add validation to CI/CD pipeline** to prevent invalid deployments
3. **Use strict mode during development** for additional warnings
4. **Re-validate after bulk changes** or generated pipelines
5. **Document validation exceptions** if you must bypass a warning
6. **Share validation results with team** to prevent repeated mistakes

## 🚨 CRITICAL: Enforcement Protocol

**When creating or modifying ADF pipelines:**

1. **ALWAYS validate activity nesting** against the permitted/prohibited table
2. **REJECT** any attempt to create prohibited nesting combinations
3. **SUGGEST** Execute Pipeline workaround for complex nesting needs
4. **VALIDATE** linked service authentication matches the connector type
5. **CHECK** all limits (activities, parameters, ForEach iterations, etc.)
6. **VERIFY** required properties are set (e.g., `accountKind` for managed identity)
7. **WARN** about common pitfalls specific to the connector being used

**Example Validation Response:**
```
❌ INVALID PIPELINE STRUCTURE DETECTED:

Issue: ForEach activity contains another ForEach activity
Location: Pipeline "PL_DataProcessing" → ForEach "OuterLoop" → ForEach "InnerLoop"

This violates Azure Data Factory nesting rules:
- ForEach activities support only a SINGLE level of nesting
- You CANNOT nest ForEach within ForEach

✅ RECOMMENDED SOLUTION:
Use the Execute Pipeline pattern:
1. Create a child pipeline with the inner ForEach logic
2. Replace the inner ForEach with an Execute Pipeline activity
3. Pass required parameters to the child pipeline

Would you like me to generate the refactored pipeline structure?
```

## 📚 Reference Documentation

**Official Microsoft Learn Resources:**
- Activity nesting: https://learn.microsoft.com/en-us/azure/data-factory/concepts-nested-activities
- Blob Storage connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage
- SQL Database connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database
- Pipeline limits: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#data-factory-limits

**Last Updated:** 2025-01-24 (Based on official Microsoft documentation)

This validation rules skill MUST be consulted before creating or modifying ANY Azure Data Factory pipeline to ensure compliance with platform limitations and best practices.