Notebook Operations
Comprehensive guide for working with Fabric notebooks using the Fabric CLI.
Overview
Fabric notebooks are interactive documents for data engineering, data science, and analytics. They can be executed, scheduled, and managed via the CLI.
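A quick taste of the commands covered below, assuming a workspace and notebook with these names already exist:
fab exists "Production.Workspace/ETL Pipeline.Notebook"        # check the notebook exists
fab job run "Production.Workspace/ETL Pipeline.Notebook"       # run it and wait for completion
fab job run-list "Production.Workspace/ETL Pipeline.Notebook"  # review past runs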
Getting Notebook Information
Basic Notebook Info
# Check if notebook exists
fab exists "Production.Workspace/ETL Pipeline.Notebook"
# Get notebook properties
fab get "Production.Workspace/ETL Pipeline.Notebook"
# Get with verbose details
fab get "Production.Workspace/ETL Pipeline.Notebook" -v
# Get only notebook ID
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "id"
Get Notebook Definition
# Get full notebook definition
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition"
# Save definition to file
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition" -o /tmp/notebook-def.json
# Get notebook content (cells)
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition.parts[?path=='notebook-content.py'].payload | [0]"
Exporting Notebooks
Export as IPYNB
# Export notebook
fab export "Production.Workspace/ETL Pipeline.Notebook" -o /tmp/notebooks
# This creates:
# /tmp/notebooks/ETL Pipeline.Notebook/
# ├── notebook-content.py (or .ipynb)
# └── metadata files
Export All Notebooks from Workspace
# Export all notebooks
WS_ID=$(fab get "Production.Workspace" -q "id")
# Names with spaces (e.g. "ETL Pipeline") would be split by a plain for-loop,
# so read one display name per line instead (assumes the query prints one name per line)
fab api "workspaces/$WS_ID/items" -q "value[?type=='Notebook'].displayName" | \
while IFS= read -r NOTEBOOK; do
  fab export "Production.Workspace/$NOTEBOOK.Notebook" -o /tmp/notebooks
done
Importing Notebooks
Import from Local
# Import notebook from .ipynb format (default)
fab import "Production.Workspace/New ETL.Notebook" -i /tmp/notebooks/ETL\ Pipeline.Notebook
# Import from .py format
fab import "Production.Workspace/Script.Notebook" -i /tmp/script.py --format py
Copy Between Workspaces
# Copy notebook
fab cp "Dev.Workspace/ETL.Notebook" "Production.Workspace"
# Copy with new name
fab cp "Dev.Workspace/ETL.Notebook" "Production.Workspace/Prod ETL.Notebook"
Creating Notebooks
Create Blank Notebook
# Get workspace ID first
fab get "Production.Workspace" -q "id"
# Create via API
fab api -X post "workspaces/<workspace-id>/notebooks" -i '{"displayName": "New Data Processing"}'
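The two steps can be combined in a short script that captures the workspace ID in a variable (same commands as above):
WS_ID=$(fab get "Production.Workspace" -q "id")
fab api -X post "workspaces/$WS_ID/notebooks" -i '{"displayName": "New Data Processing"}'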
Create and Configure Query Notebook
Use this workflow to create a notebook for querying lakehouse tables with Spark SQL.
Step 1: Create the notebook
# Get workspace ID
fab get "Sales.Workspace" -q "id"
# Returns: 4caf7825-81ac-4c94-9e46-306b4c20a4d5
# Create notebook
fab api -X post "workspaces/4caf7825-81ac-4c94-9e46-306b4c20a4d5/notebooks" -i '{"displayName": "Data Query"}'
# Returns notebook ID: 97bbd18d-c293-46b8-8536-82fb8bc9bd58
Step 2: Get lakehouse ID (required for notebook metadata)
fab get "Sales.Workspace/SalesLH.Lakehouse" -q "id"
# Returns: ddbcc575-805b-4922-84db-ca451b318755
Step 3: Create notebook code in Fabric format
cat > /tmp/notebook.py <<'EOF'
# Fabric notebook source
# METADATA ********************
# META {
# META "kernel_info": {
# META "name": "synapse_pyspark"
# META },
# META "dependencies": {
# META "lakehouse": {
# META "default_lakehouse": "ddbcc575-805b-4922-84db-ca451b318755",
# META "default_lakehouse_name": "SalesLH",
# META "default_lakehouse_workspace_id": "4caf7825-81ac-4c94-9e46-306b4c20a4d5"
# META }
# META }
# META }
# CELL ********************
# Query lakehouse table
df = spark.sql("""
SELECT
date_key,
COUNT(*) as num_records
FROM gold.sets
GROUP BY date_key
ORDER BY date_key DESC
LIMIT 10
""")
# IMPORTANT: Convert to pandas and print to capture output
# display(df) will NOT show results via API
pandas_df = df.toPandas()
print(pandas_df)
print(f"\nLatest date: {pandas_df.iloc[0]['date_key']}")
EOF
Step 4: Base64 encode and create update definition
base64 -i /tmp/notebook.py > /tmp/notebook-b64.txt
# Note: the payload must be a single line; with GNU coreutils use: base64 -w 0 /tmp/notebook.py > /tmp/notebook-b64.txt
cat > /tmp/update.json <<EOF
{
"definition": {
"parts": [
{
"path": "notebook-content.py",
"payload": "$(cat /tmp/notebook-b64.txt)",
"payloadType": "InlineBase64"
}
]
}
}
EOF
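Because a wrapped (multi-line) base64 payload would break the JSON, it is worth validating the file before posting. A small sanity check, assuming jq is installed (it is also used later in this guide):
jq . /tmp/update.json > /dev/null && echo "update.json is valid JSON"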
Step 5: Update notebook with code
fab api -X post "workspaces/4caf7825-81ac-4c94-9e46-306b4c20a4d5/notebooks/97bbd18d-c293-46b8-8536-82fb8bc9bd58/updateDefinition" -i /tmp/update.json --show_headers
# Returns operation ID in Location header
Step 6: Check update completed
fab api "operations/<operation-id>"
# Wait for status: "Succeeded"
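To script the wait instead of checking manually, a simple polling loop works. A sketch that assumes the response body contains the top-level "status" field shown above:
OP_ID="<operation-id>"
while true; do
  STATUS=$(fab api "operations/$OP_ID" | grep -o '"status": "[^"]*"' | cut -d'"' -f4)
  echo "Operation status: $STATUS"
  if [[ "$STATUS" == "Succeeded" || "$STATUS" == "Failed" ]]; then
    break
  fi
  sleep 5
done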
Step 7: Run the notebook
fab job start "Sales.Workspace/Data Query.Notebook"
# Returns job instance ID
Step 8: Check execution status
fab job run-status "Sales.Workspace/Data Query.Notebook" --id <job-id>
# Wait for status: "Completed"
Step 9: Get results (download from Fabric UI)
- Open notebook in Fabric UI after execution
- Print output will be visible in cell outputs
- Download .ipynb file to see printed results locally
Critical Requirements
- File format: Must be notebook-content.py (NOT .ipynb)
- Lakehouse ID: Must include the default_lakehouse ID in the metadata (not just the name)
- Spark session: Automatically available when a lakehouse is attached
- Capturing output: Use df.toPandas() and print() - display() won't show results via the API
- Results location: Print output is visible in the UI and the downloaded .ipynb, NOT in the definition
Common Issues
- NameError: name 'spark' is not defined - lakehouse not attached (missing default_lakehouse ID)
- Job "Completed" but no results - used display() instead of print()
- Update fails - used the .ipynb path instead of .py
Create from Template
# Export template
fab export "Templates.Workspace/Template Notebook.Notebook" -o /tmp/templates
# Import as new notebook
fab import "Production.Workspace/Custom Notebook.Notebook" -i /tmp/templates/Template\ Notebook.Notebook
Running Notebooks
Run Synchronously (Wait for Completion)
# Run notebook and wait
fab job run "Production.Workspace/ETL Pipeline.Notebook"
# Run with timeout (seconds)
fab job run "Production.Workspace/Long Process.Notebook" --timeout 600
Run with Parameters
# Run with basic parameters
fab job run "Production.Workspace/ETL Pipeline.Notebook" -P \
date:string=2024-01-01,\
batch_size:int=1000,\
debug_mode:bool=false,\
threshold:float=0.95
# Parameters must match types defined in notebook
# Supported types: string, int, float, bool
Run with Spark Configuration
# Run with custom Spark settings
fab job run "Production.Workspace/Big Data Processing.Notebook" -C '{
"conf": {
"spark.executor.memory": "8g",
"spark.executor.cores": "4",
"spark.dynamicAllocation.enabled": "true"
},
"environment": {
"id": "<environment-id>",
"name": "Production Environment"
}
}'
# Run with default lakehouse
fab job run "Production.Workspace/Data Ingestion.Notebook" -C '{
"defaultLakehouse": {
"name": "MainLakehouse",
"id": "<lakehouse-id>",
"workspaceId": "<workspace-id>"
}
}'
# Run with workspace Spark pool
fab job run "Production.Workspace/Analytics.Notebook" -C '{
"useStarterPool": false,
"useWorkspacePool": "HighMemoryPool"
}'
Run with Combined Parameters and Configuration
# Combine parameters and configuration
fab job run "Production.Workspace/ETL Pipeline.Notebook" \
-P date:string=2024-01-01,batch:int=500 \
-C '{
"defaultLakehouse": {"name": "StagingLH", "id": "<lakehouse-id>"},
"conf": {"spark.sql.shuffle.partitions": "200"}
}'
Run Asynchronously
# Start notebook and return immediately
JOB_ID=$(fab job start "Production.Workspace/ETL Pipeline.Notebook" | grep -o '"id": "[^"]*"' | cut -d'"' -f4)
# Check status later
fab job run-status "Production.Workspace/ETL Pipeline.Notebook" --id "$JOB_ID"
Monitoring Notebook Executions
Get Job Status
# Check specific job
fab job run-status "Production.Workspace/ETL Pipeline.Notebook" --id <job-id>
# Get detailed status via API
WS_ID=$(fab get "Production.Workspace" -q "id")
NOTEBOOK_ID=$(fab get "Production.Workspace/ETL Pipeline.Notebook" -q "id")
fab api "workspaces/$WS_ID/items/$NOTEBOOK_ID/jobs/instances/<job-id>"
List Execution History
# List all job runs
fab job run-list "Production.Workspace/ETL Pipeline.Notebook"
# List only scheduled runs
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" --schedule
# Get latest run status
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" | head -n 1
Cancel Running Job
fab job run-cancel "Production.Workspace/ETL Pipeline.Notebook" --id <job-id>
Scheduling Notebooks
Create Cron Schedule
# Run every 30 minutes
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" \
--type cron \
--interval 30 \
--start 2024-11-15T00:00:00 \
--end 2025-12-31T23:59:00 \
--enable
Create Daily Schedule
# Run daily at 2 AM and 2 PM
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" \
--type daily \
--interval 02:00,14:00 \
--start 2024-11-15T00:00:00 \
--end 2025-12-31T23:59:00 \
--enable
Create Weekly Schedule
# Run Monday and Friday at 9 AM
fab job run-sch "Production.Workspace/Weekly Report.Notebook" \
--type weekly \
--interval 09:00 \
--days Monday,Friday \
--start 2024-11-15T00:00:00 \
--enable
Update Schedule
# Modify existing schedule
fab job run-update "Production.Workspace/ETL Pipeline.Notebook" \
--id <schedule-id> \
--type daily \
--interval 03:00 \
--enable
# Disable schedule
fab job run-update "Production.Workspace/ETL Pipeline.Notebook" \
--id <schedule-id> \
--disable
Notebook Configuration
Set Default Lakehouse
# Via notebook properties
fab set "Production.Workspace/ETL.Notebook" -q lakehouse -i '{
"known_lakehouses": [{"id": "<lakehouse-id>"}],
"default_lakehouse": "<lakehouse-id>",
"default_lakehouse_name": "MainLakehouse",
"default_lakehouse_workspace_id": "<workspace-id>"
}'
Set Default Environment
fab set "Production.Workspace/ETL.Notebook" -q environment -i '{
"environmentId": "<environment-id>",
"workspaceId": "<workspace-id>"
}'
Set Default Warehouse
fab set "Production.Workspace/Analytics.Notebook" -q warehouse -i '{
"known_warehouses": [{"id": "<warehouse-id>", "type": "Datawarehouse"}],
"default_warehouse": "<warehouse-id>"
}'
Updating Notebooks
Update Display Name
fab set "Production.Workspace/ETL.Notebook" -q displayName -i "ETL Pipeline v2"
Update Description
fab set "Production.Workspace/ETL.Notebook" -q description -i "Daily ETL pipeline for sales data ingestion and transformation"
Deleting Notebooks
# Delete with confirmation (interactive)
fab rm "Dev.Workspace/Old Notebook.Notebook"
# Force delete without confirmation
fab rm "Dev.Workspace/Old Notebook.Notebook" -f
Advanced Workflows
Parameterized Notebook Execution
# Create parametrized notebook with cell tagged as "parameters"
# In notebook, create cell:
date = "2024-01-01" # default
batch_size = 1000 # default
debug = False # default
# Execute with different parameters
fab job run "Production.Workspace/Parameterized.Notebook" -P \
date:string=2024-02-15,\
batch_size:int=2000,\
debug:bool=true
Notebook Orchestration Pipeline
#!/bin/bash
set -e  # exit immediately if any step returns a non-zero exit code
WORKSPACE="Production.Workspace"
DATE=$(date +%Y-%m-%d)
# 1. Run ingestion notebook
echo "Starting data ingestion..."
fab job run "$WORKSPACE/1_Ingest_Data.Notebook" -P date:string=$DATE
# 2. Run transformation notebook
echo "Running transformations..."
fab job run "$WORKSPACE/2_Transform_Data.Notebook" -P date:string=$DATE
# 3. Run analytics notebook
echo "Generating analytics..."
fab job run "$WORKSPACE/3_Analytics.Notebook" -P date:string=$DATE
# 4. Run reporting notebook
echo "Creating reports..."
fab job run "$WORKSPACE/4_Reports.Notebook" -P date:string=$DATE
echo "Pipeline completed for $DATE"
Monitoring Long-Running Notebook
#!/bin/bash
NOTEBOOK="Production.Workspace/Long Process.Notebook"
# Start job
JOB_ID=$(fab job start "$NOTEBOOK" -P date:string=$(date +%Y-%m-%d) | \
grep -o '"id": "[^"]*"' | head -1 | cut -d'"' -f4)
echo "Started job: $JOB_ID"
# Poll status every 30 seconds
while true; do
STATUS=$(fab job run-status "$NOTEBOOK" --id "$JOB_ID" | \
grep -o '"status": "[^"]*"' | cut -d'"' -f4)
echo "[$(date +%H:%M:%S)] Status: $STATUS"
if [[ "$STATUS" == "Completed" ]] || [[ "$STATUS" == "Failed" ]]; then
break
fi
sleep 30
done
if [[ "$STATUS" == "Completed" ]]; then
echo "Job completed successfully"
exit 0
else
echo "Job failed"
exit 1
fi
Conditional Notebook Execution
#!/bin/bash
WORKSPACE="Production.Workspace"
# Check if data is ready
DATA_READY=$(fab api "workspaces/<ws-id>/lakehouses/<lh-id>/Files/ready.flag" 2>&1 | grep -c "200")
if [ "$DATA_READY" -eq 1 ]; then
echo "Data ready, running notebook..."
fab job run "$WORKSPACE/Process Data.Notebook" -P date:string=$(date +%Y-%m-%d)
else
echo "Data not ready, skipping execution"
fi
Notebook Definition Structure
Notebook definition contains:
NotebookName.Notebook/
├── .platform             # Git integration metadata
├── notebook-content.py   # Python code (or .ipynb format)
└── metadata.json         # Notebook metadata
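To see which parts a particular notebook definition actually contains, the same -q query style can list just the part paths. A small sketch (the exact output shape depends on the CLI's JSON rendering):
fab get "Production.Workspace/ETL.Notebook" -q "definition.parts[].path"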
Query Notebook Content
NOTEBOOK="Production.Workspace/ETL.Notebook"
# Get Python code content
fab get "$NOTEBOOK" -q "definition.parts[?path=='notebook-content.py'].payload | [0]" | base64 -d
# Get metadata
fab get "$NOTEBOOK" -q "definition.parts[?path=='metadata.json'].payload | [0]" | base64 -d | jq .
Troubleshooting
Notebook Execution Failures
# Check recent execution
fab job run-list "Production.Workspace/ETL.Notebook" | head -n 5
# Get detailed error
fab job run-status "Production.Workspace/ETL.Notebook" --id <job-id> -q "error"
# Common issues:
# - Lakehouse not attached
# - Invalid parameters
# - Spark configuration errors
# - Missing dependencies
Parameter Type Mismatches
# Parameters must match expected types
# ❌ Wrong: -P count:string=100 (should be int)
# ✅ Right: -P count:int=100
# Check notebook definition for parameter types
fab get "Production.Workspace/ETL.Notebook" -q "definition.parts[?path=='notebook-content.py']"
Lakehouse Access Issues
# Verify lakehouse exists and is accessible
fab exists "Production.Workspace/MainLakehouse.Lakehouse"
# Check notebook's lakehouse configuration
fab get "Production.Workspace/ETL.Notebook" -q "properties.lakehouse"
# Re-attach lakehouse
fab set "Production.Workspace/ETL.Notebook" -q lakehouse -i '{
"known_lakehouses": [{"id": "<lakehouse-id>"}],
"default_lakehouse": "<lakehouse-id>",
"default_lakehouse_name": "MainLakehouse",
"default_lakehouse_workspace_id": "<workspace-id>"
}'
Performance Tips
- Use workspace pools: Faster startup than starter pool
- Cache data in lakehouses: Avoid re-fetching data
- Parameterize notebooks: Reuse logic with different inputs
- Monitor execution time: Set appropriate timeouts
- Use async execution: Don't block on long-running notebooks
- Optimize Spark config: Tune for specific workloads
Best Practices
- Tag parameter cells: Use "parameters" tag for injected params
- Handle failures gracefully: Add error handling and logging
- Version control notebooks: Export and commit to Git (see the sketch after this list)
- Use descriptive names: Clear naming for scheduled jobs
- Document parameters: Add comments explaining expected inputs
- Test locally first: Validate in development workspace
- Monitor schedules: Review execution history regularly
- Clean up old notebooks: Remove unused notebooks
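A minimal version-control sketch for the "Export and commit to Git" practice above, assuming an existing local Git repository at ./notebooks-repo (a hypothetical path):
fab export "Production.Workspace/ETL Pipeline.Notebook" -o ./notebooks-repo
cd ./notebooks-repo
git add .
git commit -m "Snapshot ETL Pipeline notebook $(date +%Y-%m-%d)"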
Security Considerations
- Credential management: Use Key Vault for secrets
- Workspace permissions: Control who can execute notebooks
- Parameter validation: Sanitize inputs in notebook code
- Data access: Respect lakehouse/warehouse permissions
- Logging: Don't log sensitive information
Related Scripts
- scripts/run_notebook_pipeline.py - Orchestrate multiple notebooks
- scripts/monitor_notebook.py - Monitor long-running executions
- scripts/export_notebook.py - Export with validation
- scripts/schedule_notebook.py - Simplified scheduling interface