
Notebook Operations

Comprehensive guide for working with Fabric notebooks using the Fabric CLI.

Overview

Fabric notebooks are interactive documents for data engineering, data science, and analytics. They can be executed, scheduled, and managed via the CLI.
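
A minimal sketch of the operations covered in detail below: inspecting, running, and scheduling a notebook (all commands appear again later in this guide).

# Inspect notebook properties
fab get "Production.Workspace/ETL Pipeline.Notebook"

# Run it on demand
fab job run "Production.Workspace/ETL Pipeline.Notebook"

# Schedule it to run daily at 02:00
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" --type daily --interval 02:00 --start 2024-11-15T00:00:00 --enable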

Getting Notebook Information

Basic Notebook Info

# Check if notebook exists
fab exists "Production.Workspace/ETL Pipeline.Notebook"

# Get notebook properties
fab get "Production.Workspace/ETL Pipeline.Notebook"

# Get with verbose details
fab get "Production.Workspace/ETL Pipeline.Notebook" -v

# Get only notebook ID
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "id"

Get Notebook Definition

# Get full notebook definition
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition"

# Save definition to file
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition" -o /tmp/notebook-def.json

# Get notebook content (cells)
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition.parts[?path=='notebook-content.py'].payload | [0]"

Exporting Notebooks

Export as IPYNB

# Export notebook
fab export "Production.Workspace/ETL Pipeline.Notebook" -o /tmp/notebooks

# This creates:
# /tmp/notebooks/ETL Pipeline.Notebook/
# ├── notebook-content.py (or .ipynb)
# └── metadata files

Export All Notebooks from Workspace

# Export all notebooks (iterate line by line so display names containing spaces survive;
# this assumes the queried names come back as a JSON array parseable by jq)
WS_ID=$(fab get "Production.Workspace" -q "id")

fab api "workspaces/$WS_ID/items" -q "value[?type=='Notebook'].displayName" | \
  jq -r '.[]' | while IFS= read -r NOTEBOOK; do
    fab export "Production.Workspace/$NOTEBOOK.Notebook" -o /tmp/notebooks
done

Importing Notebooks

Import from Local

# Import notebook from .ipynb format (default)
fab import "Production.Workspace/New ETL.Notebook" -i /tmp/notebooks/ETL\ Pipeline.Notebook

# Import from .py format
fab import "Production.Workspace/Script.Notebook" -i /tmp/script.py --format py

Copy Between Workspaces

# Copy notebook
fab cp "Dev.Workspace/ETL.Notebook" "Production.Workspace"

# Copy with new name
fab cp "Dev.Workspace/ETL.Notebook" "Production.Workspace/Prod ETL.Notebook"

Creating Notebooks

Create Blank Notebook

# Get workspace ID first
fab get "Production.Workspace" -q "id"

# Create via API
fab api -X post "workspaces/<workspace-id>/notebooks" -i '{"displayName": "New Data Processing"}'

Create and Configure Query Notebook

Use this workflow to create a notebook for querying lakehouse tables with Spark SQL.

Step 1: Create the notebook

# Get workspace ID
fab get "Sales.Workspace" -q "id"
# Returns: 4caf7825-81ac-4c94-9e46-306b4c20a4d5

# Create notebook
fab api -X post "workspaces/4caf7825-81ac-4c94-9e46-306b4c20a4d5/notebooks" -i '{"displayName": "Data Query"}'
# Returns notebook ID: 97bbd18d-c293-46b8-8536-82fb8bc9bd58
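
When scripting this step, the returned notebook ID can be captured with the same grep pattern used in the asynchronous-run examples later in this guide (a sketch; it assumes the API response is printed as JSON):

WS_ID=$(fab get "Sales.Workspace" -q "id")
NB_ID=$(fab api -X post "workspaces/$WS_ID/notebooks" -i '{"displayName": "Data Query"}' | \
  grep -o '"id": "[^"]*"' | head -1 | cut -d'"' -f4)
echo "Created notebook: $NB_ID"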

Step 2: Get lakehouse ID (required for notebook metadata)

fab get "Sales.Workspace/SalesLH.Lakehouse" -q "id"
# Returns: ddbcc575-805b-4922-84db-ca451b318755

Step 3: Create notebook code in Fabric format

cat > /tmp/notebook.py <<'EOF'
# Fabric notebook source

# METADATA ********************

# META {
# META   "kernel_info": {
# META     "name": "synapse_pyspark"
# META   },
# META   "dependencies": {
# META     "lakehouse": {
# META       "default_lakehouse": "ddbcc575-805b-4922-84db-ca451b318755",
# META       "default_lakehouse_name": "SalesLH",
# META       "default_lakehouse_workspace_id": "4caf7825-81ac-4c94-9e46-306b4c20a4d5"
# META     }
# META   }
# META }

# CELL ********************

# Query lakehouse table
df = spark.sql("""
    SELECT
        date_key,
        COUNT(*) as num_records
    FROM gold.sets
    GROUP BY date_key
    ORDER BY date_key DESC
    LIMIT 10
""")

# IMPORTANT: Convert to pandas and print to capture output
# display(df) will NOT show results via API
pandas_df = df.toPandas()
print(pandas_df)
print(f"\nLatest date: {pandas_df.iloc[0]['date_key']}")
EOF

Step 4: Base64 encode and create update definition

# Base64-encode the notebook source
# (on Linux/GNU coreutils use `base64 -w 0 /tmp/notebook.py` so the output is not
#  wrapped at 76 characters, which would break the JSON payload below)
base64 -i /tmp/notebook.py > /tmp/notebook-b64.txt

cat > /tmp/update.json <<EOF
{
  "definition": {
    "parts": [
      {
        "path": "notebook-content.py",
        "payload": "$(cat /tmp/notebook-b64.txt)",
        "payloadType": "InlineBase64"
      }
    ]
  }
}
EOF
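
If you prefer jq (used elsewhere in this guide) to a heredoc, the same update payload can be built in one step; this also sidesteps base64 line wrapping, since the newlines are stripped and jq handles the JSON escaping:

jq -n --arg payload "$(base64 -i /tmp/notebook.py | tr -d '\n')" \
  '{definition: {parts: [{path: "notebook-content.py", payload: $payload, payloadType: "InlineBase64"}]}}' \
  > /tmp/update.json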

Step 5: Update notebook with code

fab api -X post "workspaces/4caf7825-81ac-4c94-9e46-306b4c20a4d5/notebooks/97bbd18d-c293-46b8-8536-82fb8bc9bd58/updateDefinition" -i /tmp/update.json --show_headers
# Returns operation ID in Location header

Step 6: Check update completed

fab api "operations/<operation-id>"
# Wait for status: "Succeeded"
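
To wait on the operation from a script, a minimal polling sketch (it reuses the grep pattern this guide uses for job status, and assumes the operation response is printed as JSON with "Succeeded" or "Failed" as terminal states):

OP_ID="<operation-id>"   # from the Location header in the previous step
while true; do
  STATUS=$(fab api "operations/$OP_ID" | grep -o '"status": "[^"]*"' | cut -d'"' -f4)
  echo "Operation status: $STATUS"
  if [[ "$STATUS" == "Succeeded" ]] || [[ "$STATUS" == "Failed" ]]; then
    break
  fi
  sleep 5
done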

Step 7: Run the notebook

fab job start "Sales.Workspace/Data Query.Notebook"
# Returns job instance ID

Step 8: Check execution status

fab job run-status "Sales.Workspace/Data Query.Notebook" --id <job-id>
# Wait for status: "Completed"

Step 9: Get results (download from Fabric UI)

  • Open notebook in Fabric UI after execution
  • Print output will be visible in cell outputs
  • Download .ipynb file to see printed results locally

Critical Requirements

  1. File format: Must be notebook-content.py (NOT .ipynb)
  2. Lakehouse ID: Must include default_lakehouse ID in metadata (not just name)
  3. Spark session: Will be automatically available when lakehouse is attached
  4. Capturing output: Use df.toPandas() and print() - display() won't show in API
  5. Results location: Print output visible in UI and downloaded .ipynb, NOT in definition

Common Issues

  • NameError: name 'spark' is not defined - Lakehouse not attached (missing default_lakehouse ID)
  • Job "Completed" but no results - Used display() instead of print()
  • Update fails - Used .ipynb path instead of .py

Create from Template

# Export template
fab export "Templates.Workspace/Template Notebook.Notebook" -o /tmp/templates

# Import as new notebook
fab import "Production.Workspace/Custom Notebook.Notebook" -i /tmp/templates/Template\ Notebook.Notebook

Running Notebooks

Run Synchronously (Wait for Completion)

# Run notebook and wait
fab job run "Production.Workspace/ETL Pipeline.Notebook"

# Run with timeout (seconds)
fab job run "Production.Workspace/Long Process.Notebook" --timeout 600

Run with Parameters

# Run with basic parameters
fab job run "Production.Workspace/ETL Pipeline.Notebook" -P \
  date:string=2024-01-01,\
  batch_size:int=1000,\
  debug_mode:bool=false,\
  threshold:float=0.95

# Parameters must match types defined in notebook
# Supported types: string, int, float, bool

Run with Spark Configuration

# Run with custom Spark settings
fab job run "Production.Workspace/Big Data Processing.Notebook" -C '{
  "conf": {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.dynamicAllocation.enabled": "true"
  },
  "environment": {
    "id": "<environment-id>",
    "name": "Production Environment"
  }
}'

# Run with default lakehouse
fab job run "Production.Workspace/Data Ingestion.Notebook" -C '{
  "defaultLakehouse": {
    "name": "MainLakehouse",
    "id": "<lakehouse-id>",
    "workspaceId": "<workspace-id>"
  }
}'

# Run with workspace Spark pool
fab job run "Production.Workspace/Analytics.Notebook" -C '{
  "useStarterPool": false,
  "useWorkspacePool": "HighMemoryPool"
}'

Run with Combined Parameters and Configuration

# Combine parameters and configuration
fab job run "Production.Workspace/ETL Pipeline.Notebook" \
  -P date:string=2024-01-01,batch:int=500 \
  -C '{
    "defaultLakehouse": {"name": "StagingLH", "id": "<lakehouse-id>"},
    "conf": {"spark.sql.shuffle.partitions": "200"}
  }'

Run Asynchronously

# Start notebook and return immediately
JOB_ID=$(fab job start "Production.Workspace/ETL Pipeline.Notebook" | grep -o '"id": "[^"]*"' | cut -d'"' -f4)

# Check status later
fab job run-status "Production.Workspace/ETL Pipeline.Notebook" --id "$JOB_ID"

Monitoring Notebook Executions

Get Job Status

# Check specific job
fab job run-status "Production.Workspace/ETL Pipeline.Notebook" --id <job-id>

# Get detailed status via API
WS_ID=$(fab get "Production.Workspace" -q "id")
NOTEBOOK_ID=$(fab get "Production.Workspace/ETL Pipeline.Notebook" -q "id")
fab api "workspaces/$WS_ID/items/$NOTEBOOK_ID/jobs/instances/<job-id>"

List Execution History

# List all job runs
fab job run-list "Production.Workspace/ETL Pipeline.Notebook"

# List only scheduled runs
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" --schedule

# Get latest run status
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" | head -n 1

Cancel Running Job

fab job run-cancel "Production.Workspace/ETL Pipeline.Notebook" --id <job-id>

Scheduling Notebooks

Create Cron Schedule

# Run every 30 minutes
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" \
  --type cron \
  --interval 30 \
  --start 2024-11-15T00:00:00 \
  --end 2025-12-31T23:59:00 \
  --enable

Create Daily Schedule

# Run daily at 2 AM and 2 PM
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" \
  --type daily \
  --interval 02:00,14:00 \
  --start 2024-11-15T00:00:00 \
  --end 2025-12-31T23:59:00 \
  --enable

Create Weekly Schedule

# Run Monday and Friday at 9 AM
fab job run-sch "Production.Workspace/Weekly Report.Notebook" \
  --type weekly \
  --interval 09:00 \
  --days Monday,Friday \
  --start 2024-11-15T00:00:00 \
  --enable

Update Schedule

# Modify existing schedule
fab job run-update "Production.Workspace/ETL Pipeline.Notebook" \
  --id <schedule-id> \
  --type daily \
  --interval 03:00 \
  --enable

# Disable schedule
fab job run-update "Production.Workspace/ETL Pipeline.Notebook" \
  --id <schedule-id> \
  --disable

Notebook Configuration

Set Default Lakehouse

# Via notebook properties
fab set "Production.Workspace/ETL.Notebook" -q lakehouse -i '{
  "known_lakehouses": [{"id": "<lakehouse-id>"}],
  "default_lakehouse": "<lakehouse-id>",
  "default_lakehouse_name": "MainLakehouse",
  "default_lakehouse_workspace_id": "<workspace-id>"
}'

Set Default Environment

fab set "Production.Workspace/ETL.Notebook" -q environment -i '{
  "environmentId": "<environment-id>",
  "workspaceId": "<workspace-id>"
}'

Set Default Warehouse

fab set "Production.Workspace/Analytics.Notebook" -q warehouse -i '{
  "known_warehouses": [{"id": "<warehouse-id>", "type": "Datawarehouse"}],
  "default_warehouse": "<warehouse-id>"
}'

Updating Notebooks

Update Display Name

fab set "Production.Workspace/ETL.Notebook" -q displayName -i "ETL Pipeline v2"

Update Description

fab set "Production.Workspace/ETL.Notebook" -q description -i "Daily ETL pipeline for sales data ingestion and transformation"

Deleting Notebooks

# Delete with confirmation (interactive)
fab rm "Dev.Workspace/Old Notebook.Notebook"

# Force delete without confirmation
fab rm "Dev.Workspace/Old Notebook.Notebook" -f

Advanced Workflows

Parameterized Notebook Execution

# Create a parameterized notebook with a cell tagged as "parameters"
# In notebook, create cell:
date = "2024-01-01"  # default
batch_size = 1000    # default
debug = False        # default

# Execute with different parameters
fab job run "Production.Workspace/Parameterized.Notebook" -P \
  date:string=2024-02-15,\
  batch_size:int=2000,\
  debug:bool=true

Notebook Orchestration Pipeline

#!/bin/bash

WORKSPACE="Production.Workspace"
DATE=$(date +%Y-%m-%d)

# 1. Run ingestion notebook
echo "Starting data ingestion..."
fab job run "$WORKSPACE/1_Ingest_Data.Notebook" -P date:string=$DATE

# 2. Run transformation notebook
echo "Running transformations..."
fab job run "$WORKSPACE/2_Transform_Data.Notebook" -P date:string=$DATE

# 3. Run analytics notebook
echo "Generating analytics..."
fab job run "$WORKSPACE/3_Analytics.Notebook" -P date:string=$DATE

# 4. Run reporting notebook
echo "Creating reports..."
fab job run "$WORKSPACE/4_Reports.Notebook" -P date:string=$DATE

echo "Pipeline completed for $DATE"

Monitoring Long-Running Notebook

#!/bin/bash

NOTEBOOK="Production.Workspace/Long Process.Notebook"

# Start job
JOB_ID=$(fab job start "$NOTEBOOK" -P date:string=$(date +%Y-%m-%d) | \
  grep -o '"id": "[^"]*"' | head -1 | cut -d'"' -f4)

echo "Started job: $JOB_ID"

# Poll status every 30 seconds
while true; do
  STATUS=$(fab job run-status "$NOTEBOOK" --id "$JOB_ID" | \
    grep -o '"status": "[^"]*"' | cut -d'"' -f4)

  echo "[$(date +%H:%M:%S)] Status: $STATUS"

  if [[ "$STATUS" == "Completed" ]] || [[ "$STATUS" == "Failed" ]]; then
    break
  fi

  sleep 30
done

if [[ "$STATUS" == "Completed" ]]; then
  echo "Job completed successfully"
  exit 0
else
  echo "Job failed"
  exit 1
fi

Conditional Notebook Execution

#!/bin/bash

WORKSPACE="Production.Workspace"

# Check if data is ready
DATA_READY=$(fab api "workspaces/<ws-id>/lakehouses/<lh-id>/Files/ready.flag" 2>&1 | grep -c "200")

if [ "$DATA_READY" -eq 1 ]; then
  echo "Data ready, running notebook..."
  fab job run "$WORKSPACE/Process Data.Notebook" -P date:string=$(date +%Y-%m-%d)
else
  echo "Data not ready, skipping execution"
fi

Notebook Definition Structure

Notebook definition contains:

NotebookName.Notebook/
├── .platform             # Git integration metadata
├── notebook-content.py   # Python code (or .ipynb format)
└── metadata.json         # Notebook metadata

Query Notebook Content

NOTEBOOK="Production.Workspace/ETL.Notebook"

# Get Python code content
fab get "$NOTEBOOK" -q "definition.parts[?path=='notebook-content.py'].payload | [0]" | base64 -d

# Get metadata
fab get "$NOTEBOOK" -q "definition.parts[?path=='metadata.json'].payload | [0]" | base64 -d | jq .

Troubleshooting

Notebook Execution Failures

# Check recent execution
fab job run-list "Production.Workspace/ETL.Notebook" | head -n 5

# Get detailed error
fab job run-status "Production.Workspace/ETL.Notebook" --id <job-id> -q "error"

# Common issues:
# - Lakehouse not attached
# - Invalid parameters
# - Spark configuration errors
# - Missing dependencies

Parameter Type Mismatches

# Parameters must match expected types
# ❌ Wrong: -P count:string=100    (should be int)
# ✅ Right: -P count:int=100

# Check notebook definition for parameter types
fab get "Production.Workspace/ETL.Notebook" -q "definition.parts[?path=='notebook-content.py']"

Lakehouse Access Issues

# Verify lakehouse exists and is accessible
fab exists "Production.Workspace/MainLakehouse.Lakehouse"

# Check notebook's lakehouse configuration
fab get "Production.Workspace/ETL.Notebook" -q "properties.lakehouse"

# Re-attach lakehouse
fab set "Production.Workspace/ETL.Notebook" -q lakehouse -i '{
  "known_lakehouses": [{"id": "<lakehouse-id>"}],
  "default_lakehouse": "<lakehouse-id>",
  "default_lakehouse_name": "MainLakehouse",
  "default_lakehouse_workspace_id": "<workspace-id>"
}'

Performance Tips

  1. Use workspace pools: Faster startup than starter pool
  2. Cache data in lakehouses: Avoid re-fetching data
  3. Parameterize notebooks: Reuse logic with different inputs
  4. Monitor execution time: Set appropriate timeouts
  5. Use async execution: Don't block on long-running notebooks
  6. Optimize Spark config: Tune for specific workloads

Best Practices

  1. Tag parameter cells: Use "parameters" tag for injected params
  2. Handle failures gracefully: Add error handling and logging
  3. Version control notebooks: Export and commit to Git (see the sketch after this list)
  4. Use descriptive names: Clear naming for scheduled jobs
  5. Document parameters: Add comments explaining expected inputs
  6. Test locally first: Validate in development workspace
  7. Monitor schedules: Review execution history regularly
  8. Clean up old notebooks: Remove unused notebooks
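
For item 3, a minimal snapshot sketch, assuming an already-cloned Git repository at a hypothetical REPO_DIR:

#!/bin/bash
REPO_DIR=/path/to/notebook-repo   # hypothetical local clone

fab export "Production.Workspace/ETL Pipeline.Notebook" -o "$REPO_DIR/notebooks"

cd "$REPO_DIR"
git add notebooks
git commit -m "Notebook snapshot $(date +%Y-%m-%d)"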

Security Considerations

  1. Credential management: Use Key Vault for secrets
  2. Workspace permissions: Control who can execute notebooks
  3. Parameter validation: Sanitize inputs in notebook code
  4. Data access: Respect lakehouse/warehouse permissions
  5. Logging: Don't log sensitive information

Related Scripts

  • scripts/run_notebook_pipeline.py - Orchestrate multiple notebooks
  • scripts/monitor_notebook.py - Monitor long-running executions
  • scripts/export_notebook.py - Export with validation
  • scripts/schedule_notebook.py - Simplified scheduling interface