# Notebook Operations
Comprehensive guide for working with Fabric notebooks using the Fabric CLI.
## Overview
Fabric notebooks are interactive documents for data engineering, data science, and analytics. They can be executed, scheduled, and managed via the CLI.
## Getting Notebook Information
### Basic Notebook Info
```bash
# Check if notebook exists
fab exists "Production.Workspace/ETL Pipeline.Notebook"
# Get notebook properties
fab get "Production.Workspace/ETL Pipeline.Notebook"
# Get with verbose details
fab get "Production.Workspace/ETL Pipeline.Notebook" -v
# Get only notebook ID
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "id"
```
### Get Notebook Definition
```bash
# Get full notebook definition
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition"
# Save definition to file
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition" -o /tmp/notebook-def.json
# Get notebook content (cells)
fab get "Production.Workspace/ETL Pipeline.Notebook" -q "definition.parts[?path=='notebook-content.py'].payload | [0]"
```
## Exporting Notebooks
### Export as IPYNB
```bash
# Export notebook
fab export "Production.Workspace/ETL Pipeline.Notebook" -o /tmp/notebooks
# This creates:
# /tmp/notebooks/ETL Pipeline.Notebook/
# ├── notebook-content.py (or .ipynb)
# └── metadata files
```
### Export All Notebooks from Workspace
```bash
# Export all notebooks
WS_ID=$(fab get "Production.Workspace" -q "id")
# The -q query returns the notebook names (assumed here to be a JSON array);
# parse with jq and read line by line so names containing spaces are preserved
fab api "workspaces/$WS_ID/items" -q "value[?type=='Notebook'].displayName" | jq -r '.[]' | \
while IFS= read -r NOTEBOOK; do
  fab export "Production.Workspace/$NOTEBOOK.Notebook" -o /tmp/notebooks
done
```
## Importing Notebooks
### Import from Local
```bash
# Import a notebook from a previously exported folder (.ipynb format is the default)
fab import "Production.Workspace/New ETL.Notebook" -i /tmp/notebooks/ETL\ Pipeline.Notebook
# Import from .py format
fab import "Production.Workspace/Script.Notebook" -i /tmp/script.py --format py
```
### Copy Between Workspaces
```bash
# Copy notebook
fab cp "Dev.Workspace/ETL.Notebook" "Production.Workspace"
# Copy with new name
fab cp "Dev.Workspace/ETL.Notebook" "Production.Workspace/Prod ETL.Notebook"
```
## Creating Notebooks
### Create Blank Notebook
```bash
# Get workspace ID first
fab get "Production.Workspace" -q "id"
# Create via API
fab api -X post "workspaces/<workspace-id>/notebooks" -i '{"displayName": "New Data Processing"}'
```
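If you prefer not to copy the workspace ID by hand, a minimal sketch that captures it into a shell variable and creates the notebook in one pass (the display name is illustrative):
```bash
# Capture the workspace ID, then create the notebook without hardcoding the GUID
WS_ID=$(fab get "Production.Workspace" -q "id")
fab api -X post "workspaces/$WS_ID/notebooks" -i '{"displayName": "New Data Processing"}'
```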
### Create and Configure Query Notebook
Use this workflow to create a notebook for querying lakehouse tables with Spark SQL.
#### Step 1: Create the notebook
```bash
# Get workspace ID
fab get "Sales.Workspace" -q "id"
# Returns: 4caf7825-81ac-4c94-9e46-306b4c20a4d5
# Create notebook
fab api -X post "workspaces/4caf7825-81ac-4c94-9e46-306b4c20a4d5/notebooks" -i '{"displayName": "Data Query"}'
# Returns notebook ID: 97bbd18d-c293-46b8-8536-82fb8bc9bd58
```
#### Step 2: Get lakehouse ID (required for notebook metadata)
```bash
fab get "Sales.Workspace/SalesLH.Lakehouse" -q "id"
# Returns: ddbcc575-805b-4922-84db-ca451b318755
```
#### Step 3: Create notebook code in Fabric format
```bash
cat > /tmp/notebook.py <<'EOF'
# Fabric notebook source
# METADATA ********************
# META {
# META "kernel_info": {
# META "name": "synapse_pyspark"
# META },
# META "dependencies": {
# META "lakehouse": {
# META "default_lakehouse": "ddbcc575-805b-4922-84db-ca451b318755",
# META "default_lakehouse_name": "SalesLH",
# META "default_lakehouse_workspace_id": "4caf7825-81ac-4c94-9e46-306b4c20a4d5"
# META }
# META }
# META }
# CELL ********************
# Query lakehouse table
df = spark.sql("""
SELECT
date_key,
COUNT(*) as num_records
FROM gold.sets
GROUP BY date_key
ORDER BY date_key DESC
LIMIT 10
""")
# IMPORTANT: Convert to pandas and print to capture output
# display(df) will NOT show results via API
pandas_df = df.toPandas()
print(pandas_df)
print(f"\nLatest date: {pandas_df.iloc[0]['date_key']}")
EOF
```
#### Step 4: Base64 encode and create update definition
```bash
# Encode without line breaks so the payload stays on a single line in the JSON below
# (GNU coreutils shown; on macOS use: base64 -i /tmp/notebook.py)
base64 -w 0 /tmp/notebook.py > /tmp/notebook-b64.txt
cat > /tmp/update.json <<EOF
{
  "definition": {
    "parts": [
      {
        "path": "notebook-content.py",
        "payload": "$(cat /tmp/notebook-b64.txt)",
        "payloadType": "InlineBase64"
      }
    ]
  }
}
EOF
```
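If `jq` is available, the same payload can be built in one step without the intermediate base64 file; a sketch, assuming the file paths used above:
```bash
# jq --arg safely embeds the base64 string into the JSON body
# (-w 0 is GNU coreutils; on macOS use: base64 -i /tmp/notebook.py)
jq -n --arg payload "$(base64 -w 0 /tmp/notebook.py)" \
  '{definition: {parts: [{path: "notebook-content.py", payload: $payload, payloadType: "InlineBase64"}]}}' \
  > /tmp/update.json
```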
#### Step 5: Update notebook with code
```bash
fab api -X post "workspaces/4caf7825-81ac-4c94-9e46-306b4c20a4d5/notebooks/97bbd18d-c293-46b8-8536-82fb8bc9bd58/updateDefinition" -i /tmp/update.json --show_headers
# Returns operation ID in Location header
```
#### Step 6: Check update completed
```bash
fab api "operations/<operation-id>"
# Wait for status: "Succeeded"
```
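To avoid re-running the check by hand, a small polling sketch; it assumes you copied the operation ID from the Location header into `OP_ID` and that the response is JSON with a top-level `status` field, as in the monitoring examples later in this guide:
```bash
OP_ID="<operation-id>"   # from the Location header returned in Step 5
# Poll every 10 seconds until the update operation finishes
while true; do
  STATUS=$(fab api "operations/$OP_ID" | grep -o '"status": "[^"]*"' | cut -d'"' -f4)
  echo "Operation status: $STATUS"
  [[ "$STATUS" == "Succeeded" || "$STATUS" == "Failed" ]] && break
  sleep 10
done
```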
#### Step 7: Run the notebook
```bash
fab job start "Sales.Workspace/Data Query.Notebook"
# Returns job instance ID
```
#### Step 8: Check execution status
```bash
fab job run-status "Sales.Workspace/Data Query.Notebook" --id <job-id>
# Wait for status: "Completed"
```
#### Step 9: Get results (download from Fabric UI)
- Open notebook in Fabric UI after execution
- Print output will be visible in cell outputs
- Download .ipynb file to see printed results locally
#### Critical Requirements
1. **File format**: Must be `notebook-content.py` (NOT `.ipynb`)
2. **Lakehouse ID**: Must include `default_lakehouse` ID in metadata (not just name)
3. **Spark session**: Will be automatically available when lakehouse is attached
4. **Capturing output**: Use `df.toPandas()` and `print()` - `display()` won't show in API
5. **Results location**: Print output visible in UI and downloaded .ipynb, NOT in definition
#### Common Issues
- `NameError: name 'spark' is not defined` - Lakehouse not attached (missing default_lakehouse ID)
- Job "Completed" but no results - Used display() instead of print()
- Update fails - Used .ipynb path instead of .py
### Create from Template
```bash
# Export template
fab export "Templates.Workspace/Template Notebook.Notebook" -o /tmp/templates
# Import as new notebook
fab import "Production.Workspace/Custom Notebook.Notebook" -i /tmp/templates/Template\ Notebook.Notebook
```
## Running Notebooks
### Run Synchronously (Wait for Completion)
```bash
# Run notebook and wait
fab job run "Production.Workspace/ETL Pipeline.Notebook"
# Run with timeout (seconds)
fab job run "Production.Workspace/Long Process.Notebook" --timeout 600
```
### Run with Parameters
```bash
# Run with basic parameters
fab job run "Production.Workspace/ETL Pipeline.Notebook" -P \
date:string=2024-01-01,\
batch_size:int=1000,\
debug_mode:bool=false,\
threshold:float=0.95
# Parameters must match types defined in notebook
# Supported types: string, int, float, bool
```
### Run with Spark Configuration
```bash
# Run with custom Spark settings
fab job run "Production.Workspace/Big Data Processing.Notebook" -C '{
  "conf": {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.dynamicAllocation.enabled": "true"
  },
  "environment": {
    "id": "<environment-id>",
    "name": "Production Environment"
  }
}'
# Run with default lakehouse
fab job run "Production.Workspace/Data Ingestion.Notebook" -C '{
  "defaultLakehouse": {
    "name": "MainLakehouse",
    "id": "<lakehouse-id>",
    "workspaceId": "<workspace-id>"
  }
}'
# Run with workspace Spark pool
fab job run "Production.Workspace/Analytics.Notebook" -C '{
  "useStarterPool": false,
  "useWorkspacePool": "HighMemoryPool"
}'
```
### Run with Combined Parameters and Configuration
```bash
# Combine parameters and configuration
fab job run "Production.Workspace/ETL Pipeline.Notebook" \
-P date:string=2024-01-01,batch:int=500 \
-C '{
"defaultLakehouse": {"name": "StagingLH", "id": "<lakehouse-id>"},
"conf": {"spark.sql.shuffle.partitions": "200"}
}'
```
### Run Asynchronously
```bash
# Start notebook and return immediately
JOB_ID=$(fab job start "Production.Workspace/ETL Pipeline.Notebook" | grep -o '"id": "[^"]*"' | cut -d'"' -f4)
# Check status later
fab job run-status "Production.Workspace/ETL Pipeline.Notebook" --id "$JOB_ID"
```
## Monitoring Notebook Executions
### Get Job Status
```bash
# Check specific job
fab job run-status "Production.Workspace/ETL Pipeline.Notebook" --id <job-id>
# Get detailed status via API
WS_ID=$(fab get "Production.Workspace" -q "id")
NOTEBOOK_ID=$(fab get "Production.Workspace/ETL Pipeline.Notebook" -q "id")
fab api "workspaces/$WS_ID/items/$NOTEBOOK_ID/jobs/instances/<job-id>"
```
### List Execution History
```bash
# List all job runs
fab job run-list "Production.Workspace/ETL Pipeline.Notebook"
# List only scheduled runs
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" --schedule
# Get latest run status
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" | head -n 1
```
### Cancel Running Job
```bash
fab job run-cancel "Production.Workspace/ETL Pipeline.Notebook" --id <job-id>
```
## Scheduling Notebooks
### Create Cron Schedule
```bash
# Run every 30 minutes
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" \
--type cron \
--interval 30 \
--start 2024-11-15T00:00:00 \
--end 2025-12-31T23:59:00 \
--enable
```
### Create Daily Schedule
```bash
# Run daily at 2 AM and 2 PM
fab job run-sch "Production.Workspace/ETL Pipeline.Notebook" \
--type daily \
--interval 02:00,14:00 \
--start 2024-11-15T00:00:00 \
--end 2025-12-31T23:59:00 \
--enable
```
### Create Weekly Schedule
```bash
# Run Monday and Friday at 9 AM
fab job run-sch "Production.Workspace/Weekly Report.Notebook" \
--type weekly \
--interval 09:00 \
--days Monday,Friday \
--start 2024-11-15T00:00:00 \
--enable
```
### Update Schedule
```bash
# Modify existing schedule
fab job run-update "Production.Workspace/ETL Pipeline.Notebook" \
--id <schedule-id> \
--type daily \
--interval 03:00 \
--enable
# Disable schedule
fab job run-update "Production.Workspace/ETL Pipeline.Notebook" \
--id <schedule-id> \
--disable
```
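To verify a schedule after creating or updating it, reuse the schedule listing shown in the monitoring section above:
```bash
# List the item's scheduled runs (same --schedule flag as in the monitoring section)
fab job run-list "Production.Workspace/ETL Pipeline.Notebook" --schedule
```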
## Notebook Configuration
### Set Default Lakehouse
```bash
# Via notebook properties
fab set "Production.Workspace/ETL.Notebook" -q lakehouse -i '{
"known_lakehouses": [{"id": "<lakehouse-id>"}],
"default_lakehouse": "<lakehouse-id>",
"default_lakehouse_name": "MainLakehouse",
"default_lakehouse_workspace_id": "<workspace-id>"
}'
```
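To confirm the attachment took effect, read the configuration back (the same query used in the troubleshooting section below):
```bash
# Inspect the notebook's current lakehouse configuration
fab get "Production.Workspace/ETL.Notebook" -q "properties.lakehouse"
```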
### Set Default Environment
```bash
fab set "Production.Workspace/ETL.Notebook" -q environment -i '{
"environmentId": "<environment-id>",
"workspaceId": "<workspace-id>"
}'
```
### Set Default Warehouse
```bash
fab set "Production.Workspace/Analytics.Notebook" -q warehouse -i '{
"known_warehouses": [{"id": "<warehouse-id>", "type": "Datawarehouse"}],
"default_warehouse": "<warehouse-id>"
}'
```
## Updating Notebooks
### Update Display Name
```bash
fab set "Production.Workspace/ETL.Notebook" -q displayName -i "ETL Pipeline v2"
```
### Update Description
```bash
fab set "Production.Workspace/ETL.Notebook" -q description -i "Daily ETL pipeline for sales data ingestion and transformation"
```
## Deleting Notebooks
```bash
# Delete with confirmation (interactive)
fab rm "Dev.Workspace/Old Notebook.Notebook"
# Force delete without confirmation
fab rm "Dev.Workspace/Old Notebook.Notebook" -f
```
## Advanced Workflows
### Parameterized Notebook Execution
```bash
# In the notebook, create a cell tagged "parameters" with default values, e.g.:
#   date = "2024-01-01"
#   batch_size = 1000
#   debug = False
# Then execute with different parameter values from the CLI
fab job run "Production.Workspace/Parameterized.Notebook" -P \
date:string=2024-02-15,\
batch_size:int=2000,\
debug:bool=true
```
### Notebook Orchestration Pipeline
```bash
#!/bin/bash
WORKSPACE="Production.Workspace"
DATE=$(date +%Y-%m-%d)
# 1. Run ingestion notebook
echo "Starting data ingestion..."
fab job run "$WORKSPACE/1_Ingest_Data.Notebook" -P date:string=$DATE
# 2. Run transformation notebook
echo "Running transformations..."
fab job run "$WORKSPACE/2_Transform_Data.Notebook" -P date:string=$DATE
# 3. Run analytics notebook
echo "Generating analytics..."
fab job run "$WORKSPACE/3_Analytics.Notebook" -P date:string=$DATE
# 4. Run reporting notebook
echo "Creating reports..."
fab job run "$WORKSPACE/4_Reports.Notebook" -P date:string=$DATE
echo "Pipeline completed for $DATE"
```
### Monitoring Long-Running Notebook
```bash
#!/bin/bash
NOTEBOOK="Production.Workspace/Long Process.Notebook"
# Start job
JOB_ID=$(fab job start "$NOTEBOOK" -P date:string=$(date +%Y-%m-%d) | \
grep -o '"id": "[^"]*"' | head -1 | cut -d'"' -f4)
echo "Started job: $JOB_ID"
# Poll status every 30 seconds
while true; do
  STATUS=$(fab job run-status "$NOTEBOOK" --id "$JOB_ID" | \
    grep -o '"status": "[^"]*"' | cut -d'"' -f4)
  echo "[$(date +%H:%M:%S)] Status: $STATUS"
  if [[ "$STATUS" == "Completed" ]] || [[ "$STATUS" == "Failed" ]]; then
    break
  fi
  sleep 30
done
if [[ "$STATUS" == "Completed" ]]; then
  echo "Job completed successfully"
  exit 0
else
  echo "Job failed"
  exit 1
fi
```
### Conditional Notebook Execution
```bash
#!/bin/bash
WORKSPACE="Production.Workspace"
# Check if data is ready
DATA_READY=$(fab api "workspaces/<ws-id>/lakehouses/<lh-id>/Files/ready.flag" 2>&1 | grep -c "200")
if [ "$DATA_READY" -eq 1 ]; then
echo "Data ready, running notebook..."
fab job run "$WORKSPACE/Process Data.Notebook" -P date:string=$(date +%Y-%m-%d)
else
echo "Data not ready, skipping execution"
fi
```
## Notebook Definition Structure
A notebook definition contains:
```
NotebookName.Notebook/
├── .platform             # Git integration metadata
├── notebook-content.py   # Python code (or .ipynb format)
└── metadata.json         # Notebook metadata
```
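To see which parts a particular notebook actually contains, query the part paths from its definition (a sketch using the same JMESPath style as the queries below):
```bash
# List the file paths inside the notebook definition
fab get "Production.Workspace/ETL.Notebook" -q "definition.parts[].path"
```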
### Query Notebook Content
```bash
NOTEBOOK="Production.Workspace/ETL.Notebook"
# Get Python code content
fab get "$NOTEBOOK" -q "definition.parts[?path=='notebook-content.py'].payload | [0]" | base64 -d
# Get metadata
fab get "$NOTEBOOK" -q "definition.parts[?path=='metadata.json'].payload | [0]" | base64 -d | jq .
```
## Troubleshooting
### Notebook Execution Failures
```bash
# Check recent execution
fab job run-list "Production.Workspace/ETL.Notebook" | head -n 5
# Get detailed error
fab job run-status "Production.Workspace/ETL.Notebook" --id <job-id> -q "error"
# Common issues:
# - Lakehouse not attached
# - Invalid parameters
# - Spark configuration errors
# - Missing dependencies
```
### Parameter Type Mismatches
```bash
# Parameters must match expected types
# ❌ Wrong: -P count:string=100 (should be int)
# ✅ Right: -P count:int=100
# Check notebook definition for parameter types
fab get "Production.Workspace/ETL.Notebook" -q "definition.parts[?path=='notebook-content.py']"
```
### Lakehouse Access Issues
```bash
# Verify lakehouse exists and is accessible
fab exists "Production.Workspace/MainLakehouse.Lakehouse"
# Check notebook's lakehouse configuration
fab get "Production.Workspace/ETL.Notebook" -q "properties.lakehouse"
# Re-attach lakehouse
fab set "Production.Workspace/ETL.Notebook" -q lakehouse -i '{
"known_lakehouses": [{"id": "<lakehouse-id>"}],
"default_lakehouse": "<lakehouse-id>",
"default_lakehouse_name": "MainLakehouse",
"default_lakehouse_workspace_id": "<workspace-id>"
}'
```
## Performance Tips
1. **Use workspace pools**: Faster startup than starter pool
2. **Cache data in lakehouses**: Avoid re-fetching data
3. **Parameterize notebooks**: Reuse logic with different inputs
4. **Monitor execution time**: Set appropriate timeouts
5. **Use async execution**: Don't block on long-running notebooks
6. **Optimize Spark config**: Tune for specific workloads
## Best Practices
1. **Tag parameter cells**: Use "parameters" tag for injected params
2. **Handle failures gracefully**: Add error handling and logging
3. **Version control notebooks**: Export and commit to Git (see the sketch after this list)
4. **Use descriptive names**: Clear naming for scheduled jobs
5. **Document parameters**: Add comments explaining expected inputs
6. **Test locally first**: Validate in development workspace
7. **Monitor schedules**: Review execution history regularly
8. **Clean up old notebooks**: Remove unused notebooks
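For item 3, a minimal version-control sketch (the workspace, notebook, and repository paths are illustrative):
```bash
# Export the notebook definition into a local Git working tree
# (assumes ./notebooks is already a Git repository; run `git init` there first if not)
fab export "Production.Workspace/ETL Pipeline.Notebook" -o ./notebooks
cd ./notebooks
git add "ETL Pipeline.Notebook"
git commit -m "Snapshot ETL Pipeline notebook definition"
```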
## Security Considerations
1. **Credential management**: Use Key Vault for secrets
2. **Workspace permissions**: Control who can execute notebooks
3. **Parameter validation**: Sanitize inputs in notebook code
4. **Data access**: Respect lakehouse/warehouse permissions
5. **Logging**: Don't log sensitive information
## Related Scripts
- `scripts/run_notebook_pipeline.py` - Orchestrate multiple notebooks
- `scripts/monitor_notebook.py` - Monitor long-running executions
- `scripts/export_notebook.py` - Export with validation
- `scripts/schedule_notebook.py` - Simplified scheduling interface