| name | description |
|---|---|
| dataform-engineering-fundamentals | Use when developing BigQuery Dataform transformations, SQLX files, source declarations, or troubleshooting pipelines - enforces TDD workflow (tests first), ALWAYS use ${ref()} never hardcoded table paths, comprehensive columns:{} documentation, safety practices (--schema-suffix dev, --dry-run), proper ref() syntax, .sqlx for new declarations, no schema config in operations/tests, and architecture patterns that prevent technical debt under time pressure |
Dataform Engineering Fundamentals
Overview
Core principle: Safety practices and proper architecture are NEVER optional in Dataform development, regardless of time pressure or business urgency.
REQUIRED FOUNDATION: This skill builds upon superpowers:test-driven-development. All TDD principles from that skill apply to Dataform development. This skill adapts TDD specifically for BigQuery Dataform SQLX files.
Official Documentation: For Dataform syntax, configuration options, and API reference, see https://cloud.google.com/dataform/docs
Best Practices Guide: For repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Time pressure does not justify skipping safety checks or creating technical debt. The time "saved" by shortcuts gets multiplied into hours of debugging, broken dependencies, and production issues.
When to Use
Use this skill for ANY Dataform work:
- Creating new SQLX transformations
- Modifying existing tables
- Adding data sources
- Troubleshooting pipeline failures
- "Quick" reports or ad-hoc analysis
Especially use when:
- Under time pressure or deadlines
- Stakeholders are waiting
- Working late at night (exhausted)
- Tempted to "just make it work"
Related Skills:
- Before designing new features: Use superpowers:brainstorming to refine requirements into clear designs before writing any code
- When troubleshooting failures: Use superpowers:systematic-debugging for structured problem-solving
- When debugging complex issues: Use superpowers:root-cause-tracing to trace errors back to their source
- When writing documentation, commit messages, or any prose: Use elements-of-style:writing-clearly-and-concisely to apply Strunk's timeless writing rules for clarity and conciseness
Non-Negotiable Safety Practices
These are ALWAYS required. No exceptions for deadlines, urgency, or "simple" tasks:
1. Always Use --schema-suffix dev for Testing
# WRONG: Testing in production
dataform run --actions my_table
# CORRECT: Test in dev first
dataform run --schema-suffix dev --actions my_table
Why: Writes to schema_dev.my_table instead of schema_prod.my_table (or adds _dev suffix based on your configuration). Allows safe testing without impacting production data or dashboards.
2. Always Use --dry-run Before Execution
# Check compilation
dataform compile
# Validate SQL without executing
dataform run --schema-suffix dev --dry-run --actions my_table
# Only then execute
dataform run --schema-suffix dev --actions my_table
Why: Catches SQL errors and missing dependencies, and provides cost estimates, before consuming BigQuery slots.
3. Source Declarations Before ref()
WRONG: Using tables without source declarations
-- This will break dependency tracking
FROM `project_id.external_schema.table_name`
CORRECT: Create source declaration first
-- definitions/sources/external_system/table_name.sqlx
config {
type: "declaration",
database: "project_id",
schema: "external_schema",
name: "table_name"
}
-- Then reference it
FROM ${ref("table_name")}
4. ALWAYS Use ${ref()} - NEVER Hardcoded Table Paths
WRONG: Hardcoded table paths
-- NEVER do this
FROM `project.external_schema.table_name`
FROM `project.reporting_schema.customer_metrics`
SELECT * FROM project.source_schema.customers
CORRECT: Always use ${ref()}
-- Create source declaration first, then reference
FROM ${ref("table_name")}
FROM ${ref("customer_metrics")}
SELECT * FROM ${ref("customers")}
Why:
- Dataform tracks dependencies automatically with ref()
- Hardcoded paths break dependency graphs
- ref() enables --schema-suffix to work correctly
- Refactoring is easier when references are managed
Exception: None. There is NO valid reason to use hardcoded table paths in SQLX files.
5. Proper ref() Syntax
WRONG: Including schema in ref() unnecessarily
FROM ${ref("external_schema", "sales_order")}
CORRECT: Use single argument when source declared
FROM ${ref("sales_order")}
When to use two-argument ref():
- Source declarations that haven't been imported yet
- Special schema architectures where schema suffix behavior needs explicit control
- Cross-database references in multi-project setups
Why:
- Single-argument ref() works for most tables
- Dataform resolves the full path from source declarations
- Two-argument form is only needed for special cases
6. Basic Validation Queries
Always verify your output:
# Check row counts
bq query --use_legacy_sql=false \
"SELECT COUNT(*) FROM \`project.schema_dev.my_table\`"
# Check for nulls in critical fields
bq query --use_legacy_sql=false \
"SELECT COUNT(*) FROM \`project.schema_dev.my_table\`
WHERE key_field IS NULL"
Why: Catches silent failures (empty tables, null values, bad joins) immediately.
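To make these checks repeatable, a small shell helper along these lines can wrap the counts returned by the bq queries above (check_rows is a hypothetical name, and the count shown in the usage line is illustrative):

```shell
# Hypothetical helper: fail loudly when a validation count looks wrong.
check_rows() {
  local label="$1" count="$2"
  if [ "$count" -eq 0 ]; then
    echo "FAIL: $label returned 0 rows" >&2
    return 1
  fi
  echo "OK: $label ($count rows)"
}

# Example usage with a count captured from a bq query (value is illustrative):
check_rows "my_table row count" 1523
```

The non-zero return code makes the helper usable in CI or pre-deploy scripts, where a zero-row table should halt the pipeline.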
Architecture Patterns (Not Optional)
Even for "quick" work, follow these patterns:
Reference: For detailed guidance on repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Layered Structure
definitions/
sources/ # External data declarations
intermediate/ # Transformations and business logic
output/ # Final tables for consumption
reports/ # Reporting tables
marts/ # Data marts for specific use cases
Don't: Create monolithic queries directly in output layer
Do: Break into intermediate steps for reusability and testing
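As a sketch of the intermediate layer (file and column names here are hypothetical), a small cleaning step might look like this, with output tables then selecting FROM ${ref("orders_cleaned")} instead of repeating the filter:

```sqlx
-- definitions/intermediate/orders_cleaned.sqlx (hypothetical example)
config {
  type: "table",
  columns: {
    order_id: "Unique order identifier from source system",
    order_total: "Order amount in USD, null rows removed"
  }
}
SELECT order_id, order_total
FROM ${ref("orders")}
WHERE order_total IS NOT NULL
```

Each intermediate table does ONE transformation, so it can be tested and reused independently.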
Incremental vs Full Refresh
config {
type: "incremental",
uniqueKey: "order_id",
bigquery: {
partitionBy: "DATE(order_date)",
clusterBy: ["customer_id", "product_id"]
}
}
When to use incremental: Tables that grow daily (events, transactions, logs)
When to use full refresh: Small dimension tables, aggregations with lookback windows
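The incremental config pairs with a filter in the SQL body. A sketch using Dataform's when(), incremental(), and self() built-ins (table and column names are illustrative):

```sqlx
config {
  type: "incremental",
  uniqueKey: "order_id",
  bigquery: {
    partitionBy: "DATE(order_date)"
  }
}
SELECT order_id, customer_id, order_date, order_total
FROM ${ref("orders")}
-- Only scan rows newer than the current max on incremental runs;
-- full scan on the first run or with --full-refresh
${when(incremental(), `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`)}
```

Without the when(incremental(), ...) guard, every run rescans the full source table and the incremental config saves nothing.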
Dataform Assertions
config {
type: "table",
assertions: {
uniqueKey: ["call_id"],
nonNull: ["customer_phone_number", "start_time"],
rowConditions: ["duration >= 0"]
}
}
Why: Catches data quality issues automatically during pipeline runs.
Source Declarations: Prefer .sqlx Files
STRONGLY PREFER: .sqlx files for ALL new declarations
-- definitions/sources/external_system/table_name.sqlx
config {
type: "declaration",
database: "project_id",
schema: "external_schema",
name: "table_name",
columns: {
id: "Unique identifier for records",
// ... more columns
}
}
ACCEPTABLE (legacy only): .js files for existing declarations
// definitions/sources/legacy_declarations.js (existing file)
declare({
database: "project_id",
schema: "source_schema",
name: "customers"
});
Rule: ALL NEW source declarations MUST be .sqlx files. Existing .js declarations can remain but should be migrated to .sqlx when modifying them.
Why: .sqlx files support column documentation, are more maintainable, and integrate better with Dataform's dependency tracking.
Schema Configuration Rules
Operations: Files in definitions/operations/ should NOT include schema: config
-- CORRECT
config {
type: "operations",
tags: ["daily"]
}
-- WRONG
config {
type: "operations",
schema: "dataform", // DON'T specify schema
tags: ["daily"]
}
Tests/Assertions: Files in definitions/test/ should NOT include schema: config
-- CORRECT
config {
type: "assertion",
description: "Check for duplicates"
}
-- WRONG
config {
type: "assertion",
schema: "dataform_assertions", // DON'T specify schema
description: "Check for duplicates"
}
Why: Operations live in the default dataform schema and assertions live in dataform_assertions schema (configured in workflow_settings.yaml). Specifying schema explicitly can cause conflicts.
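For context, those defaults typically come from workflow_settings.yaml. A sketch of the relevant keys (project and dataset names are illustrative, not prescriptive):

```yaml
defaultProject: my-project
defaultLocation: US
defaultDataset: dataform                      # operations land here by default
defaultAssertionDataset: dataform_assertions  # assertions land here by default
dataformCoreVersion: 3.0.0
```

Because these defaults are centralized, adding schema: to individual operation or assertion files creates a second source of truth that can silently diverge.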
Documentation Standards (Non-Negotiable)
All tables with type: "table" MUST include comprehensive columns: {} documentation in the config block.
Writing Clear Documentation: When writing column descriptions, commit messages, or any prose that humans will read, use elements-of-style:writing-clearly-and-concisely to ensure clarity and conciseness.
columns: {} Requirement
WRONG: Table without column documentation
config {
type: "table",
schema: "reporting"
}
SELECT customer_id, total_revenue FROM ${ref("orders")}
CORRECT: Complete column documentation
config {
type: "table",
schema: "reporting",
columns: {
customer_id: "Unique customer identifier from source system",
total_revenue: "Sum of all order amounts in USD, excluding refunds"
}
}
SELECT customer_id, total_revenue FROM ${ref("orders")}
Where to Get Column Descriptions
Column descriptions should be derived from:
- Source Declarations: Copy descriptions from upstream source tables
- Third-party Documentation: Use official API documentation for external systems (CRM, ERP, analytics platforms)
- Business Logic: Document calculated fields, transformations, and business rules
- BI Tool Requirements: Include context that dashboard builders and analysts need
- Dataform Documentation: Reference https://cloud.google.com/dataform/docs for Dataform-specific configuration and built-in functions
Example with ERP source documentation:
config {
type: "table",
schema: "reporting",
columns: {
customer_id: "Unique customer identifier from ERP system",
customer_name: "Customer legal business name",
account_group: "Customer classification code for account management",
credit_limit: "Maximum allowed credit in USD"
}
}
Source Declarations Should Include columns:
When applicable, source declarations should also document columns:
-- definitions/sources/external_api/events.sqlx
config {
type: "declaration",
database: "project_id",
schema: "external_api",
name: "events",
description: "Event records from external API with enriched data",
columns: {
event_id: "Unique event identifier from API",
user_id: "User identifier who triggered the event",
event_type: "Type of event (click, view, purchase, etc.)",
timestamp: "UTC timestamp when event occurred",
properties: "JSON object containing event-specific properties"
}
}
Why document sources: Downstream tables inherit and extend these descriptions, creating documentation consistency across the pipeline.
Test-Driven Development (TDD) Workflow
REQUIRED BACKGROUND: You MUST understand and follow superpowers:test-driven-development
BEFORE TDD: When creating NEW features with unclear requirements, use superpowers:brainstorming FIRST to refine rough ideas into clear designs. Only start TDD once you have a clear understanding of what needs to be built.
When creating NEW features or tables in Dataform, apply the TDD cycle:
- RED: Write tests first, watch them fail
- GREEN: Write minimal code to make tests pass
- REFACTOR: Clean up while keeping tests passing
The superpowers:test-driven-development skill provides the foundational TDD principles. This section adapts those principles specifically for Dataform tables and SQLX files.
TDD for Dataform Tables
WRONG: Implementation-first approach
1. Write SQLX transformation
2. Test manually with bq query
3. "It works, ship it"
CORRECT: Test-first approach
1. Write data quality assertions first
2. Write unit tests for business logic
3. Run tests - they should FAIL (table doesn't exist yet)
4. Write SQLX transformation
5. Run tests - they should PASS
6. Refactor transformation if needed
Example TDD Workflow
Step 1: Write assertions first (definitions/assertions/assert_customer_metrics.sqlx)
config {
type: "assertion",
description: "Customer metrics must have valid data"
}
-- This WILL fail initially (table doesn't exist)
SELECT 'Duplicate customer_id' AS test
FROM ${ref("customer_metrics")}
GROUP BY customer_id
HAVING COUNT(*) > 1
UNION ALL
SELECT 'Negative lifetime value' AS test
FROM ${ref("customer_metrics")}
WHERE lifetime_value < 0
Step 2: Run tests - watch them fail
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# ERROR: Table customer_metrics does not exist ✓ EXPECTED
Step 3: Write minimal implementation (definitions/output/reports/customer_metrics.sqlx)
config {
type: "table",
schema: "reporting",
columns: {
customer_id: "Unique customer identifier",
lifetime_value: "Total revenue from customer in USD"
}
}
SELECT
customer_id,
SUM(order_total) AS lifetime_value
FROM ${ref("orders")}
GROUP BY customer_id
Step 4: Run tests - watch them pass
dataform run --schema-suffix dev --actions customer_metrics
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# No rows returned ✓ TESTS PASS
Why TDD Matters in Dataform
- Catches bugs before production: Tests fail when logic is wrong
- Documents expected behavior: Tests show what the table should do
- Prevents regressions: Future changes won't break existing logic
- Faster debugging: Test failures pinpoint exact issues
- Confidence in refactoring: Change code safely with test coverage
TDD Red Flags
If you're thinking:
- "I'll write tests after the implementation" → NO, write tests FIRST
- "Tests are overkill for this simple table" → NO, simple tables break too
- "I'll test manually with bq query" → NO, manual tests aren't repeatable
- "Tests after achieve the same result" → NO, tests-first catches design flaws
All of these mean: You're skipping TDD. Write tests first, then implementation.
See also: The superpowers:test-driven-development skill contains additional TDD rationalizations and red flags that apply universally to all code, including Dataform SQLX files.
Quick Reference
| Task | Command | Notes |
|---|---|---|
| Compile only | dataform compile | Check syntax, no BigQuery execution |
| Dry run | dataform run --schema-suffix dev --dry-run --actions table_name | Validate SQL, estimate cost |
| Test in dev | dataform run --schema-suffix dev --actions table_name | Safe execution in dev environment |
| Run with dependencies | dataform run --schema-suffix dev --include-deps --actions table_name | Run upstream dependencies first |
| Run by tag | dataform run --schema-suffix dev --tags looker | Run all tables with the tag |
| Production deploy | dataform run --actions table_name | Only after dev testing succeeds |
Common Rationalizations (And Why They're Wrong)
| Excuse | Reality | Fix |
|---|---|---|
| "Too urgent to test in dev" | Production failures waste MORE time than dev testing | 3 minutes testing saves 60 minutes debugging |
| "It's just a quick report" | "Quick" reports become permanent tables | Use proper architecture from start |
| "Business is waiting" | Broken output wastes stakeholder time | Correct results delivered 10 minutes later > wrong results now |
| "Hardcoding table path is faster than ${ref()}" | Breaks dependency tracking, creates maintenance nightmare | Create source declaration, use ${ref()} (30 seconds) |
| "I'll refactor it later" | Technical debt rarely gets fixed | Do it right the first time (saves time overall) |
| "Correctness over elegance" | Architecture = maintainability, not elegance | Proper structure IS correctness |
| "I'll add tests after" | After = never | Write tests FIRST (TDD), then implementation |
| "I'll add documentation after" | After = never | Add columns: {} in config block immediately |
| "Working late, just need it working" | Exhaustion causes mistakes | Discipline matters MORE when tired |
| "Column docs are optional for internal tables" | All tables become external eventually | Document everything, always |
| "Tests after achieve same result" | Tests-after = checking what it does; tests-first = defining what it should do | TDD catches design flaws early |
Red Flags - STOP Immediately
If you're thinking any of these thoughts, STOP and follow the skill:
- "I'll skip --schema-suffix dev this once"
- "No time for --dry-run"
- "I'll just hardcode the table path instead of using ${ref()}"
- "I'll use backticks instead of ${ref()} (it's faster)"
- "I'll just create one file instead of intermediate layers"
- "Tests are optional for ad-hoc work"
- "I'll write tests after the implementation"
- "I'll add column documentation later"
- "This table doesn't need columns: {} block"
- "I'll use a .js file for declarations (faster to write)"
- "I'll add schema: config to this operation/test file"
- "I'll fix the technical debt later"
- "This is different because [business reason]"
All of these mean: You're about to create problems. Follow the non-negotiable practices.
Common Mistakes
Mistake 1: Using tables before declaring sources
-- WRONG: Direct table reference
FROM `project.external_schema.contacts`
-- CORRECT: Declare source first
FROM ${ref("contacts")}
Fix: Create source declaration in definitions/sources/ before using in queries.
Mistake 2: Mixing ref() with manual schema qualification
-- WRONG: When source exists
FROM ${ref("dataset_name", "table_name")}
-- CORRECT
FROM ${ref("table_name")}
Fix: Use single-argument ref() when source declaration exists. Dataform handles full path resolution.
Mistake 3: Skipping dev testing under pressure
Symptom: "I'll deploy directly to production because it's urgent"
Fix: --schema-suffix dev takes 30 seconds longer than production deploy. Production failures take hours to fix.
Mistake 4: Creating monolithic transformations
Symptom: 200-line SQLX file with 5 CTEs doing multiple transformations
Fix: Break into intermediate tables. Each table should do ONE transformation clearly.
Mistake 5: Missing columns: {} documentation
Symptom: Table config without column descriptions
Fix: Add comprehensive columns: {} block to EVERY table with type: "table". Get descriptions from source docs, upstream tables, or business logic.
Mistake 6: Writing implementation before tests
Symptom: Creating SQLX file, then adding assertions afterward (or never)
Fix: Follow TDD cycle - write assertions first, watch them fail, write implementation, watch tests pass.
Mistake 7: Using .js files for NEW source declarations
Symptom: Creating NEW definitions/sources/sources.js files with declare() functions
Fix: Create .sqlx files in definitions/sources/[system]/[table].sqlx with proper config blocks and column documentation. Existing .js files can remain until they need modification.
Mistake 8: Hardcoded table paths instead of ${ref()}
Symptom: Using backtick-quoted table paths in queries
FROM `project.external_api.events`
SELECT * FROM project.source_schema.customers
Fix: ALWAYS use ${ref()} after creating source declarations
FROM ${ref("events")}
SELECT * FROM ${ref("customers")}
Why critical: Hardcoded paths break dependency tracking, prevent --schema-suffix from working, and make refactoring impossible.
Mistake 9: Adding schema: config to operations or tests
Symptom: Operations or test files with explicit schema configuration
config {
type: "operations",
schema: "dataform", // Wrong!
}
Fix: Remove schema: config - operations and tests use default schemas from workflow_settings.yaml
Time Pressure Protocol
When under extreme time pressure (board meeting in 2 hours, production down, stakeholder waiting):
- ✅ Still use dev testing - 3 minutes saves 60 minutes debugging
- ✅ Still use --dry-run - Catches errors before wasting BigQuery slots
- ✅ Still create source declarations - Broken dependencies waste MORE time
- ✅ Still add columns: {} documentation - Takes 2 minutes, saves hours explaining to Looker users
- ✅ Still write tests first (TDD) - 5 minutes writing assertions prevents production bugs
- ✅ Still do basic validation - Wrong results are worse than delayed results
- ⚠️ Can skip: Extensive documentation files, peer review, performance optimization
- ⚠️ Must document: Tag as "technical_debt", create TODO with follow-up tasks
The bottom line: Safety practices save time. Skipping them wastes time. Even under pressure.
Troubleshooting Dataform Errors
RECOMMENDED APPROACH: When encountering ANY bug, test failure, or unexpected behavior, use superpowers:systematic-debugging before attempting fixes. For errors deep in execution or cascading failures, use superpowers:root-cause-tracing to identify the original trigger.
Official Reference: For Dataform-specific errors, configuration issues, or syntax questions, consult https://cloud.google.com/dataform/docs
"Table not found" errors
Quick fixes:
- Check source declaration exists in definitions/sources/
- Verify ref() syntax (single argument if source exists)
- Check schema/database match in source config
- Run dataform compile to see resolved SQL
If issue persists: Use superpowers:systematic-debugging for structured root cause investigation.
Dependency cycle errors
Quick fixes:
- Use ${ref("table_name")} not direct table references
- Check for circular dependencies (A → B → A)
- Review dependency graph in Dataform UI
If issue persists: Use superpowers:root-cause-tracing to trace the dependency chain back to the source of the cycle.
Timeout errors
Quick fixes:
- Add partitioning/clustering to config
- Use incremental updates instead of full refresh
- Break large transformations into smaller intermediate tables
If issue persists: Use superpowers:systematic-debugging to investigate query performance systematically.
Real-World Impact
Scenario: "Quick" report created without source declarations, skipping dev testing.
Cost:
- 3 minutes saved initially by skipping dev testing
- 2 hours debugging "table not found" errors in production
- 3 stakeholder escalations
- 1 broken morning dashboard
- Net loss: 117 minutes
With proper practices:
- 13 minutes total (3 extra for dev testing)
- Zero production issues
- Zero escalations
- Net gain: 117 minutes
Takeaway: Discipline is faster than shortcuts.