| name | description |
|---|---|
| dataform-engineering-fundamentals | Use when developing BigQuery Dataform transformations, SQLX files, source declarations, or troubleshooting pipelines - enforces TDD workflow (tests first), ALWAYS use ${ref()} never hardcoded table paths, comprehensive columns:{} documentation, safety practices (--schema-suffix dev, --dry-run), proper ref() syntax, .sqlx for new declarations, no schema config in operations/tests, and architecture patterns that prevent technical debt under time pressure |
Dataform Engineering Fundamentals
Overview
Core principle: Safety practices and proper architecture are NEVER optional in Dataform development, regardless of time pressure or business urgency.
REQUIRED FOUNDATION: This skill builds upon superpowers:test-driven-development. All TDD principles from that skill apply to Dataform development. This skill adapts TDD specifically for BigQuery Dataform SQLX files.
Official Documentation: For Dataform syntax, configuration options, and API reference, see https://cloud.google.com/dataform/docs
Best Practices Guide: For repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Time pressure does not justify skipping safety checks or creating technical debt. The time "saved" by shortcuts gets multiplied into hours of debugging, broken dependencies, and production issues.
When to Use
Use this skill for ANY Dataform work:
- Creating new SQLX transformations
- Modifying existing tables
- Adding data sources
- Troubleshooting pipeline failures
- "Quick" reports or ad-hoc analysis
Especially use when:
- Under time pressure or deadlines
- Stakeholders are waiting
- Working late at night (exhausted)
- Tempted to "just make it work"
Related Skills:
- Before designing new features: Use superpowers:brainstorming to refine requirements into clear designs before writing any code
- When troubleshooting failures: Use superpowers:systematic-debugging for structured problem-solving
- When debugging complex issues: Use superpowers:root-cause-tracing to trace errors back to their source
- When writing documentation, commit messages, or any prose: Use elements-of-style:writing-clearly-and-concisely to apply Strunk's timeless writing rules for clarity and conciseness
Non-Negotiable Safety Practices
These are ALWAYS required. No exceptions for deadlines, urgency, or "simple" tasks:
1. Always Use --schema-suffix dev for Testing
# WRONG: Testing in production
dataform run --actions my_table
# CORRECT: Test in dev first
dataform run --schema-suffix dev --actions my_table
Why: Writes to schema_dev.my_table instead of schema_prod.my_table (or adds _dev suffix based on your configuration). Allows safe testing without impacting production data or dashboards.
2. Always Use --dry-run Before Execution
# Check compilation
dataform compile
# Validate SQL without executing
dataform run --schema-suffix dev --dry-run --actions my_table
# Only then execute
dataform run --schema-suffix dev --actions my_table
Why: Catches SQL errors and missing dependencies, and provides cost estimates, before consuming BigQuery slots.
3. Source Declarations Before ref()
WRONG: Using tables without source declarations
-- This will break dependency tracking
FROM `project_id.external_schema.table_name`
CORRECT: Create source declaration first
-- definitions/sources/external_system/table_name.sqlx
config {
type: "declaration",
database: "project_id",
schema: "external_schema",
name: "table_name"
}
-- Then reference it
FROM ${ref("table_name")}
4. ALWAYS Use ${ref()} - NEVER Hardcoded Table Paths
WRONG: Hardcoded table paths
-- NEVER do this
FROM `project.external_schema.table_name`
FROM `project.reporting_schema.customer_metrics`
SELECT * FROM project.source_schema.customers
CORRECT: Always use ${ref()}
-- Create source declaration first, then reference
FROM ${ref("table_name")}
FROM ${ref("customer_metrics")}
SELECT * FROM ${ref("customers")}
Why:
- Dataform tracks dependencies automatically with ref()
- Hardcoded paths break dependency graphs
- ref() enables --schema-suffix to work correctly
- Refactoring is easier when references are managed
Exception: None. There is NO valid reason to use hardcoded table paths in SQLX files.
5. Proper ref() Syntax
WRONG: Including schema in ref() unnecessarily
FROM ${ref("external_schema", "sales_order")}
CORRECT: Use single argument when source declared
FROM ${ref("sales_order")}
When to use two-argument ref():
- Source declarations that haven't been imported yet
- Special schema architectures where schema suffix behavior needs explicit control
- Cross-database references in multi-project setups
Why:
- Single-argument ref() works for most tables
- Dataform resolves the full path from source declarations
- Two-argument form is only needed for special cases
6. Basic Validation Queries
Always verify your output:
# Check row counts
bq query --use_legacy_sql=false \
"SELECT COUNT(*) FROM \`project.schema_dev.my_table\`"
# Check for nulls in critical fields
bq query --use_legacy_sql=false \
"SELECT COUNT(*) FROM \`project.schema_dev.my_table\`
WHERE key_field IS NULL"
Why: Catches silent failures (empty tables, null values, bad joins) immediately.
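To make these checks repeatable, a small shell helper along these lines can wrap the counts returned by the bq queries above (check_rows is a hypothetical name, and the count shown in the usage line is illustrative):

```shell
# Hypothetical helper: fail loudly when a validation count looks wrong.
check_rows() {
  local label="$1" count="$2"
  if [ "$count" -eq 0 ]; then
    echo "FAIL: $label returned 0 rows" >&2
    return 1
  fi
  echo "OK: $label ($count rows)"
}

# Example usage with a count captured from a bq query (value is illustrative):
check_rows "my_table row count" 1523
```

The non-zero return code makes the helper usable in CI or pre-deploy scripts, where a zero-row table should halt the pipeline.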
Architecture Patterns (Not Optional)
Even for "quick" work, follow these patterns:
Reference: For detailed guidance on repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Layered Structure
definitions/
sources/ # External data declarations
intermediate/ # Transformations and business logic
output/ # Final tables for consumption
reports/ # Reporting tables
marts/ # Data marts for specific use cases
Don't: Create monolithic queries directly in output layer
Do: Break into intermediate steps for reusability and testing
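As a sketch of the intermediate layer (file and column names here are hypothetical), a small cleaning step might look like this, with output tables then selecting FROM ${ref("orders_cleaned")} instead of repeating the filter:

```sqlx
-- definitions/intermediate/orders_cleaned.sqlx (hypothetical example)
config {
  type: "table",
  columns: {
    order_id: "Unique order identifier from source system",
    order_total: "Order amount in USD, null rows removed"
  }
}
SELECT order_id, order_total
FROM ${ref("orders")}
WHERE order_total IS NOT NULL
```

Each intermediate table does ONE transformation, so it can be tested and reused independently.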
Incremental vs Full Refresh
config {
type: "incremental",
uniqueKey: "order_id",
bigquery: {
partitionBy: "DATE(order_date)",
clusterBy: ["customer_id", "product_id"]
}
}
When to use incremental: Tables that grow daily (events, transactions, logs)
When to use full refresh: Small dimension tables, aggregations with lookback windows
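The incremental config pairs with a filter in the SQL body. A sketch using Dataform's when(), incremental(), and self() built-ins (table and column names are illustrative):

```sqlx
config {
  type: "incremental",
  uniqueKey: "order_id",
  bigquery: {
    partitionBy: "DATE(order_date)"
  }
}
SELECT order_id, customer_id, order_date, order_total
FROM ${ref("orders")}
-- Only scan rows newer than the current max on incremental runs;
-- full scan on the first run or with --full-refresh
${when(incremental(), `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`)}
```

Without the when(incremental(), ...) guard, every run rescans the full source table and the incremental config saves nothing.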
Dataform Assertions
config {
type: "table",
assertions: {
uniqueKey: ["call_id"],
nonNull: ["customer_phone_number", "start_time"],
rowConditions: ["duration >= 0"]
}
}
Why: Catches data quality issues automatically during pipeline runs.
Source Declarations: Prefer .sqlx Files
STRONGLY PREFER: .sqlx files for ALL new declarations
-- definitions/sources/external_system/table_name.sqlx
config {
type: "declaration",
database: "project_id",
schema: "external_schema",
name: "table_name",
columns: {
id: "Unique identifier for records",
// ... more columns
}
}
ACCEPTABLE (legacy only): .js files for existing declarations
// definitions/sources/legacy_declarations.js (existing file)
declare({
database: "project_id",
schema: "source_schema",
name: "customers"
});
Rule: ALL NEW source declarations MUST be .sqlx files. Existing .js declarations can remain but should be migrated to .sqlx when modifying them.
Why: .sqlx files support column documentation, are more maintainable, and integrate better with Dataform's dependency tracking.
Schema Configuration Rules
Operations: Files in definitions/operations/ should NOT include schema: config
-- CORRECT
config {
type: "operations",
tags: ["daily"]
}
-- WRONG
config {
type: "operations",
schema: "dataform", // DON'T specify schema
tags: ["daily"]
}
Tests/Assertions: Files in definitions/test/ should NOT include schema: config
-- CORRECT
config {
type: "assertion",
description: "Check for duplicates"
}
-- WRONG
config {
type: "assertion",
schema: "dataform_assertions", // DON'T specify schema
description: "Check for duplicates"
}
Why: Operations live in the default dataform schema and assertions live in dataform_assertions schema (configured in workflow_settings.yaml). Specifying schema explicitly can cause conflicts.
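For context, those defaults typically come from workflow_settings.yaml. A sketch of the relevant keys (project and dataset names are illustrative, not prescriptive):

```yaml
defaultProject: my-project
defaultLocation: US
defaultDataset: dataform                      # operations land here by default
defaultAssertionDataset: dataform_assertions  # assertions land here by default
dataformCoreVersion: 3.0.0
```

Because these defaults are centralized, adding schema: to individual operation or assertion files creates a second source of truth that can silently diverge.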
Documentation Standards (Non-Negotiable)
All tables with type: "table" MUST include comprehensive columns: {} documentation in the config block.
Writing Clear Documentation: When writing column descriptions, commit messages, or any prose that humans will read, use elements-of-style:writing-clearly-and-concisely to ensure clarity and conciseness.
columns: {} Requirement
WRONG: Table without column documentation
config {
type: "table",
schema: "reporting"
}
SELECT customer_id, total_revenue FROM ${ref("orders")}
CORRECT: Complete column documentation
config {
type: "table",
schema: "reporting",
columns: {
customer_id: "Unique customer identifier from source system",
total_revenue: "Sum of all order amounts in USD, excluding refunds"
}
}
SELECT customer_id, total_revenue FROM ${ref("orders")}
Where to Get Column Descriptions
Column descriptions should be derived from:
- Source Declarations: Copy descriptions from upstream source tables
- Third-party Documentation: Use official API documentation for external systems (CRM, ERP, analytics platforms)
- Business Logic: Document calculated fields, transformations, and business rules
- BI Tool Requirements: Include context that dashboard builders and analysts need
- Dataform Documentation: Reference https://cloud.google.com/dataform/docs for Dataform-specific configuration and built-in functions
Example with ERP source documentation:
config {
type: "table",
schema: "reporting",
columns: {
customer_id: "Unique customer identifier from ERP system",
customer_name: "Customer legal business name",
account_group: "Customer classification code for account management",
credit_limit: "Maximum allowed credit in USD"
}
}
Source Declarations Should Include columns:
When applicable, source declarations should also document columns:
-- definitions/sources/external_api/events.sqlx
config {
type: "declaration",
database: "project_id",
schema: "external_api",
name: "events",
description: "Event records from external API with enriched data",
columns: {
event_id: "Unique event identifier from API",
user_id: "User identifier who triggered the event",
event_type: "Type of event (click, view, purchase, etc.)",
timestamp: "UTC timestamp when event occurred",
properties: "JSON object containing event-specific properties"
}
}
Why document sources: Downstream tables inherit and extend these descriptions, creating documentation consistency across the pipeline.
Test-Driven Development (TDD) Workflow
REQUIRED BACKGROUND: You MUST understand and follow superpowers:test-driven-development
BEFORE TDD: When creating NEW features with unclear requirements, use superpowers:brainstorming FIRST to refine rough ideas into clear designs. Only start TDD once you have a clear understanding of what needs to be built.
When creating NEW features or tables in Dataform, apply the TDD cycle:
- RED: Write tests first, watch them fail
- GREEN: Write minimal code to make tests pass
- REFACTOR: Clean up while keeping tests passing
The superpowers:test-driven-development skill provides the foundational TDD principles. This section adapts those principles specifically for Dataform tables and SQLX files.
TDD for Dataform Tables
WRONG: Implementation-first approach
1. Write SQLX transformation
2. Test manually with bq query
3. "It works, ship it"
CORRECT: Test-first approach
1. Write data quality assertions first
2. Write unit tests for business logic
3. Run tests - they should FAIL (table doesn't exist yet)
4. Write SQLX transformation
5. Run tests - they should PASS
6. Refactor transformation if needed
Example TDD Workflow
Step 1: Write assertions first (definitions/assertions/assert_customer_metrics.sqlx)
config {
type: "assertion",
description: "Customer metrics must have valid data"
}
-- This WILL fail initially (table doesn't exist)
SELECT 'Duplicate customer_id' AS test
FROM ${ref("customer_metrics")}
GROUP BY customer_id
HAVING COUNT(*) > 1
UNION ALL
SELECT 'Negative lifetime value' AS test
FROM ${ref("customer_metrics")}
WHERE lifetime_value < 0
Step 2: Run tests - watch them fail
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# ERROR: Table customer_metrics does not exist ✓ EXPECTED
Step 3: Write minimal implementation (definitions/output/reports/customer_metrics.sqlx)
config {
type: "table",
schema: "reporting",
columns: {
customer_id: "Unique customer identifier",
lifetime_value: "Total revenue from customer in USD"
}
}
SELECT
customer_id,
SUM(order_total) AS lifetime_value
FROM ${ref("orders")}
GROUP BY customer_id
Step 4: Run tests - watch them pass
dataform run --schema-suffix dev --actions customer_metrics
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# No rows returned ✓ TESTS PASS
Why TDD Matters in Dataform
- Catches bugs before production: Tests fail when logic is wrong
- Documents expected behavior: Tests show what the table should do
- Prevents regressions: Future changes won't break existing logic
- Faster debugging: Test failures pinpoint exact issues
- Confidence in refactoring: Change code safely with test coverage
TDD Red Flags
If you're thinking:
- "I'll write tests after the implementation" → NO, write tests FIRST
- "Tests are overkill for this simple table" → NO, simple tables break too
- "I'll test manually with bq query" → NO, manual tests aren't repeatable
- "Tests after achieve the same result" → NO, tests-first catches design flaws
All of these mean: You're skipping TDD. Write tests first, then implementation.
See also: The superpowers:test-driven-development skill contains additional TDD rationalizations and red flags that apply universally to all code, including Dataform SQLX files.
Quick Reference
| Task | Command | Notes |
|---|---|---|
| Compile only | dataform compile | Check syntax, no BigQuery execution |
| Dry run | dataform run --schema-suffix dev --dry-run --actions table_name | Validate SQL, estimate cost |
| Test in dev | dataform run --schema-suffix dev --actions table_name | Safe execution in dev environment |
| Run with dependencies | dataform run --schema-suffix dev --include-deps --actions table_name | Run upstream dependencies first |
| Run by tag | dataform run --schema-suffix dev --tags looker | Run all tables with the tag |
| Production deploy | dataform run --actions table_name | Only after dev testing succeeds |
Common Rationalizations (And Why They're Wrong)
| Excuse | Reality | Fix |
|---|---|---|
| "Too urgent to test in dev" | Production failures waste MORE time than dev testing | 3 minutes testing saves 60 minutes debugging |
| "It's just a quick report" | "Quick" reports become permanent tables | Use proper architecture from start |
| "Business is waiting" | Broken output wastes stakeholder time | Correct results delivered 10 minutes later > wrong results now |
| "Hardcoding table path is faster than ${ref()}" | Breaks dependency tracking, creates maintenance nightmare | Create source declaration, use ${ref()} (30 seconds) |
| "I'll refactor it later" | Technical debt rarely gets fixed | Do it right the first time (saves time overall) |
| "Correctness over elegance" | Architecture = maintainability, not elegance | Proper structure IS correctness |
| "I'll add tests after" | After = never | Write tests FIRST (TDD), then implementation |
| "I'll add documentation after" | After = never | Add columns: {} in config block immediately |
| "Working late, just need it working" | Exhaustion causes mistakes | Discipline matters MORE when tired |
| "Column docs are optional for internal tables" | All tables become external eventually | Document everything, always |
| "Tests after achieve same result" | Tests-after = checking what it does; tests-first = defining what it should do | TDD catches design flaws early |
Red Flags - STOP Immediately
If you're thinking any of these thoughts, STOP and follow the skill:
- "I'll skip --schema-suffix dev this once"
- "No time for --dry-run"
- "I'll just hardcode the table path instead of using ${ref()}"
- "I'll use backticks instead of ${ref()} (it's faster)"
- "I'll just create one file instead of intermediate layers"
- "Tests are optional for ad-hoc work"
- "I'll write tests after the implementation"
- "I'll add column documentation later"
- "This table doesn't need columns: {} block"
- "I'll use a .js file for declarations (faster to write)"
- "I'll add schema: config to this operation/test file"
- "I'll fix the technical debt later"
- "This is different because [business reason]"
All of these mean: You're about to create problems. Follow the non-negotiable practices.
Common Mistakes
Mistake 1: Using tables before declaring sources
-- WRONG: Direct table reference
FROM `project.external_schema.contacts`
-- CORRECT: Declare source first
FROM ${ref("contacts")}
Fix: Create source declaration in definitions/sources/ before using in queries.
Mistake 2: Mixing ref() with manual schema qualification
-- WRONG: When source exists
FROM ${ref("dataset_name", "table_name")}
-- CORRECT
FROM ${ref("table_name")}
Fix: Use single-argument ref() when source declaration exists. Dataform handles full path resolution.
Mistake 3: Skipping dev testing under pressure
Symptom: "I'll deploy directly to production because it's urgent"
Fix: --schema-suffix dev takes 30 seconds longer than production deploy. Production failures take hours to fix.
Mistake 4: Creating monolithic transformations
Symptom: 200-line SQLX file with 5 CTEs doing multiple transformations
Fix: Break into intermediate tables. Each table should do ONE transformation clearly.
Mistake 5: Missing columns: {} documentation
Symptom: Table config without column descriptions
Fix: Add comprehensive columns: {} block to EVERY table with type: "table". Get descriptions from source docs, upstream tables, or business logic.
Mistake 6: Writing implementation before tests
Symptom: Creating SQLX file, then adding assertions afterward (or never)
Fix: Follow TDD cycle - write assertions first, watch them fail, write implementation, watch tests pass.
Mistake 7: Using .js files for NEW source declarations
Symptom: Creating NEW definitions/sources/sources.js files with declare() functions
Fix: Create .sqlx files in definitions/sources/[system]/[table].sqlx with proper config blocks and column documentation. Existing .js files can remain until they need modification.
Mistake 8: Hardcoded table paths instead of ${ref()}
Symptom: Using backtick-quoted table paths in queries
FROM `project.external_api.events`
SELECT * FROM project.source_schema.customers
Fix: ALWAYS use ${ref()} after creating source declarations
FROM ${ref("events")}
SELECT * FROM ${ref("customers")}
Why critical: Hardcoded paths break dependency tracking, prevent --schema-suffix from working, and make refactoring impossible.
Mistake 9: Adding schema: config to operations or tests
Symptom: Operations or test files with explicit schema configuration
config {
type: "operations",
schema: "dataform", // Wrong!
}
Fix: Remove schema: config - operations and tests use default schemas from workflow_settings.yaml
Time Pressure Protocol
When under extreme time pressure (board meeting in 2 hours, production down, stakeholder waiting):
- ✅ Still use dev testing - 3 minutes saves 60 minutes debugging
- ✅ Still use --dry-run - Catches errors before wasting BigQuery slots
- ✅ Still create source declarations - Broken dependencies waste MORE time
- ✅ Still add columns: {} documentation - Takes 2 minutes, saves hours explaining to Looker users
- ✅ Still write tests first (TDD) - 5 minutes writing assertions prevents production bugs
- ✅ Still do basic validation - Wrong results are worse than delayed results
- ⚠️ Can skip: Extensive documentation files, peer review, performance optimization
- ⚠️ Must document: Tag as "technical_debt", create TODO with follow-up tasks
The bottom line: Safety practices save time. Skipping them wastes time. Even under pressure.
Troubleshooting Dataform Errors
RECOMMENDED APPROACH: When encountering ANY bug, test failure, or unexpected behavior, use superpowers:systematic-debugging before attempting fixes. For errors deep in execution or cascading failures, use superpowers:root-cause-tracing to identify the original trigger.
Official Reference: For Dataform-specific errors, configuration issues, or syntax questions, consult https://cloud.google.com/dataform/docs
"Table not found" errors
Quick fixes:
- Check source declaration exists in definitions/sources/
- Verify ref() syntax (single argument if source exists)
- Check schema/database match in source config
- Run dataform compile to see resolved SQL
If issue persists: Use superpowers:systematic-debugging for structured root cause investigation.
Dependency cycle errors
Quick fixes:
- Use ${ref("table_name")} not direct table references
- Check for circular dependencies (A → B → A)
- Review dependency graph in Dataform UI
If issue persists: Use superpowers:root-cause-tracing to trace the dependency chain back to the source of the cycle.
Timeout errors
Quick fixes:
- Add partitioning/clustering to config
- Use incremental updates instead of full refresh
- Break large transformations into smaller intermediate tables
If issue persists: Use superpowers:systematic-debugging to investigate query performance systematically.
Real-World Impact
Scenario: "Quick" report created without source declarations, skipping dev testing.
Cost:
- 3 minutes saved initially by skipping dev testing
- 2 hours debugging "table not found" errors in production
- 3 stakeholder escalations
- 1 broken morning dashboard
- Net loss: 117 minutes
With proper practices:
- 13 minutes total (3 extra for dev testing)
- Zero production issues
- Zero escalations
- Net gain: 117 minutes
Takeaway: Discipline is faster than shortcuts.