Initial commit
.claude-plugin/plugin.json (new file, 16 lines)
@@ -0,0 +1,16 @@
{
  "name": "dataform-toolkit",
  "description": "Comprehensive toolkit for BigQuery Dataform development with engineering best practices. Enforces TDD workflow, proper ref() usage, comprehensive documentation, and safe development patterns. Includes quick-access commands for common ETL workflows.",
  "version": "1.0.0",
  "author": {
    "name": "Ivan Histand",
    "email": "ihistand@rotoplas.com",
    "url": "https://github.com/ihistand"
  },
  "skills": [
    "./skills"
  ],
  "commands": [
    "./commands"
  ]
}
README.md (new file, 3 lines)
@@ -0,0 +1,3 @@
# dataform-toolkit

Comprehensive toolkit for BigQuery Dataform development with engineering best practices. Enforces TDD workflow, proper ref() usage, comprehensive documentation, and safe development patterns. Includes quick-access commands for common ETL workflows.
commands/dataform-deploy.md (new file, 21 lines)
@@ -0,0 +1,21 @@
---
description: Deploy tested Dataform table to production
---

You are deploying a Dataform table to production using best practices from the dataform-engineering-fundamentals skill.

**Workflow:**

1. Invoke the dataform-engineering-fundamentals skill
2. Ask the user which table they want to deploy
3. **Pre-deployment verification:**
   - Confirm the table has been tested in the dev environment
   - Verify all tests are passing
   - Check that documentation (columns: {}) is complete
4. **Deployment:**
   - Run `dataform run --dry-run --actions <table_name>` (production dry run)
   - If successful, run `dataform run --actions <table_name>` (production execution)
   - Verify the deployment with validation queries (see the example query below)
5. Report deployment results

**Critical**: Never deploy without testing in dev first. Wrong results delivered quickly are worse than correct results delivered with a small delay.
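For the validation step, a minimal post-deployment sanity check might look like the sketch below; the table path and key column (`customer_id`) are placeholders for the table you actually deployed:

```sql
-- Hypothetical post-deployment checks; replace the table path and key column.
SELECT
  COUNT(*) AS row_count,                         -- should not be 0
  COUNT(DISTINCT customer_id) AS distinct_keys,  -- should equal row_count if customer_id is the unique key
  COUNTIF(customer_id IS NULL) AS null_keys      -- should be 0
FROM `project.reporting.customer_metrics`;
```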
commands/dataform-etl.md (new file, 24 lines)
@@ -0,0 +1,24 @@
---
description: Launch ETL agent for BigQuery Dataform development
---

You are launching the ETL Dataform engineer agent to handle data transformation pipeline work.

**Purpose**: The ETL agent specializes in BigQuery Dataform projects, SQLX files, data quality, and pipeline development. Use this for:
- Complex Dataform transformations
- Pipeline troubleshooting
- Data quality implementations
- ELT workflow coordination

**Task**: Use the Task tool with `subagent_type="etl"` to launch the ETL agent. Pass the user's request as the prompt, including:
- What they need to accomplish
- Any relevant context about tables, datasets, or business logic
- Whether this is new development, modification, or troubleshooting

The ETL agent has access to the dataform-engineering-fundamentals skill and will follow best practices for BigQuery Dataform development.

**Example**:
```
User asks: "Help me create a customer metrics table"
You launch: Task tool with subagent_type="etl" and prompt="Create a customer metrics table in Dataform following TDD workflow. Ask user about required metrics and data sources."
```
commands/dataform-new-table.md (new file, 33 lines)
@@ -0,0 +1,33 @@
---
description: Create new Dataform table using TDD workflow
---

You are creating a new Dataform table following the Test-Driven Development (TDD) workflow from the dataform-engineering-fundamentals skill.

**Workflow:**

1. Invoke the dataform-engineering-fundamentals skill
2. Ask the user about the table requirements:
   - Table name and purpose
   - Expected columns and their descriptions
   - Data sources (for creating source declarations if needed)
   - Business logic and transformations
3. **RED Phase - Write tests first:**
   - Create an assertion file in `definitions/assertions/` (see the skeleton below)
   - Write data quality tests (duplicates, nulls, invalid values)
   - Run the tests - they should FAIL (the table doesn't exist yet)
4. **GREEN Phase - Write the minimal implementation:**
   - Create source declarations if needed
   - Create the table SQLX file with:
     - A proper config block with type and schema
     - Complete columns: {} documentation
     - The SQL transformation
   - Create the table: `dataform run --schema-suffix dev --actions <table_name>`
   - Run the tests - they should PASS
5. **REFACTOR Phase - Improve while keeping tests passing:**
   - Optimize query performance if needed
   - Add partitioning/clustering if appropriate
   - Improve documentation clarity
6. Report completion with file locations and next steps

**Critical**: Always write tests FIRST, then the implementation. Tests-after means you're checking what it does, not defining what it should do.
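As a starting point for the RED phase, an assertion skeleton might look like the sketch below; the table name (`new_table`) and `key_column` are placeholders for the user's actual table and key:

```sql
-- definitions/assertions/assert_new_table.sqlx (hypothetical names)
config {
  type: "assertion",
  description: "new_table must have unique, non-null keys"
}

-- Duplicate keys
SELECT 'duplicate_key' AS failed_test, key_column
FROM ${ref("new_table")}
GROUP BY key_column
HAVING COUNT(*) > 1

UNION ALL

-- Null keys
SELECT 'null_key' AS failed_test, key_column
FROM ${ref("new_table")}
WHERE key_column IS NULL
```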
commands/dataform-test.md (new file, 18 lines)
@@ -0,0 +1,18 @@
---
description: Test Dataform table in dev environment with safety checks
---

You are testing a Dataform table using best practices from the dataform-engineering-fundamentals skill.

**Workflow:**

1. Invoke the dataform-engineering-fundamentals skill
2. Ask the user which table they want to test
3. Follow the safety checklist:
   - Run `dataform compile` to check syntax
   - Run `dataform run --schema-suffix dev --dry-run --actions <table_name>` to validate the SQL
   - Run `dataform run --schema-suffix dev --actions <table_name>` to execute in dev
   - Run basic validation queries to verify the results (see the example below)
4. Report results and any issues found

**Critical**: Always use `--schema-suffix dev` for testing. Never test directly in production.
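A minimal validation query for the last checklist item might look like the sketch below; the dev table path, `key_field`, and `updated_at` are placeholders for the table just built:

```sql
-- Hypothetical dev-environment sanity check; replace the table path and columns.
SELECT
  COUNT(*) AS row_count,                    -- 0 rows usually means a bad join or filter
  COUNTIF(key_field IS NULL) AS null_keys,  -- should be 0
  MIN(updated_at) AS earliest_record,       -- spot-check the loaded date range
  MAX(updated_at) AS latest_record
FROM `project.schema_dev.my_table`;
```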
plugin.lock.json (new file, 61 lines)
@@ -0,0 +1,61 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:ihistand/claude-plugins:dataform-toolkit",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "e6a636048c3bff11a032604ed8e612d6079ef75e",
    "treeHash": "4e905a8c3fcb1cb4ea7483eacc456319fedc9a71f085b8911c35ce88c517ac48",
    "generatedAt": "2025-11-28T10:17:39.224711Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "dataform-toolkit",
    "description": "Comprehensive toolkit for BigQuery Dataform development with engineering best practices. Enforces TDD workflow, proper ref() usage, comprehensive documentation, and safe development patterns. Includes quick-access commands for common ETL workflows.",
    "version": "1.0.0"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "7f16ecada38db3846022fa1c6b3fdd21735f2636a1ab283b0f47b9962503f3db"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "c06b33b8c1ae866c964c1ab0bf108ba3ba595d481b820602eefac4dd1e0bcf1c"
      },
      {
        "path": "commands/dataform-test.md",
        "sha256": "3d63082fec4d207dd69025cfc255eaca7dc7914e7f292bdd7d567930f9cfac17"
      },
      {
        "path": "commands/dataform-new-table.md",
        "sha256": "44b5b4f4357e91d8f49b9e370df6658f13d53388f11ddb03a5158ceb2ca96eb9"
      },
      {
        "path": "commands/dataform-etl.md",
        "sha256": "d3bc64ebbff09f2f489da91806dfc0b8f892f472f93766bb1ea3b401a99df04a"
      },
      {
        "path": "commands/dataform-deploy.md",
        "sha256": "6496fa0a56fffaa0e9f0669efa1261d2d6117a85a281be522ec3599779609a25"
      },
      {
        "path": "skills/dataform-engineering-fundamentals.md",
        "sha256": "5d0e9bf09c2eed1c7ad7fd443ec04372670c07bbf4e6187fdc24d03fcdb35eb8"
      }
    ],
    "dirSha256": "4e905a8c3fcb1cb4ea7483eacc456319fedc9a71f085b8911c35ce88c517ac48"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}
skills/dataform-engineering-fundamentals.md (new file, 661 lines)
@@ -0,0 +1,661 @@
---
name: dataform-engineering-fundamentals
description: Use when developing BigQuery Dataform transformations, SQLX files, source declarations, or troubleshooting pipelines - enforces the TDD workflow (tests first), ALWAYS using ${ref()} instead of hardcoded table paths, comprehensive columns: {} documentation, safety practices (--schema-suffix dev, --dry-run), proper ref() syntax, .sqlx for new declarations, no schema config in operations/tests, and architecture patterns that prevent technical debt under time pressure
---

# Dataform Engineering Fundamentals

## Overview

**Core principle**: Safety practices and proper architecture are NEVER optional in Dataform development, regardless of time pressure or business urgency.

**REQUIRED FOUNDATION:** This skill builds upon superpowers:test-driven-development. All TDD principles from that skill apply to Dataform development. This skill adapts TDD specifically for BigQuery Dataform SQLX files.

**Official Documentation:** For Dataform syntax, configuration options, and API reference, see https://cloud.google.com/dataform/docs

**Best Practices Guide:** For repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories

Time pressure does not justify skipping safety checks or creating technical debt. The time "saved" by shortcuts gets multiplied into hours of debugging, broken dependencies, and production issues.

## When to Use

Use this skill for ANY Dataform work:
- Creating new SQLX transformations
- Modifying existing tables
- Adding data sources
- Troubleshooting pipeline failures
- "Quick" reports or ad-hoc analysis

**Especially** use it when:
- Under time pressure or deadlines
- Stakeholders are waiting
- Working late at night (exhausted)
- Tempted to "just make it work"

**Related Skills**:
- **Before designing new features**: Use superpowers:brainstorming to refine requirements into clear designs before writing any code
- **When troubleshooting failures**: Use superpowers:systematic-debugging for structured problem-solving
- **When debugging complex issues**: Use superpowers:root-cause-tracing to trace errors back to their source
- **When writing documentation, commit messages, or any prose**: Use elements-of-style:writing-clearly-and-concisely to apply Strunk's timeless writing rules for clarity and conciseness

## Non-Negotiable Safety Practices

These are ALWAYS required. No exceptions for deadlines, urgency, or "simple" tasks:

### 1. Always Use `--schema-suffix dev` for Testing

```bash
# WRONG: Testing in production
dataform run --actions my_table

# CORRECT: Test in dev first
dataform run --schema-suffix dev --actions my_table
```

**Why**: Writes to `schema_dev.my_table` instead of `schema_prod.my_table` (or adds a `_dev` suffix, depending on your configuration). This allows safe testing without impacting production data or dashboards.

### 2. Always Use `--dry-run` Before Execution

```bash
# Check compilation
dataform compile

# Validate SQL without executing
dataform run --schema-suffix dev --dry-run --actions my_table

# Only then execute
dataform run --schema-suffix dev --actions my_table
```

**Why**: Catches SQL errors and missing dependencies, and estimates cost, before using BigQuery slots.

### 3. Source Declarations Before ref()

**WRONG**: Using tables without source declarations
```sql
-- This will break dependency tracking
FROM `project_id.external_schema.table_name`
```

**CORRECT**: Create the source declaration first
```sql
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name"
}

-- Then reference it
FROM ${ref("table_name")}
```

### 4. ALWAYS Use ${ref()} - NEVER Hardcoded Table Paths

**WRONG**: Hardcoded table paths
```sql
-- NEVER do this
FROM `project.external_schema.table_name`
FROM `project.reporting_schema.customer_metrics`
SELECT * FROM project.source_schema.customers
```

**CORRECT**: Always use ${ref()}
```sql
-- Create a source declaration first, then reference it
FROM ${ref("table_name")}
FROM ${ref("customer_metrics")}
SELECT * FROM ${ref("customers")}
```

**Why**:
- Dataform tracks dependencies automatically with ref()
- Hardcoded paths break the dependency graph
- ref() is what lets --schema-suffix work correctly
- Refactoring is easier when references are managed

**Exception**: None. There is NO valid reason to use hardcoded table paths in SQLX files.

### 5. Proper ref() Syntax

**WRONG**: Including the schema in ref() unnecessarily
```sql
FROM ${ref("external_schema", "sales_order")}
```

**CORRECT**: Use a single argument when the source is declared
```sql
FROM ${ref("sales_order")}
```

**When to use two-argument ref()**:
- Source declarations that haven't been imported yet
- Special schema architectures where schema-suffix behavior needs explicit control
- Cross-database references in multi-project setups

**Why**:
- Single-argument ref() works for most tables
- Dataform resolves the full path from source declarations
- The two-argument form is only needed for special cases

### 6. Basic Validation Queries

Always verify your output:
```bash
# Check row counts
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`"

# Check for nulls in critical fields
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`
   WHERE key_field IS NULL"
```

**Why**: Catches silent failures (empty tables, null values, bad joins) immediately.

## Architecture Patterns (Not Optional)

Even for "quick" work, follow these patterns:

**Reference:** For detailed guidance on repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories

### Layered Structure

```
definitions/
  sources/        # External data declarations
  intermediate/   # Transformations and business logic
  output/         # Final tables for consumption
    reports/      # Reporting tables
    marts/        # Data marts for specific use cases
```

**Don't**: Create monolithic queries directly in the output layer

**Do**: Break logic into intermediate steps for reusability and testing (see the sketch below)
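To illustrate the layered pattern, here is a minimal sketch of an intermediate table that an output-layer table can then build on. The file path, table, and column names (`int_daily_orders`, `orders`, `order_total`) are hypothetical:

```sql
-- definitions/intermediate/int_daily_orders.sqlx (hypothetical example)
config {
  type: "table",
  columns: {
    order_date: "Calendar date of the order",
    customer_id: "Unique customer identifier",
    daily_order_total: "Sum of order amounts for the customer on that date, in USD"
  }
}

SELECT
  DATE(order_timestamp) AS order_date,
  customer_id,
  SUM(order_total) AS daily_order_total
FROM ${ref("orders")}
GROUP BY order_date, customer_id
```

An output-layer table then selects `FROM ${ref("int_daily_orders")}`, so each step stays small, reusable, and independently testable.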
### Incremental vs Full Refresh

```sql
config {
  type: "incremental",
  uniqueKey: "order_id",
  bigquery: {
    partitionBy: "DATE(order_date)",
    clusterBy: ["customer_id", "product_id"]
  }
}
```

**When to use incremental**: Tables that grow daily (events, transactions, logs)

**When to use full refresh**: Small dimension tables, aggregations with lookback windows
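To make the incremental pattern concrete, here is a minimal sketch of the query body of an incremental table using Dataform's built-in `when()`, `incremental()`, and `self()` helpers; the table and column names are placeholders:

```sql
-- Hypothetical body of an incremental SQLX file (config block as above).
SELECT
  order_id,
  customer_id,
  order_date,
  order_total
FROM ${ref("orders")}
-- On incremental runs, only scan rows newer than what is already in the table.
${when(incremental(),
  `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`
)}
```

On the first run (or a full refresh) the `WHERE` clause is omitted and the whole source is scanned.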
### Dataform Assertions

```sql
config {
  type: "table",
  assertions: {
    uniqueKey: ["call_id"],
    nonNull: ["customer_phone_number", "start_time"],
    rowConditions: ["duration >= 0"]
  }
}
```

**Why**: Catches data quality issues automatically during pipeline runs.

### Source Declarations: Prefer .sqlx Files

**STRONGLY PREFER**: .sqlx files for ALL new declarations
```sql
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name",
  columns: {
    id: "Unique identifier for records",
    // ... more columns
  }
}
```

**ACCEPTABLE (legacy only)**: .js files for existing declarations
```javascript
// definitions/sources/legacy_declarations.js (existing file)
declare({
  database: "project_id",
  schema: "source_schema",
  name: "customers"
});
```

**Rule**: ALL NEW source declarations MUST be .sqlx files. Existing .js declarations can remain, but should be migrated to .sqlx when they are modified.

**Why**: .sqlx files support column documentation, are more maintainable, and integrate better with Dataform's dependency tracking.

### Schema Configuration Rules

**Operations**: Files in `definitions/operations/` should NOT include `schema:` config
```sql
-- CORRECT
config {
  type: "operations",
  tags: ["daily"]
}

-- WRONG
config {
  type: "operations",
  schema: "dataform", // DON'T specify schema
  tags: ["daily"]
}
```

**Tests/Assertions**: Files in `definitions/test/` should NOT include `schema:` config
```sql
-- CORRECT
config {
  type: "assertion",
  description: "Check for duplicates"
}

-- WRONG
config {
  type: "assertion",
  schema: "dataform_assertions", // DON'T specify schema
  description: "Check for duplicates"
}
```

**Why**: Operations live in the default `dataform` schema and assertions live in the `dataform_assertions` schema (both configured in `workflow_settings.yaml`). Specifying the schema explicitly can cause conflicts.

## Documentation Standards (Non-Negotiable)

All tables with `type: "table"` MUST include comprehensive `columns: {}` documentation in the config block.

**Writing Clear Documentation**: When writing column descriptions, commit messages, or any prose that humans will read, use elements-of-style:writing-clearly-and-concisely to ensure clarity and conciseness.

### columns: {} Requirement

**WRONG**: Table without column documentation
```sql
config {
  type: "table",
  schema: "reporting"
}

SELECT customer_id, total_revenue FROM ${ref("orders")}
```

**CORRECT**: Complete column documentation
```sql
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from source system",
    total_revenue: "Sum of all order amounts in USD, excluding refunds"
  }
}

SELECT customer_id, total_revenue FROM ${ref("orders")}
```

### Where to Get Column Descriptions

Column descriptions should be derived from:

1. **Source Declarations**: Copy descriptions from upstream source tables
2. **Third-party Documentation**: Use official API documentation for external systems (CRM, ERP, analytics platforms)
3. **Business Logic**: Document calculated fields, transformations, and business rules
4. **BI Tool Requirements**: Include context that dashboard builders and analysts need
5. **Dataform Documentation**: Reference https://cloud.google.com/dataform/docs for Dataform-specific configuration and built-in functions

**Example with ERP source documentation**:
```sql
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from ERP system",
    customer_name: "Customer legal business name",
    account_group: "Customer classification code for account management",
    credit_limit: "Maximum allowed credit in USD"
  }
}
```

### Source Declarations Should Include columns: {}

When applicable, source declarations should also document columns:

```sql
-- definitions/sources/external_api/events.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_api",
  name: "events",
  description: "Event records from external API with enriched data",
  columns: {
    event_id: "Unique event identifier from API",
    user_id: "User identifier who triggered the event",
    event_type: "Type of event (click, view, purchase, etc.)",
    timestamp: "UTC timestamp when event occurred",
    properties: "JSON object containing event-specific properties"
  }
}
```

**Why document sources**: Downstream tables inherit and extend these descriptions, creating documentation consistency across the pipeline.
## Test-Driven Development (TDD) Workflow

**REQUIRED BACKGROUND:** You MUST understand and follow superpowers:test-driven-development

**BEFORE TDD:** When creating NEW features with unclear requirements, use superpowers:brainstorming FIRST to refine rough ideas into clear designs. Only start TDD once you have a clear understanding of what needs to be built.

When creating NEW features or tables in Dataform, apply the TDD cycle:

1. **RED**: Write tests first, watch them fail
2. **GREEN**: Write minimal code to make tests pass
3. **REFACTOR**: Clean up while keeping tests passing

The superpowers:test-driven-development skill provides the foundational TDD principles. This section adapts those principles specifically for Dataform tables and SQLX files.

### TDD for Dataform Tables

**WRONG: Implementation-first approach**
```
1. Write SQLX transformation
2. Test manually with bq query
3. "It works, ship it"
```

**CORRECT: Test-first approach**
```
1. Write data quality assertions first
2. Write unit tests for business logic
3. Run tests - they should FAIL (table doesn't exist yet)
4. Write SQLX transformation
5. Run tests - they should PASS
6. Refactor transformation if needed
```

### Example TDD Workflow

**Step 1: Write assertions first** (definitions/assertions/assert_customer_metrics.sqlx)
```sql
config {
  type: "assertion",
  description: "Customer metrics must have valid data"
}

-- This WILL fail initially (table doesn't exist)
SELECT 'Duplicate customer_id' AS test
FROM ${ref("customer_metrics")}
GROUP BY customer_id
HAVING COUNT(*) > 1

UNION ALL

SELECT 'Negative lifetime value' AS test
FROM ${ref("customer_metrics")}
WHERE lifetime_value < 0
```

**Step 2: Run tests - watch them fail**
```bash
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# ERROR: Table customer_metrics does not exist ✓ EXPECTED
```

**Step 3: Write minimal implementation** (definitions/output/reports/customer_metrics.sqlx)
```sql
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier",
    lifetime_value: "Total revenue from customer in USD"
  }
}

SELECT
  customer_id,
  SUM(order_total) AS lifetime_value
FROM ${ref("orders")}
GROUP BY customer_id
```

**Step 4: Run tests - watch them pass**
```bash
dataform run --schema-suffix dev --actions customer_metrics
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# No rows returned ✓ TESTS PASS
```

### Why TDD Matters in Dataform

- **Catches bugs before production**: Tests fail when logic is wrong
- **Documents expected behavior**: Tests show what the table should do
- **Prevents regressions**: Future changes won't break existing logic
- **Faster debugging**: Test failures pinpoint exact issues
- **Confidence in refactoring**: Change code safely with test coverage

### TDD Red Flags

If you're thinking:
- "I'll write tests after the implementation" → **NO, write tests FIRST**
- "Tests are overkill for this simple table" → **NO, simple tables break too**
- "I'll test manually with bq query" → **NO, manual tests aren't repeatable**
- "Tests after achieve the same result" → **NO, tests-first catches design flaws**

**All of these mean**: You're skipping TDD. Write tests first, then implementation.

**See also**: The superpowers:test-driven-development skill contains additional TDD rationalizations and red flags that apply universally to all code, including Dataform SQLX files.
## Quick Reference

| Task | Command | Notes |
|------|---------|-------|
| Compile only | `dataform compile` | Check syntax; no BigQuery execution |
| Dry run | `dataform run --schema-suffix dev --dry-run --actions table_name` | Validate SQL, estimate cost |
| Test in dev | `dataform run --schema-suffix dev --actions table_name` | Safe execution in dev environment |
| Run with dependencies | `dataform run --schema-suffix dev --include-deps --actions table_name` | Run upstream dependencies first |
| Run by tag | `dataform run --schema-suffix dev --tags looker` | Run all tables with the tag |
| Production deploy | `dataform run --actions table_name` | Only after dev testing succeeds |

## Common Rationalizations (And Why They're Wrong)

| Excuse | Reality | Fix |
|--------|---------|-----|
| "Too urgent to test in dev" | Production failures waste MORE time than dev testing | 3 minutes of testing saves 60 minutes of debugging |
| "It's just a quick report" | "Quick" reports become permanent tables | Use proper architecture from the start |
| "Business is waiting" | Broken output wastes stakeholder time | Correct results delivered 10 minutes later > wrong results now |
| "Hardcoding the table path is faster than ${ref()}" | Breaks dependency tracking, creates a maintenance nightmare | Create a source declaration and use ${ref()} (30 seconds) |
| "I'll refactor it later" | Technical debt rarely gets fixed | Do it right the first time (saves time overall) |
| "Correctness over elegance" | Architecture = maintainability, not elegance | Proper structure IS correctness |
| "I'll add tests after" | After = never | Write tests FIRST (TDD), then implementation |
| "I'll add documentation after" | After = never | Add columns: {} in the config block immediately |
| "Working late, just need it working" | Exhaustion causes mistakes | Discipline matters MORE when tired |
| "Column docs are optional for internal tables" | All tables become external eventually | Document everything, always |
| "Tests after achieve the same result" | Tests-after = checking what it does; tests-first = defining what it should do | TDD catches design flaws early |

## Red Flags - STOP Immediately

If you're thinking any of these thoughts, STOP and follow the skill:

- "I'll skip `--schema-suffix dev` this once"
- "No time for `--dry-run`"
- "I'll just hardcode the table path instead of using ${ref()}"
- "I'll use backticks instead of ${ref()} (it's faster)"
- "I'll just create one file instead of intermediate layers"
- "Tests are optional for ad-hoc work"
- "I'll write tests after the implementation"
- "I'll add column documentation later"
- "This table doesn't need a columns: {} block"
- "I'll use a .js file for declarations (faster to write)"
- "I'll add schema: config to this operation/test file"
- "I'll fix the technical debt later"
- "This is different because [business reason]"

**All of these mean**: You're about to create problems. Follow the non-negotiable practices.
## Common Mistakes

### Mistake 1: Using tables before declaring sources

```sql
-- WRONG: Direct table reference
FROM `project.external_schema.contacts`

-- CORRECT: Declare the source first
FROM ${ref("contacts")}
```

**Fix**: Create a source declaration in `definitions/sources/` before using the table in queries.

### Mistake 2: Mixing ref() with manual schema qualification

```sql
-- WRONG: When a source declaration exists
FROM ${ref("dataset_name", "table_name")}

-- CORRECT
FROM ${ref("table_name")}
```

**Fix**: Use single-argument `ref()` when a source declaration exists. Dataform handles full path resolution.

### Mistake 3: Skipping dev testing under pressure

**Symptom**: "I'll deploy directly to production because it's urgent"

**Fix**: `--schema-suffix dev` takes 30 seconds longer than a production deploy. Production failures take hours to fix.

### Mistake 4: Creating monolithic transformations

**Symptom**: A 200-line SQLX file with 5 CTEs doing multiple transformations

**Fix**: Break it into intermediate tables. Each table should do ONE transformation clearly.

### Mistake 5: Missing columns: {} documentation

**Symptom**: Table config without column descriptions

**Fix**: Add a comprehensive `columns: {}` block to EVERY table with `type: "table"`. Get descriptions from source docs, upstream tables, or business logic.

### Mistake 6: Writing implementation before tests

**Symptom**: Creating the SQLX file, then adding assertions afterward (or never)

**Fix**: Follow the TDD cycle - write assertions first, watch them fail, write the implementation, watch the tests pass.

### Mistake 7: Using .js files for NEW source declarations

**Symptom**: Creating NEW `definitions/sources/sources.js` files with declare() functions

**Fix**: Create .sqlx files in `definitions/sources/[system]/[table].sqlx` with proper config blocks and column documentation. Existing .js files can remain until they need modification.

### Mistake 8: Hardcoded table paths instead of ${ref()}

**Symptom**: Using backtick-quoted table paths in queries
```sql
FROM `project.external_api.events`
SELECT * FROM project.source_schema.customers
```

**Fix**: ALWAYS use ${ref()} after creating source declarations
```sql
FROM ${ref("events")}
SELECT * FROM ${ref("customers")}
```

**Why critical**: Hardcoded paths break dependency tracking, prevent --schema-suffix from working, and make refactoring impossible.

### Mistake 9: Adding schema: config to operations or tests

**Symptom**: Operations or test files with explicit schema configuration
```sql
config {
  type: "operations",
  schema: "dataform", // Wrong!
}
```

**Fix**: Remove the schema: config - operations and tests use the default schemas from workflow_settings.yaml
## Time Pressure Protocol

When under extreme time pressure (board meeting in 2 hours, production down, stakeholder waiting):

1. ✅ **Still use dev testing** - 3 minutes saves 60 minutes of debugging
2. ✅ **Still use --dry-run** - Catches errors before wasting BigQuery slots
3. ✅ **Still create source declarations** - Broken dependencies waste MORE time
4. ✅ **Still add columns: {} documentation** - Takes 2 minutes, saves hours explaining to Looker users
5. ✅ **Still write tests first (TDD)** - 5 minutes writing assertions prevents production bugs
6. ✅ **Still do basic validation** - Wrong results are worse than delayed results
7. ⚠️ **Can skip**: Extensive documentation files, peer review, performance optimization
8. ⚠️ **Must document**: Tag the work as "technical_debt" and create a TODO with follow-up tasks (see the sketch below)
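For item 8, assuming your project uses Dataform tags the same way the Quick Reference does, a minimal sketch of flagging the shortcut directly in the config block looks like this; the tag name and TODO wording are illustrative:

```sql
-- Hypothetical config for a table shipped under time pressure.
config {
  type: "table",
  schema: "reporting",
  tags: ["technical_debt"],  // makes shortcuts visible and runnable via --tags technical_debt
  columns: {
    customer_id: "Unique customer identifier",
    total_revenue: "Sum of order amounts in USD"
  }
}
-- TODO(follow-up): add partitioning and move the heavy CTEs into an intermediate table.
```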
**The bottom line**: Safety practices save time. Skipping them wastes time. Even under pressure.

## Troubleshooting Dataform Errors

**RECOMMENDED APPROACH:** When encountering ANY bug, test failure, or unexpected behavior, use superpowers:systematic-debugging before attempting fixes. For errors deep in execution or cascading failures, use superpowers:root-cause-tracing to identify the original trigger.

**Official Reference:** For Dataform-specific errors, configuration issues, or syntax questions, consult https://cloud.google.com/dataform/docs

### "Table not found" errors

**Quick fixes:**
1. Check that a source declaration exists in `definitions/sources/`
2. Verify the ref() syntax (single argument if the source exists)
3. Check that the schema/database match in the source config
4. Run `dataform compile` to see the resolved SQL

**If the issue persists:** Use superpowers:systematic-debugging for a structured root-cause investigation.

### Dependency cycle errors

**Quick fixes:**
1. Use `${ref("table_name")}`, not direct table references
2. Check for circular dependencies (A → B → A)
3. Review the dependency graph in the Dataform UI

**If the issue persists:** Use superpowers:root-cause-tracing to trace the dependency chain back to the source of the cycle.

### Timeout errors

**Quick fixes:**
1. Add partitioning/clustering to the config
2. Use incremental updates instead of a full refresh
3. Break large transformations into smaller intermediate tables

**If the issue persists:** Use superpowers:systematic-debugging to investigate query performance systematically.

## Real-World Impact

**Scenario**: A "quick" report created without source declarations, skipping dev testing.

**Cost**:
- 10 minutes saved initially
- 2 hours debugging "table not found" errors in production
- 3 stakeholder escalations
- 1 broken morning dashboard
- Net loss: 110 minutes

**With proper practices**:
- 13 minutes total (3 extra for dev testing)
- Zero production issues
- Zero escalations
- Net gain: 97 minutes

**Takeaway**: Discipline is faster than shortcuts.