---
name: dataform-engineering-fundamentals
description: Use when developing BigQuery Dataform transformations, SQLX files, source declarations, or troubleshooting pipelines - enforces TDD workflow (tests first), ALWAYS use ${ref()} never hardcoded table paths, comprehensive columns:{} documentation, safety practices (--schema-suffix dev, --dry-run), proper ref() syntax, .sqlx for new declarations, no schema config in operations/tests, and architecture patterns that prevent technical debt under time pressure
---
# Dataform Engineering Fundamentals
## Overview
**Core principle**: Safety practices and proper architecture are NEVER optional in Dataform development, regardless of time pressure or business urgency.
**REQUIRED FOUNDATION:** This skill builds upon superpowers:test-driven-development. All TDD principles from that skill apply to Dataform development. This skill adapts TDD specifically for BigQuery Dataform SQLX files.
**Official Documentation:** For Dataform syntax, configuration options, and API reference, see https://cloud.google.com/dataform/docs
**Best Practices Guide:** For repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
Time pressure does not justify skipping safety checks or creating technical debt. The time "saved" by shortcuts gets multiplied into hours of debugging, broken dependencies, and production issues.
## When to Use
Use this skill for ANY Dataform work:
- Creating new SQLX transformations
- Modifying existing tables
- Adding data sources
- Troubleshooting pipeline failures
- "Quick" reports or ad-hoc analysis
**Especially** use when:
- Under time pressure or deadlines
- Stakeholders are waiting
- Working late at night (exhausted)
- Tempted to "just make it work"
**Related Skills**:
- **Before designing new features**: Use superpowers:brainstorming to refine requirements into clear designs before writing any code
- **When troubleshooting failures**: Use superpowers:systematic-debugging for structured problem-solving
- **When debugging complex issues**: Use superpowers:root-cause-tracing to trace errors back to their source
- **When writing documentation, commit messages, or any prose**: Use elements-of-style:writing-clearly-and-concisely to apply Strunk's timeless writing rules for clarity and conciseness
## Non-Negotiable Safety Practices
These are ALWAYS required. No exceptions for deadlines, urgency, or "simple" tasks:
### 1. Always Use `--schema-suffix dev` for Testing
```bash
# WRONG: Testing in production
dataform run --actions my_table
# CORRECT: Test in dev first
dataform run --schema-suffix dev --actions my_table
```
**Why**: The `dev` suffix is appended to the schema, so the table is written to `my_schema_dev.my_table` instead of the production `my_schema.my_table`. This allows safe testing without impacting production data or dashboards.
### 2. Always Use `--dry-run` Before Execution
```bash
# Check compilation
dataform compile
# Validate SQL without executing
dataform run --schema-suffix dev --dry-run --actions my_table
# Only then execute
dataform run --schema-suffix dev --actions my_table
```
**Why**: Catches SQL errors, missing dependencies, and cost estimation before using BigQuery slots.
### 3. Source Declarations Before ref()
**WRONG**: Using tables without source declarations
```sql
-- This will break dependency tracking
FROM `project_id.external_schema.table_name`
```
**CORRECT**: Create source declaration first
```sql
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name"
}
-- Then reference it
FROM ${ref("table_name")}
```
### 4. ALWAYS Use ${ref()} - NEVER Hardcoded Table Paths
**WRONG**: Hardcoded table paths
```sql
-- NEVER do this
FROM `project.external_schema.table_name`
FROM `project.reporting_schema.customer_metrics`
SELECT * FROM project.source_schema.customers
```
**CORRECT**: Always use ${ref()}
```sql
-- Create source declaration first, then reference
FROM ${ref("table_name")}
FROM ${ref("customer_metrics")}
SELECT * FROM ${ref("customers")}
```
**Why**:
- Dataform tracks dependencies automatically with ref()
- Hardcoded paths break dependency graphs
- ref() enables --schema-suffix to work correctly
- Refactoring is easier when references are managed
**Exception**: None. There is NO valid reason to use hardcoded table paths in SQLX files.
### 5. Proper ref() Syntax
**WRONG**: Including schema in ref() unnecessarily
```sql
FROM ${ref("external_schema", "sales_order")}
```
**CORRECT**: Use single argument when source declared
```sql
FROM ${ref("sales_order")}
```
**When to use two-argument ref()**:
- Source declarations that haven't been imported yet
- Special schema architectures where schema suffix behavior needs explicit control
- Cross-database references in multi-project setups
**Why**:
- Single-argument ref() works for most tables
- Dataform resolves the full path from source declarations
- Two-argument form is only needed for special cases
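For the special cases above, a minimal sketch of the two-argument form (the schema and table names here are purely illustrative):
```sql
-- Hypothetical cross-schema reference where explicit schema control is needed
FROM ${ref("shared_dimensions", "currency_rates")}
```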
### 6. Basic Validation Queries
Always verify your output:
```bash
# Check row counts
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`"
# Check for nulls in critical fields
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) FROM \`project.schema_dev.my_table\`
   WHERE key_field IS NULL"
```
**Why**: Catches silent failures (empty tables, null values, bad joins) immediately.
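Beyond simple counts, a quick reconciliation between the output and its main source catches bad joins and dropped rows early. A hedged sketch (table and column names are illustrative), runnable via `bq query --use_legacy_sql=false` or the BigQuery console:
```sql
-- Compare distinct keys between the source and the dev output (illustrative names)
SELECT
  (SELECT COUNT(DISTINCT customer_id)
   FROM `project.source_schema.orders`) AS source_customers,
  (SELECT COUNT(DISTINCT customer_id)
   FROM `project.reporting_dev.customer_metrics`) AS output_customers
```
A large mismatch between the two counts usually points at a join or filter problem rather than a data issue.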
## Architecture Patterns (Not Optional)
Even for "quick" work, follow these patterns:
**Reference:** For detailed guidance on repository structure, naming conventions, and managing large workflows, see https://cloud.google.com/dataform/docs/best-practices-repositories
### Layered Structure
```
definitions/
  sources/       # External data declarations
  intermediate/  # Transformations and business logic
  output/        # Final tables for consumption
  reports/       # Reporting tables
  marts/         # Data marts for specific use cases
```
**Don't**: Create monolithic queries directly in output layer
**Do**: Break into intermediate steps for reusability and testing
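As a sketch of that layering (file, table, and column names are hypothetical), the intermediate layer holds the join logic while the output layer stays a thin aggregation on top of it:
```sql
-- definitions/intermediate/orders_enriched.sqlx (illustrative)
config {
  type: "table",
  columns: {
    order_id: "Unique order identifier",
    customer_id: "Customer who placed the order",
    net_amount: "Order amount in USD after refunds"
  }
}
SELECT
  o.order_id,
  o.customer_id,
  o.order_total - COALESCE(r.refund_total, 0) AS net_amount
FROM ${ref("orders")} o
LEFT JOIN ${ref("refunds_by_order")} r
  ON o.order_id = r.order_id
```
An output-layer table can then aggregate `${ref("orders_enriched")}` in a few lines, keeping each file responsible for one transformation.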
### Incremental vs Full Refresh
```sql
config {
  type: "incremental",
  uniqueKey: "order_id",
  bigquery: {
    partitionBy: "DATE(order_date)",
    clusterBy: ["customer_id", "product_id"]
  }
}
```
**When to use incremental**: Tables that grow daily (events, transactions, logs)
**When to use full refresh**: Small dimension tables, aggregations with lookback windows
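In the body of an incremental table, the `when(incremental(), ...)` guard limits each run to new rows. A minimal sketch, assuming an `orders` source with an `order_date` column:
```sql
-- On incremental runs, only scan rows newer than what the table already holds
SELECT
  order_id,
  customer_id,
  order_date,
  order_total
FROM ${ref("orders")}
${when(incremental(), `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`)}
```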
### Dataform Assertions
```sql
config {
  type: "table",
  assertions: {
    uniqueKey: ["call_id"],
    nonNull: ["customer_phone_number", "start_time"],
    rowConditions: ["duration >= 0"]
  }
}
```
**Why**: Catches data quality issues automatically during pipeline runs.
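The `assertions:` block covers single-table checks. For rules that span tables, a standalone assertion file can sit alongside it; a hedged sketch with illustrative table names (an assertion fails when its query returns any rows):
```sql
-- definitions/assertions/assert_orders_have_customers.sqlx (illustrative)
config {
  type: "assertion",
  description: "Every order must reference an existing customer"
}
SELECT o.order_id
FROM ${ref("orders")} o
LEFT JOIN ${ref("customers")} c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
```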
### Source Declarations: Prefer .sqlx Files
**STRONGLY PREFER**: .sqlx files for ALL new declarations
```sql
-- definitions/sources/external_system/table_name.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_schema",
  name: "table_name",
  columns: {
    id: "Unique identifier for records",
    // ... more columns
  }
}
```
**ACCEPTABLE (legacy only)**: .js files for existing declarations
```javascript
// definitions/sources/legacy_declarations.js (existing file)
declare({
  database: "project_id",
  schema: "source_schema",
  name: "customers"
});
```
**Rule**: ALL NEW source declarations MUST be .sqlx files. Existing .js declarations can remain but should be migrated to .sqlx when modifying them.
**Why**: .sqlx files support column documentation, are more maintainable, and integrate better with Dataform's dependency tracking.
### Schema Configuration Rules
**Operations**: Files in `definitions/operations/` should NOT include `schema:` config
```sql
-- CORRECT
config {
  type: "operations",
  tags: ["daily"]
}
-- WRONG
config {
  type: "operations",
  schema: "dataform",  // DON'T specify schema
  tags: ["daily"]
}
```
**Tests/Assertions**: Files in `definitions/test/` should NOT include `schema:` config
```sql
-- CORRECT
config {
  type: "assertion",
  description: "Check for duplicates"
}
-- WRONG
config {
  type: "assertion",
  schema: "dataform_assertions",  // DON'T specify schema
  description: "Check for duplicates"
}
```
**Why**: Operations live in the default `dataform` schema and assertions in the `dataform_assertions` schema, both configured in `workflow_settings.yaml`. Specifying a schema explicitly can cause conflicts.
## Documentation Standards (Non-Negotiable)
All tables with `type: "table"` MUST include comprehensive `columns: {}` documentation in the config block.
**Writing Clear Documentation**: When writing column descriptions, commit messages, or any prose that humans will read, use elements-of-style:writing-clearly-and-concisely to ensure clarity and conciseness.
### columns: {} Requirement
**WRONG**: Table without column documentation
```sql
config {
  type: "table",
  schema: "reporting"
}
SELECT customer_id, total_revenue FROM ${ref("orders")}
```
**CORRECT**: Complete column documentation
```sql
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from source system",
    total_revenue: "Sum of all order amounts in USD, excluding refunds"
  }
}
SELECT customer_id, total_revenue FROM ${ref("orders")}
```
### Where to Get Column Descriptions
Column descriptions should be derived from:
1. **Source Declarations**: Copy descriptions from upstream source tables
2. **Third-party Documentation**: Use official API documentation for external systems (CRM, ERP, analytics platforms)
3. **Business Logic**: Document calculated fields, transformations, and business rules
4. **BI Tool Requirements**: Include context that dashboard builders and analysts need
5. **Dataform Documentation**: Reference https://cloud.google.com/dataform/docs for Dataform-specific configuration and built-in functions
**Example with ERP source documentation**:
```sql
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier from ERP system",
    customer_name: "Customer legal business name",
    account_group: "Customer classification code for account management",
    credit_limit: "Maximum allowed credit in USD"
  }
}
```
### Source Declarations Should Include columns: {}
When applicable, source declarations should also document columns:
```sql
-- definitions/sources/external_api/events.sqlx
config {
  type: "declaration",
  database: "project_id",
  schema: "external_api",
  name: "events",
  description: "Event records from external API with enriched data",
  columns: {
    event_id: "Unique event identifier from API",
    user_id: "User identifier who triggered the event",
    event_type: "Type of event (click, view, purchase, etc.)",
    timestamp: "UTC timestamp when event occurred",
    properties: "JSON object containing event-specific properties"
  }
}
```
**Why document sources**: Downstream tables inherit and extend these descriptions, creating documentation consistency across the pipeline.
## Test-Driven Development (TDD) Workflow
**REQUIRED BACKGROUND:** You MUST understand and follow superpowers:test-driven-development
**BEFORE TDD:** When creating NEW features with unclear requirements, use superpowers:brainstorming FIRST to refine rough ideas into clear designs. Only start TDD once you have a clear understanding of what needs to be built.
When creating NEW features or tables in Dataform, apply the TDD cycle:
1. **RED**: Write tests first, watch them fail
2. **GREEN**: Write minimal code to make tests pass
3. **REFACTOR**: Clean up while keeping tests passing
The superpowers:test-driven-development skill provides the foundational TDD principles. This section adapts those principles specifically for Dataform tables and SQLX files.
### TDD for Dataform Tables
**WRONG: Implementation-first approach**
```
1. Write SQLX transformation
2. Test manually with bq query
3. "It works, ship it"
```
**CORRECT: Test-first approach**
```
1. Write data quality assertions first
2. Write unit tests for business logic
3. Run tests - they should FAIL (table doesn't exist yet)
4. Write SQLX transformation
5. Run tests - they should PASS
6. Refactor transformation if needed
```
### Example TDD Workflow
**Step 1: Write assertions first** (definitions/assertions/assert_customer_metrics.sqlx)
```sql
config {
  type: "assertion",
  description: "Customer metrics must have valid data"
}
-- This WILL fail initially (table doesn't exist)
SELECT 'Duplicate customer_id' AS test
FROM ${ref("customer_metrics")}
GROUP BY customer_id
HAVING COUNT(*) > 1
UNION ALL
SELECT 'Negative lifetime value' AS test
FROM ${ref("customer_metrics")}
WHERE lifetime_value < 0
```
**Step 2: Run tests - watch them fail**
```bash
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# ERROR: Table customer_metrics does not exist ✓ EXPECTED
```
**Step 3: Write minimal implementation** (definitions/output/reports/customer_metrics.sqlx)
```sql
config {
  type: "table",
  schema: "reporting",
  columns: {
    customer_id: "Unique customer identifier",
    lifetime_value: "Total revenue from customer in USD"
  }
}
SELECT
  customer_id,
  SUM(order_total) AS lifetime_value
FROM ${ref("orders")}
GROUP BY customer_id
```
**Step 4: Run tests - watch them pass**
```bash
dataform run --schema-suffix dev --actions customer_metrics
dataform run --schema-suffix dev --run-tests --actions assert_customer_metrics
# No rows returned ✓ TESTS PASS
```
### Why TDD Matters in Dataform
- **Catches bugs before production**: Tests fail when logic is wrong
- **Documents expected behavior**: Tests show what the table should do
- **Prevents regressions**: Future changes won't break existing logic
- **Faster debugging**: Test failures pinpoint exact issues
- **Confidence in refactoring**: Change code safely with test coverage
### TDD Red Flags
If you're thinking:
- "I'll write tests after the implementation" → **NO, write tests FIRST**
- "Tests are overkill for this simple table" → **NO, simple tables break too**
- "I'll test manually with bq query" → **NO, manual tests aren't repeatable**
- "Tests after achieve the same result" → **NO, tests-first catches design flaws**
**All of these mean**: You're skipping TDD. Write tests first, then implementation.
**See also**: The superpowers:test-driven-development skill contains additional TDD rationalizations and red flags that apply universally to all code, including Dataform SQLX files.
## Quick Reference
| Task | Command | Notes |
|------|---------|-------|
| Compile only | `dataform compile` | Check syntax, no BigQuery execution |
| Dry run | `dataform run --schema-suffix dev --dry-run --actions table_name` | Validate SQL, estimate cost |
| Test in dev | `dataform run --schema-suffix dev --actions table_name` | Safe execution in dev environment |
| Run with dependencies | `dataform run --schema-suffix dev --include-deps --actions table_name` | Run upstream dependencies first |
| Run by tag | `dataform run --schema-suffix dev --tags looker` | Run all tables with tag |
| Production deploy | `dataform run --actions table_name` | Only after dev testing succeeds |
## Common Rationalizations (And Why They're Wrong)
| Excuse | Reality | Fix |
|--------|---------|-----|
| "Too urgent to test in dev" | Production failures waste MORE time than dev testing | 3 minutes testing saves 60 minutes debugging |
| "It's just a quick report" | "Quick" reports become permanent tables | Use proper architecture from start |
| "Business is waiting" | Broken output wastes stakeholder time | Correct results delivered 10 minutes later > wrong results now |
| "Hardcoding table path is faster than ${ref()}" | Breaks dependency tracking, creates maintenance nightmare | Create source declaration, use ${ref()} (30 seconds) |
| "I'll refactor it later" | Technical debt rarely gets fixed | Do it right the first time (saves time overall) |
| "Correctness over elegance" | Architecture = maintainability, not elegance | Proper structure IS correctness |
| "I'll add tests after" | After = never | Write tests FIRST (TDD), then implementation |
| "I'll add documentation after" | After = never | Add columns: {} in config block immediately |
| "Working late, just need it working" | Exhaustion causes mistakes | Discipline matters MORE when tired |
| "Column docs are optional for internal tables" | All tables become external eventually | Document everything, always |
| "Tests after achieve same result" | Tests-after = checking what it does; tests-first = defining what it should do | TDD catches design flaws early |
## Red Flags - STOP Immediately
If you're thinking any of these thoughts, STOP and follow the skill:
- "I'll skip `--schema-suffix dev` this once"
- "No time for `--dry-run`"
- "I'll just hardcode the table path instead of using ${ref()}"
- "I'll use backticks instead of ${ref()} (it's faster)"
- "I'll just create one file instead of intermediate layers"
- "Tests are optional for ad-hoc work"
- "I'll write tests after the implementation"
- "I'll add column documentation later"
- "This table doesn't need columns: {} block"
- "I'll use a .js file for declarations (faster to write)"
- "I'll add schema: config to this operation/test file"
- "I'll fix the technical debt later"
- "This is different because [business reason]"
**All of these mean**: You're about to create problems. Follow the non-negotiable practices.
## Common Mistakes
### Mistake 1: Using tables before declaring sources
```sql
-- WRONG: Direct table reference
FROM `project.external_schema.contacts`
-- CORRECT: Declare source first
FROM ${ref("contacts")}
```
**Fix**: Create source declaration in `definitions/sources/` before using in queries.
### Mistake 2: Mixing ref() with manual schema qualification
```sql
-- WRONG: When source exists
FROM ${ref("dataset_name", "table_name")}
-- CORRECT
FROM ${ref("table_name")}
```
**Fix**: Use single-argument `ref()` when source declaration exists. Dataform handles full path resolution.
### Mistake 3: Skipping dev testing under pressure
**Symptom**: "I'll deploy directly to production because it's urgent"
**Fix**: `--schema-suffix dev` takes 30 seconds longer than production deploy. Production failures take hours to fix.
### Mistake 4: Creating monolithic transformations
**Symptom**: 200-line SQLX file with 5 CTEs doing multiple transformations
**Fix**: Break into intermediate tables. Each table should do ONE transformation clearly.
### Mistake 5: Missing columns: {} documentation
**Symptom**: Table config without column descriptions
**Fix**: Add comprehensive `columns: {}` block to EVERY table with `type: "table"`. Get descriptions from source docs, upstream tables, or business logic.
### Mistake 6: Writing implementation before tests
**Symptom**: Creating SQLX file, then adding assertions afterward (or never)
**Fix**: Follow TDD cycle - write assertions first, watch them fail, write implementation, watch tests pass.
### Mistake 7: Using .js files for NEW source declarations
**Symptom**: Creating NEW `definitions/sources/sources.js` files with declare() functions
**Fix**: Create .sqlx files in `definitions/sources/[system]/[table].sqlx` with proper config blocks and column documentation. Existing .js files can remain until they need modification.
### Mistake 8: Hardcoded table paths instead of ${ref()}
**Symptom**: Using backtick-quoted table paths in queries
```sql
FROM `project.external_api.events`
SELECT * FROM project.source_schema.customers
```
**Fix**: ALWAYS use ${ref()} after creating source declarations
```sql
FROM ${ref("events")}
SELECT * FROM ${ref("customers")}
```
**Why critical**: Hardcoded paths break dependency tracking, prevent --schema-suffix from working, and make refactoring impossible.
### Mistake 9: Adding schema: config to operations or tests
**Symptom**: Operations or test files with explicit schema configuration
```sql
config {
  type: "operations",
  schema: "dataform",  // Wrong!
}
```
**Fix**: Remove schema: config - operations and tests use default schemas from workflow_settings.yaml
## Time Pressure Protocol
When under extreme time pressure (board meeting in 2 hours, production down, stakeholder waiting):
1. **Still use dev testing** - 3 minutes saves 60 minutes debugging
2. **Still use --dry-run** - Catches errors before wasting BigQuery slots
3. **Still create source declarations** - Broken dependencies waste MORE time
4. **Still add columns: {} documentation** - Takes 2 minutes, saves hours explaining to Looker users
5. **Still write tests first (TDD)** - 5 minutes writing assertions prevents production bugs
6. **Still do basic validation** - Wrong results are worse than delayed results
7. ⚠️ **Can skip**: Extensive documentation files, peer review, performance optimization
8. ⚠️ **Must document**: Tag as "technical_debt", create TODO with follow-up tasks
**The bottom line**: Safety practices save time. Skipping them wastes time. Even under pressure.
## Troubleshooting Dataform Errors
**RECOMMENDED APPROACH:** When encountering ANY bug, test failure, or unexpected behavior, use superpowers:systematic-debugging before attempting fixes. For errors deep in execution or cascading failures, use superpowers:root-cause-tracing to identify the original trigger.
**Official Reference:** For Dataform-specific errors, configuration issues, or syntax questions, consult https://cloud.google.com/dataform/docs
### "Table not found" errors
**Quick fixes:**
1. Check source declaration exists in `definitions/sources/`
2. Verify ref() syntax (single argument if source exists)
3. Check schema/database match in source config
4. Run `dataform compile` to see resolved SQL
**If issue persists:** Use superpowers:systematic-debugging for structured root cause investigation.
### Dependency cycle errors
**Quick fixes:**
1. Use `${ref("table_name")}` not direct table references
2. Check for circular dependencies (A → B → A)
3. Review dependency graph in Dataform UI
**If issue persists:** Use superpowers:root-cause-tracing to trace the dependency chain back to the source of the cycle.
### Timeout errors
**Quick fixes:**
1. Add partitioning/clustering to config
2. Use incremental updates instead of full refresh
3. Break large transformations into smaller intermediate tables
**If issue persists:** Use superpowers:systematic-debugging to investigate query performance systematically.
## Real-World Impact
**Scenario**: "Quick" report created without source declarations, skipping dev testing.
**Cost**:
- 10 minutes saved initially
- 2 hours debugging "table not found" errors in production
- 3 stakeholder escalations
- 1 broken morning dashboard
- Net loss: 110 minutes
**With proper practices**:
- 13 minutes total (3 extra for dev testing)
- Zero production issues
- Zero escalations
- Net gain: 97 minutes
**Takeaway**: Discipline is faster than shortcuts.