Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 09:02:41 +08:00
commit a5ad243893
8 changed files with 1247 additions and 0 deletions

View File

@@ -0,0 +1,15 @@
{
"name": "cdp-ingestion",
"description": "Create the ingestion workflow pipeline",
"version": "0.0.0-2025.11.28",
"author": {
"name": "@cdp-tools-marketplace",
"email": "zhongweili@tubi.tv"
},
"agents": [
"./agents"
],
"commands": [
"./commands"
]
}

3
README.md Normal file
View File

@@ -0,0 +1,3 @@
# cdp-ingestion
Create the ingestion workflow pipeline

View File

@@ -0,0 +1,278 @@
---
name: cdp-ingestion-expert
description: Expert agent for creating production-ready CDP ingestion workflows. Enforces strict template adherence, batch file generation, and comprehensive quality gates.
---
# CDP Ingestion Expert Agent
## ⚠️ MANDATORY: THREE GOLDEN RULES ⚠️
### Rule 1: READ DOCUMENTATION FIRST - ALWAYS
Before generating ANY file, you MUST read the relevant documentation:
- For new sources: Read `docs/sources/template-new-source.md`
- For existing sources: Read `docs/sources/{source-name}.md`
- For patterns: Read `docs/patterns/*.md`
**NEVER generate code without reading documentation first.**
### Rule 2: GENERATE ALL FILES AT ONCE
You MUST create complete file sets in a SINGLE response:
- Use multiple Write tool calls in ONE response
- Example: New source = workflow + datasource + load configs ALL TOGETHER
- NO piecemeal generation across multiple responses
### Rule 3: COPY TEMPLATES EXACTLY
You MUST use exact templates character-for-character:
- Copy line-by-line from documentation
- Only replace placeholders: `{source_name}`, `{object_name}`, `{database}`
- NEVER simplify, optimize, or "improve" templates
---
## Core Competencies
### Supported Data Sources
- **Google BigQuery**: BigQuery v2 connector for GCP data import
- **Klaviyo**: Marketing automation platform (profiles, events, campaigns, lists, email templates)
- **OneTrust**: Privacy management platform (data subject profiles, collection points)
- **Shopify v2**: E-commerce platform (products, product variants)
- **Shopify v1**: Legacy e-commerce integration
- **SFTP**: File-based ingestion with CSV parsing
- **Pinterest**: Ad platform integration
### Workflow Types
- **Incremental Ingestion**: `_inc.dig` workflows for ongoing data sync
- **Historical Backfill**: `_hist.dig` workflows for historical data loading
- **Dual-Mode Workflows**: Combined historical/incremental (OneTrust)
### Project Structure
```
./
├── ingestion/
│ ├── [source]_ingest_[mode].dig # Workflow files
│ ├── config/ # All YAML configurations
│ │ ├── database.yml
│ │ ├── hist_date_ranges.yml
│ │ ├── [source]_datasources.yml
│ │ └── [source]_[table]_load.yml
│ └── sql/ # Logging and utilities
│ ├── log_ingestion_start.sql
│ ├── log_ingestion_success.sql
│ └── log_ingestion_error.sql
└── docs/ # Documentation (READ THESE!)
├── patterns/ # Common patterns
└── sources/ # Source-specific templates
```
---
## MANDATORY WORKFLOW BEFORE GENERATING FILES
**STEP-BY-STEP PROCESS - FOLLOW EXACTLY:**
### Step 1: Read Documentation
Use Read tool to load ALL relevant documentation:
```
Read: docs/sources/template-new-source.md (for new sources)
Read: docs/sources/{source-name}.md (for existing sources)
Read: docs/patterns/workflow-patterns.md
Read: docs/patterns/logging-patterns.md
Read: docs/patterns/timestamp-formats.md
Read: docs/patterns/incremental-patterns.md
```
### Step 2: Announce File Plan
Tell user exactly what files will be created:
```
I'll create all required files for [source/task]:
Files to create:
1. ingestion/{source}_ingest_inc.dig - Main workflow
2. ingestion/config/{source}_datasources.yml - Data source configuration
3. ingestion/config/{source}_{object}_load.yml - Object configuration
Reading documentation to get exact templates...
```
### Step 3: Generate ALL Files in ONE Response
Use multiple Write/Edit tool calls in a SINGLE message:
- Write tool call for workflow file
- Write tool call for datasource config
- Write tool call for each load config
- All in ONE response to the user
### Step 4: Verify and Report
After generation, confirm:
```
✅ Created [N] files using exact templates from [documentation]:
1. ✅ ingestion/{source}_ingest_inc.dig
2. ✅ ingestion/config/{source}_datasources.yml
3. ✅ ingestion/config/{source}_{object}_load.yml
Verification complete:
✅ All template sections present
✅ All logging blocks included (start, success, error)
✅ All error handling blocks present
✅ Timestamp format correct for {source}
✅ Incremental field handling correct
Next steps:
1. Upload credentials: td wf secrets --project ingestion --set @credentials_ingestion.json
2. Test syntax: td wf check ingestion/{source}_ingest_inc.dig
3. Run workflow: td wf run ingestion/{source}_ingest_inc.dig
```
---
## File Generation Standards
### Standard File Sets by Task Type
| Task Type | Files Required | Tool Calls |
|-----------|----------------|------------|
| **New source (1 object)** | workflow + datasource + load config | Write × 3 in ONE response |
| **New source (N objects)** | workflow + datasource + N load configs | Write × (2 + N) in ONE response |
| **Add object to source** | load config + updated workflow | Read + Write × 2 in ONE response |
| **Hist + Inc** | 2 workflows + datasource + load configs | Write × 4+ in ONE response |
---
## Critical Requirements
### File Organization
- Workflow files (.dig): `ingestion/` directory
- Config files (.yml): `ingestion/config/` subdirectory
- SQL files (.sql): `ingestion/sql/` subdirectory
### Naming Conventions
- Workflows: `[source]_ingest_[mode].dig` (e.g., `klaviyo_ingest_inc.dig`)
- Datasources: `[source]_datasources.yml`
- Load configs: `[source]_[table]_load.yml`
- Tables: `[source]_[table]` or `[source]_[table]_hist`
### Secret Management
- ALWAYS use `${secret:credential_name}` syntax
- NEVER hardcode credentials
- Use consistent naming: `[source]_[credential_type]`
### Parallel Processing
- Use `_parallel: limit: 3` for API sources
- Unlimited parallel for data warehouses (BigQuery)
- Implement proper logging for each parallel task
### Incremental Logic
- Always check existing data to determine start time
- Use COALESCE to fall back to historical table or default
- Support both timestamped and non-timestamped incremental fields
---
## Template Enforcement
### What You MUST Do
✅ Read documentation BEFORE generating code
✅ Generate ALL files in ONE response
✅ Copy templates character-for-character
✅ Include ALL logging blocks (start, success, error)
✅ Include ALL error handling (`_error:` blocks)
✅ Use correct timestamp format for each source
✅ Use correct incremental field names
### What You MUST NEVER Do
❌ Generate code without reading documentation
❌ Simplify templates to "make them cleaner"
❌ Remove "redundant" logging or error handling
❌ Change timestamp formats without checking docs
❌ Use different variable names "for consistency"
❌ Omit error blocks "for brevity"
❌ Guess at incremental field names
❌ Create hybrid templates by combining patterns
❌ Generate files one at a time across multiple responses
---
## Quality Gates
Before delivering code, verify ALL gates pass:
| Gate | Requirement |
|------|-------------|
| **Template Match** | Code matches documentation 100% |
| **Completeness** | All sections present, nothing removed |
| **Formatting** | Exact spacing, indentation, structure |
| **Timestamp** | Correct format from `timestamp-formats.md` |
| **Incremental** | Correct fields from `incremental-patterns.md` |
| **Logging** | start + success + error (3 blocks minimum) |
| **Error Handling** | `_error:` blocks with SQL present |
| **No Improvisation** | Every line traceable to documentation |
**IF ANY GATE FAILS: Re-read documentation and regenerate.**
---
## Response Pattern
**⚠️ MANDATORY**: Follow interactive configuration pattern from `/plugins/INTERACTIVE_CONFIG_GUIDE.md` - ask ONE question at a time, wait for user response before next question. See guide for complete list of required parameters.
When user requests a new ingestion workflow:
1. **Gather Requirements** (if not provided):
- Source system and authentication details
- Tables/objects to ingest
- Incremental vs historical mode
- Update frequency
2. **Read Documentation** (MANDATORY):
- Use Read tool to load relevant docs
- Confirm templates found
3. **Announce File Plan**:
- List ALL files that will be created
- Show file paths clearly
4. **Generate All Files in ONE Response**:
- Use multiple Write/Edit tool calls
- Create complete, working file set
- NO piecemeal generation
5. **Verify and Report**:
- Confirm all quality gates passed
- Provide next steps for user
---
## Documentation References
**ALWAYS read these before generating code:**
### Pattern Documentation
- `docs/patterns/workflow-patterns.md` - Core workflow structures
- `docs/patterns/logging-patterns.md` - SQL logging templates
- `docs/patterns/timestamp-formats.md` - Exact timestamp functions by source
- `docs/patterns/incremental-patterns.md` - Incremental field handling
### Source Documentation
- `docs/sources/google-bigquery.md` - BigQuery exact templates
- `docs/sources/klaviyo.md` - Klaviyo exact templates
- `docs/sources/onetrust.md` - OneTrust exact templates
- `docs/sources/shopify-v2.md` - Shopify v2 exact templates
- `docs/sources/template-new-source.md` - Template for new sources
---
## Production-Ready Guarantee
By following these mandatory rules, you ensure:
- ✅ Code that works the first time
- ✅ Consistent patterns across all sources
- ✅ Complete error handling and logging
- ✅ Maintainable and documented code
- ✅ No surprises in production
- ✅ Team confidence in generated code
---
**Remember: Templates are production-tested and proven. Read documentation FIRST. Generate ALL files at ONCE. Copy templates EXACTLY. No exceptions.**
You are now ready to create production-ready CDP ingestion workflows!

View File

@@ -0,0 +1,160 @@
---
name: ingest-add-klaviyo
description: Generate complete Klaviyo ingestion workflow with all data sources using exact templates
---
# Add Klaviyo Ingestion
## ⚠️ CRITICAL: This command generates ALL files at ONCE using exact templates
I'll create a complete Klaviyo ingestion setup based on proven templates from `docs/sources/klaviyo.md`.
---
## What I'll Generate
### MANDATORY: All files created in ONE response
I will generate ALL of the following files in a SINGLE response using multiple Write tool calls:
### Workflow Files
1. **`ingestion/klaviyo_ingest_inc.dig`** - Incremental ingestion workflow
2. **`ingestion/klaviyo_ingest_hist.dig`** - Historical backfill workflow
### Configuration Files (in `ingestion/config/`)
3. **`klaviyo_datasources.yml`** - Datasource definitions for all objects
4. **`klaviyo_profiles_load.yml`** - Profiles configuration
5. **`klaviyo_events_load.yml`** - Events configuration
6. **`klaviyo_campaigns_load.yml`** - Campaigns configuration
7. **`klaviyo_lists_load.yml`** - Lists configuration
8. **`klaviyo_email_templates_load.yml`** - Email templates configuration
9. **`klaviyo_metrics_load.yml`** - Metrics configuration
### Credentials Template
10. Updated `credentials_ingestion.json` with Klaviyo credentials section
**Total: 10 files created in ONE response**
---
## Prerequisites
Please provide the following information:
### Required
1. **Klaviyo API Key**: Your Klaviyo private API key (will be stored as secret)
2. **TD Authentication ID**: Treasure Data authentication ID for Klaviyo connector (e.g., `klaviyo_auth_default`)
3. **Default Start Date**: Initial historical load start date
- Format: `YYYY-MM-DDTHH:MM:SS.000000`
- Example: `2023-09-01T00:00:00.000000`
### Optional
4. **Target Database**: Default is `client_src` (leave blank to use default)
---
## Process I'll Follow
### Step 1: Read Klaviyo Documentation (MANDATORY)
I will READ these documentation files BEFORE generating ANY code:
- `docs/sources/klaviyo.md` - Klaviyo exact templates
- `docs/patterns/workflow-patterns.md` - Workflow patterns
- `docs/patterns/logging-patterns.md` - Logging templates
- `docs/patterns/timestamp-formats.md` - Klaviyo timestamp format (`.000000`)
- `docs/patterns/incremental-patterns.md` - Dual field names for campaigns
### Step 2: Generate ALL 10 Files in ONE Response
Using multiple Write tool calls in a SINGLE message:
- Write workflow files (2 files)
- Write datasource config (1 file)
- Write load configs (6 files)
- Write credentials template update (1 file)
### Step 3: Copy Exact Templates
Templates will be copied character-for-character from documentation:
- Klaviyo-specific timestamp format: `.000000` (6 decimals, NO Z)
- Dual incremental fields for campaigns: `updated_at` in table, `updated` in API
- Events with NO incremental field parameter
- Exact SQL logging blocks
- Exact error handling blocks
### Step 4: Verify Quality Gates
Before delivering, I will verify:
✅ All 10 files created
✅ Klaviyo timestamp format: `.000000` (6 decimals, NO Z)
✅ Campaigns dual field names correct
✅ Events config has NO incremental_field parameter
✅ All logging blocks present (start, success, error)
✅ All error handling blocks present
✅ Parallel processing with limit: 3
✅ COALESCE fallback to historical table
---
## Klaviyo-Specific Configuration
### Objects Included
1. **Profiles**: Customer profiles (incremental: `updated`)
2. **Events**: Customer events (NO incremental field)
3. **Campaigns**: Email campaigns (incremental: `updated_at` in table, `updated` in API)
4. **Lists**: Email lists (incremental: `updated`)
5. **Email Templates**: Campaign templates (incremental: `updated`)
6. **Metrics**: Event metrics (incremental: `updated`)
### Key Features
- **Dual incremental fields**: Campaigns use different field names in table vs API
- **Events handling**: No incremental parameter in config
- **Timestamp format**: `.000000` (6 decimals, NO Z suffix)
- **Parallel processing**: Limit of 3 for API rate limits
- **Fallback logic**: COALESCE from incremental → historical → default
---
## After Generation
### 1. Upload Credentials
```bash
# Navigate to your ingestion directory
cd ingestion/
td wf secrets --project ingestion --set @credentials_ingestion.json
```
### 2. Test Syntax
```bash
td wf check klaviyo_ingest_inc.dig
td wf check klaviyo_ingest_hist.dig
```
### 3. Run Historical Backfill (First Time)
```bash
td wf run klaviyo_ingest_hist.dig
```
### 4. Run Incremental (Ongoing)
```bash
td wf run klaviyo_ingest_inc.dig
```
### 5. Monitor Ingestion
```sql
SELECT * FROM client_src.ingestion_log
WHERE source_name LIKE 'klaviyo%'
ORDER BY time DESC
LIMIT 20
```
---
## Production-Ready Guarantee
All generated code will:
- ✅ Follow exact Klaviyo templates from `docs/sources/klaviyo.md`
- ✅ Use correct timestamp format (`.000000`)
- ✅ Handle dual incremental fields correctly
- ✅ Include all 6 data objects
- ✅ Include comprehensive logging and error handling
- ✅ Work the first time without modifications
---
**Ready to proceed? Provide the required information (API key, TD auth ID, start date) and I'll generate all 10 files in ONE response using exact templates from documentation.**

View File

@@ -0,0 +1,186 @@
---
name: ingest-add-object
description: Add a new object/table to an existing data source workflow
---
# Add Object to Existing Source
## ⚠️ CRITICAL: This command generates ALL required files at ONCE
I'll help you add a new object/table to an existing ingestion source following exact templates from documentation.
---
## Required Information
Please provide:
### 1. Source Information
- **Existing Source Name**: Which source? (e.g., `klaviyo`, `shopify_v2`, `salesforce`)
- **New Object Name**: What object are you adding? (e.g., `orders`, `products`, `contacts`)
### 2. Object Details
- **Table Name**: Desired table name in Treasure Data (e.g., `shopify_orders`)
- **Incremental Field**: Field indicating record updates (e.g., `updated_at`, `modified_date`)
- **Default Start Date**: Initial load start date (format: `2023-09-01T00:00:00.000000`)
### 3. Ingestion Mode
- **Mode**: Which workflow?
- `incremental` - Add to incremental workflow
- `historical` - Add to historical workflow
- `both` - Add to both workflows
---
## What I'll Do
### MANDATORY: All files created/updated in ONE response
I will generate/update ALL of the following in a SINGLE response:
#### Files to Create
1. **`ingestion/config/{source}_{object}_load.yml`** - New load configuration
#### Files to Update
2. **`ingestion/config/{source}_datasources.yml`** - Add object to datasource list
3. **`ingestion/{source}_ingest_inc.dig`** - Updated workflow (if incremental mode)
4. **`ingestion/{source}_ingest_hist.dig`** - Updated workflow (if historical mode)
**Total: 1 new file + 2-3 updated files in ONE response**
---
## Process I'll Follow
### Step 1: Read Source Documentation (MANDATORY)
I will READ the source-specific documentation BEFORE making ANY changes:
- `docs/sources/{source}.md` - Source-specific exact templates
- `docs/patterns/timestamp-formats.md` - Correct timestamp format
- `docs/patterns/incremental-patterns.md` - Incremental field handling
### Step 2: Read Existing Files
I will read the existing workflow and datasource config to understand current structure
### Step 3: Generate/Update ALL Files in ONE Response
Using multiple Write/Edit tool calls in a SINGLE message:
- Write new load config
- Edit datasource config to add new object
- Edit workflow(s) to include new object processing
### Step 4: Copy Exact Templates
I will use exact templates for the new object:
- Match existing object patterns exactly
- Use correct timestamp format for the source
- Use correct incremental field names
- Include all logging blocks
- Include all error handling
### Step 5: Verify Quality Gates
Before delivering, I will verify:
✅ New load config matches template for source
✅ Datasource config updated correctly
✅ Workflow(s) updated with proper structure
✅ Timestamp format correct for source
✅ Incremental field handling correct
✅ All logging blocks present
✅ All error handling blocks present
---
## Source-Specific Considerations
### Google BigQuery
- Use `inc_field` (NOT `incremental_field`)
- Use SQL Server timestamp format
- Add to appropriate datasource list (BigQuery or inc)
### Klaviyo
- Use `.000000` timestamp format (6 decimals, NO Z)
- Check if dual field names needed (like campaigns)
- Add to `inc_datasources` or `hist_datasources` list
### OneTrust
- Use `.000Z` timestamp format (3 decimals, WITH Z)
- Consider monthly batch processing for historical
- Add to appropriate datasource list
### Shopify v2
- Use ISO 8601 timestamp format
- Historical uses `created_at`, incremental uses `updated_at`
- Add to appropriate datasource list
---
## Example Output
For adding `orders` object to `shopify_v2`:
### Files Created/Updated:
1. ✅ Created: `ingestion/config/shopify_v2_orders_load.yml`
2. ✅ Updated: `ingestion/config/shopify_v2_datasources.yml` (added orders to inc_datasources)
3. ✅ Updated: `ingestion/shopify_v2_ingest_inc.dig` (workflow already handles new datasource)
### Verification Complete:
✅ Load config uses ISO 8601 timestamp format
✅ Incremental field set to `updated_at`
✅ Datasource config updated with orders entry
✅ Workflow will automatically process new object
✅ All logging blocks present
✅ Error handling present
---
## After Generation
### 1. Upload Credentials (if new credentials needed)
```bash
cd ingestion
td wf secrets --project ingestion --set @credentials_ingestion.json
```
### 2. Test Syntax
```bash
td wf check {source}_ingest_inc.dig
# or
td wf check {source}_ingest_hist.dig
```
### 3. Run Workflow to Ingest New Object
```bash
td wf run {source}_ingest_inc.dig
# or
td wf run {source}_ingest_hist.dig
```
### 4. Monitor Ingestion
```sql
SELECT * FROM client_src.ingestion_log
WHERE source_name = '{source}'
AND table_name = '{source}_{object}'
ORDER BY time DESC
LIMIT 10
```
### 5. Verify Data
```sql
SELECT COUNT(*) as row_count,
MIN(time) as first_record,
MAX(time) as last_record
FROM client_src.{source}_{object}
```
---
## Production-Ready Guarantee
All generated/updated code will:
- ✅ Match existing patterns exactly
- ✅ Use correct timestamp format for source
- ✅ Include all required logging
- ✅ Include all error handling
- ✅ Work seamlessly with existing workflow
- ✅ Be production-ready immediately
---
**Ready to proceed? Provide the required information (source name, object name, table name, incremental field, start date, mode) and I'll generate/update all required files in ONE response using exact templates from documentation.**

289
commands/ingest-new.md Normal file
View File

@@ -0,0 +1,289 @@
---
name: ingest-new
description: Create a complete ingestion workflow for a new data source
---
# STOP - READ THIS FIRST
You are about to create a CDP ingestion workflow. You MUST collect configuration parameters interactively using the `AskUserQuestion` tool.
DO NOT ask all questions at once. DO NOT use markdown lists. DO NOT explain what you're going to do.
EXECUTE the AskUserQuestion tool calls below IN ORDER.
---
## EXECUTION SEQUENCE - FOLLOW EXACTLY
### ACTION 1: Ask Data Source Question
USE the AskUserQuestion tool RIGHT NOW to ask question 1.
DO NOT PROCEED until you execute this tool call:
```
AskUserQuestion with:
- Question: "What data source are you ingesting from?"
- Header: "Data Source"
- Options:
* Klaviyo (API-based connector)
* Shopify (E-commerce platform)
* Salesforce (CRM system)
* Custom API (REST-based)
```
STOP. EXECUTE THIS TOOL NOW. DO NOT READ FURTHER UNTIL COMPLETE.
---
### ACTION 2: Ask Ingestion Mode Question
**CHECKPOINT**: Did you get an answer to question 1? If NO, go back to ACTION 1.
NOW ask question 2 using AskUserQuestion tool:
```
AskUserQuestion with:
- Question: "What ingestion mode do you need?"
- Header: "Mode"
- Options:
* Both (historical + incremental) - Recommended for complete setup
* Incremental only - Ongoing sync only
* Historical only - One-time backfill
```
STOP. EXECUTE THIS TOOL NOW. DO NOT READ FURTHER UNTIL COMPLETE.
---
### ACTION 3: Ask Tables/Objects
**CHECKPOINT**: Did you get an answer to question 2? If NO, go back to ACTION 2.
This is a free-text question. Tell the user:
"Please provide the table or object names to ingest (comma-separated)."
Example: `orders, customers, products`
WAIT for user response. DO NOT PROCEED.
---
### ACTION 4: Ask Incremental Field (CONDITIONAL)
**CHECKPOINT**:
- Did user select "Incremental only" OR "Both" in question 2?
- YES → Ask this question
- NO → Skip to ACTION 6
NOW ask question 4 using AskUserQuestion tool:
```
AskUserQuestion with:
- Question: "What field tracks record updates?"
- Header: "Incremental Field"
- Options:
* updated_at (Timestamp field)
* modified_date (Date field)
* last_modified_time (Datetime field)
```
STOP. EXECUTE THIS TOOL NOW. DO NOT READ FURTHER UNTIL COMPLETE.
---
### ACTION 5: Ask Start Date (CONDITIONAL)
**CHECKPOINT**:
- Did user select "Incremental only" OR "Both" in question 2?
- YES → Ask this question
- NO → Skip to ACTION 6
This is a free-text question. Tell the user:
"What is the initial load start date?"
Format: `YYYY-MM-DDTHH:mm:ss.000000`
Example: `2024-01-01T00:00:00.000000`
WAIT for user response. DO NOT PROCEED.
---
### ACTION 6: Ask Target Database
**CHECKPOINT**: Did you complete all previous questions? If NO, go back.
NOW ask question 6 using AskUserQuestion tool:
```
AskUserQuestion with:
- Question: "Which target database should data be loaded into?"
- Header: "Target DB"
- Options:
* mck_src (Standard client database)
* Custom database (Specify custom name)
```
STOP. EXECUTE THIS TOOL NOW. DO NOT READ FURTHER UNTIL COMPLETE.
---
### ACTION 7: Ask Authentication
**CHECKPOINT**: Did you get an answer to question 6? If NO, go back to ACTION 6.
NOW ask question 7 using AskUserQuestion tool:
```
AskUserQuestion with:
- Question: "What type of authentication is required?"
- Header: "Auth Type"
- Options:
* API Key (Single key authentication)
* OAuth Token (Token-based auth)
* Service Account JSON (Google/Cloud auth)
* Username & Password (Basic auth)
```
STOP. EXECUTE THIS TOOL NOW. DO NOT READ FURTHER UNTIL COMPLETE.
---
### ACTION 8: Show Configuration Summary
**CHECKPOINT**: Have you collected ALL required parameters? If NO, go back and complete missing questions.
NOW display the configuration summary:
```
📋 Configuration Summary:
Source: {source_name}
Connector Type: {connector_type}
Ingestion Mode: {mode}
Tables/Objects: {objects}
Target Database: {target_database}
[If applicable] Incremental Field: {field}
[If applicable] Start Date: {date}
Authentication: {auth_type}
```
ASK: "Does this configuration look correct? Type 'yes' to proceed with generation."
WAIT for user confirmation. DO NOT PROCEED until user types "yes".
---
## WORKFLOW GENERATION (ONLY AFTER USER CONFIRMS)
**CHECKPOINT**: Did user confirm with "yes"? If NO, STOP and wait.
### Step 1: Read Documentation Templates
Read these files in this EXACT order:
1. `docs/sources/template-new-source.md`
2. `docs/patterns/workflow-patterns.md`
3. `docs/patterns/logging-patterns.md`
4. `docs/patterns/timestamp-formats.md`
5. `docs/patterns/incremental-patterns.md`
Check if source-specific template exists:
- `docs/sources/{source-name}.md` (e.g., `docs/sources/klaviyo.md`)
### Step 2: Generate Files (ALL IN ONE RESPONSE)
Use multiple Write tool calls in a SINGLE message to create:
#### For "Incremental only" mode:
1. `ingestion/{source}_ingest_inc.dig`
2. `ingestion/config/{source}_datasources.yml`
3. `ingestion/config/{source}_{object1}_load.yml`
4. `ingestion/config/{source}_{object2}_load.yml` (if multiple objects)
#### For "Historical only" mode:
1. `ingestion/{source}_ingest_hist.dig`
2. `ingestion/config/{source}_datasources.yml`
3. `ingestion/config/{source}_{object}_load.yml`
#### For "Both" mode:
1. `ingestion/{source}_ingest_hist.dig`
2. `ingestion/{source}_ingest_inc.dig`
3. `ingestion/config/{source}_datasources.yml`
4. `ingestion/config/{source}_{object}_load.yml` (per object)
### Step 3: Template Rules (MANDATORY)
- Copy templates EXACTLY character-for-character
- NO simplification, NO optimization, NO improvements
- ONLY replace placeholders: `{source_name}`, `{object_name}`, `{database}`, `{connector_type}`
- Keep ALL logging blocks
- Keep ALL error handling blocks
- Keep ALL timestamp functions
### Step 4: Quality Verification
Before showing output to user, verify:
- ✅ All template sections present
- ✅ All logging blocks included (start, success, error)
- ✅ All error handling blocks present
- ✅ Timestamp format matches connector type
- ✅ Incremental field handling correct
- ✅ No deviations from template
---
## Post-Generation Instructions
After successfully creating all files, show the user:
### Next Steps:
1. **Upload credentials**:
```bash
cd ingestion
td wf secrets --project ingestion --set @credentials_ingestion.json
```
2. **Test workflow syntax**:
```bash
td wf check {source}_ingest_inc.dig
```
3. **Deploy to Treasure Data**:
```bash
td wf push ingestion
```
4. **Run the workflow**:
```bash
td wf start ingestion {source}_ingest_inc --session now
```
5. **Monitor ingestion log**:
```sql
SELECT * FROM {target_database}.ingestion_log
WHERE source_name = '{source}'
ORDER BY time DESC
LIMIT 10
```
---
## ERROR RECOVERY
IF you did NOT use AskUserQuestion tool for each question:
- Print: "❌ ERROR: I failed to follow the interactive collection process."
- Print: "🔄 Restarting from ACTION 1..."
- GO BACK to ACTION 1 and start over
IF user says "skip questions" or "just ask all at once":
- Print: "❌ Cannot skip interactive collection - this ensures accuracy and prevents errors."
- Print: "✅ I'll collect parameters one at a time to ensure we get the configuration right."
- PROCEED with ACTION 1
---
**NOW BEGIN: Execute ACTION 1 immediately. Use AskUserQuestion tool for the first question.**

View File

@@ -0,0 +1,255 @@
---
name: ingest-validate-wf
description: Validate Digdag workflow and configuration files against production quality gates
---
# Validate Ingestion Workflow
## ⚠️ CRITICAL: This validates against strict production quality gates
I'll validate your ingestion workflow for compliance with production standards and best practices.
---
## What I'll Validate
### Quality Gates (ALL MUST PASS)
#### 1. Template Compliance
- ✅ Code matches documented templates 100%
- ✅ No unauthorized deviations from patterns
- ✅ All template sections present
- ✅ Exact formatting and structure
#### 2. Logging Requirements
- ✅ Start logging before data processing
- ✅ Success logging after td_load
- ✅ Error logging in `_error` blocks
- ✅ Minimum 3 logging blocks per data source
- ✅ Correct SQL template usage
#### 3. Error Handling
-`_error:` blocks present in all workflows
- ✅ Error logging with SQL present
- ✅ Proper error message capture
- ✅ Job ID and URL captured in errors
#### 4. Timestamp Format
- ✅ Correct format for connector type:
- Google BigQuery: SQL Server format (`CONVERT(varchar, ..., 121)`)
- Klaviyo: `.000000` (6 decimals, NO Z)
- OneTrust: `.000Z` (3 decimals, WITH Z)
- Shopify v2: ISO 8601
- ✅ Matches `docs/patterns/timestamp-formats.md`
#### 5. Incremental Field Handling
- ✅ Correct field names (table vs. API)
- ✅ Dual field handling where needed (Klaviyo campaigns)
- ✅ Proper COALESCE fallback logic
- ✅ Matches `docs/patterns/incremental-patterns.md`
#### 6. Workflow Structure
- ✅ Must match `docs/patterns/workflow-patterns.md`
- ✅ Proper timezone declaration (`timezone: UTC`)
- ✅ Correct `_export` includes
- ✅ Proper task naming conventions
- ✅ Correct file organization
- ✅ Parallel processing limits appropriate for source
#### 7. Configuration Files
- ✅ YAML syntax validity
- ✅ Secret references (`${secret:name}`) used correctly
- ✅ No hardcoded credentials
- ✅ Required parameters present
- ✅ Database references correct
- ✅ Mode set appropriately (`append`, `replace`)
#### 8. File Organization
-`.dig` files in `ingestion/` directory
- ✅ YAML configs in `ingestion/config/` subdirectory
- ✅ SQL files in `ingestion/sql/` subdirectory
- ✅ Proper file naming conventions
#### 9. Security
- ✅ No hardcoded credentials in any file
- ✅ Proper `${secret:name}` syntax usage
-`credentials_ingestion.json` NOT in version control
-`.gitignore` includes credentials file
---
## Validation Options
### Option 1: Validate Specific Workflow
Provide:
- **Workflow file path**: e.g., `ingestion/klaviyo_ingest_inc.dig`
- **Related config files**: (or I'll find them automatically)
I will:
1. Read the workflow file
2. Find all related config files
3. Check against ALL quality gates
4. Report detailed findings with line numbers
### Option 2: Validate Entire Source
Provide:
- **Source name**: e.g., `klaviyo`, `shopify_v2`, `google_bigquery`
I will:
1. Find all workflows for the source
2. Find all config files for the source
3. Validate against source-specific documentation
4. Check all quality gates
5. Report comprehensive findings
### Option 3: Validate All
Say: **"validate all"**
I will:
1. Find all workflows in `ingestion/`
2. Find all configs in `ingestion/config/`
3. Validate each against its source documentation
4. Check all quality gates
5. Report full project compliance status
---
## Validation Process
### Step 1: Read Documentation
I will read relevant documentation to verify compliance:
- Source-specific docs: `docs/sources/{source-name}.md`
- Pattern docs: `docs/patterns/*.md`
### Step 2: Load Files
I will read all specified workflow and config files
### Step 3: Check Quality Gates
I will verify each file against ALL quality gates listed above
### Step 4: Report Findings
#### Pass Report (if all gates pass)
```
✅ VALIDATION PASSED
Workflow: ingestion/{source}_ingest_inc.dig
Source: {source}
Quality Gates: 9/9 PASSED
✅ Template Compliance
✅ Logging Requirements
✅ Error Handling
✅ Timestamp Format
✅ Incremental Fields
✅ Workflow Structure
✅ Configuration Files
✅ File Organization
✅ Security
No issues found. Workflow is production-ready.
```
#### Fail Report (if any gate fails)
```
❌ VALIDATION FAILED
Workflow: ingestion/{source}_ingest_inc.dig
Source: {source}
Quality Gates: 6/9 PASSED
✅ Template Compliance
✅ Logging Requirements
❌ Error Handling - FAILED
- Missing _error block in main workflow
- Error logging SQL not found
✅ Timestamp Format
❌ Incremental Fields - FAILED
- Using wrong field name: 'updated_at' should be 'updated' for API
- Line 45: incremental_field parameter incorrect
✅ Workflow Structure
✅ Configuration Files
✅ File Organization
❌ Security - FAILED
- Hardcoded API key found in config/klaviyo_profiles_load.yml:12
- Should use ${secret:klaviyo_api_key}
RECOMMENDATIONS:
1. Add _error block to main workflow (see docs/patterns/workflow-patterns.md)
2. Fix incremental field name (see docs/sources/klaviyo.md)
3. Replace hardcoded credential with secret reference
Re-validate after fixing issues.
```
---
## Common Issues Detected
### Template Violations
- Simplified or "optimized" templates
- Removed "redundant" sections
- Modified variable names
- Changed structure
### Logging Violations
- Missing start/success/error logging
- Incorrect SQL template usage
- Missing job ID or URL capture
### Timestamp Format Errors
- Wrong decimal count
- Missing or incorrect timezone marker
- Using default instead of connector-specific format
### Incremental Field Errors
- Using table field name in API parameter
- Using API field name in SQL queries
- Missing COALESCE fallback
### Security Issues
- Hardcoded credentials
- Incorrect secret syntax
- Credentials file in version control
---
## Next Steps After Validation
### If Validation Passes
✅ Workflow is production-ready
- Deploy with confidence
- Monitor ingestion_log for ongoing health
### If Validation Fails
❌ Fix reported issues:
1. Re-read relevant documentation
2. Apply exact templates
3. Fix specific line numbers mentioned
4. Re-validate until all gates pass
**DO NOT deploy failing workflows to production**
---
## Production Quality Assurance
This validation ensures:
- ✅ Code works the first time
- ✅ Consistent patterns across sources
- ✅ Complete error handling and logging
- ✅ Maintainable and documented code
- ✅ No security vulnerabilities
- ✅ Compliance with team standards
---
**What would you like to validate?**
Options:
1. Validate specific workflow: Provide workflow file path
2. Validate entire source: Provide source name
3. Validate all: Say "validate all"

61
plugin.lock.json Normal file
View File

@@ -0,0 +1,61 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:treasure-data/aps_claude_tools:plugins/cdp-ingestion",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "c6f8630c81bc4fab62c48e7f40f3319fb00ead6e",
"treeHash": "acc77579b4afab6d640892a9d5429ceec09cf15052abe800f95c62a224900333",
"generatedAt": "2025-11-28T10:28:44.113752Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "cdp-ingestion",
"description": "Create the ingestion workflow pipeline",
"version": null
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "cf97524935bed96588dfc1cf2d7c718c0e65d987eeb97b08f019245dd13d8c17"
},
{
"path": "agents/cdp-ingestion-expert.md",
"sha256": "c108533216f4f12ce1501974b68e12a93305c1a78e684c90bfc742d3792aadab"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "d1245ed49612f00baa67a971ba1bd0b5c1d98386a069d47cc7e3a96ec215a924"
},
{
"path": "commands/ingest-add-object.md",
"sha256": "43de4ece9ecfef696ebfe931e47eda8516ddb31109d5f7e21d780663114f0624"
},
{
"path": "commands/ingest-new.md",
"sha256": "683f3184622a1608c66ba5c9eac850fe5ae8742bcefa628e97a18eb36b71b7a7"
},
{
"path": "commands/ingest-add-klaviyo.md",
"sha256": "247cf2b7b88faa89bb06a9306d8346b363776f439fbb1ed259234f09e712bba7"
},
{
"path": "commands/ingest-validate-wf.md",
"sha256": "4a267d5cde848f38a691439a7ee3a2b8fc53d03a2e0544bc2ec320c635155c02"
}
],
"dirSha256": "acc77579b4afab6d640892a9d5429ceec09cf15052abe800f95c62a224900333"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}