--- name: trino description: Expert assistance for writing, analyzing, and optimizing Trino SQL queries for Treasure Data. Use this skill when users need help with Trino queries, performance optimization, or TD-specific SQL patterns. --- # Trino SQL Expert Expert assistance for writing and optimizing Trino SQL queries for Treasure Data environments. ## When to Use This Skill Use this skill when: - Writing new Trino SQL queries for TD - Optimizing existing Trino queries for performance - Debugging Trino query errors or issues - Converting queries from other SQL dialects to Trino - Implementing TD best practices for data processing ## Core Principles ### 1. TD Table Naming Conventions Always use the TD table format: ```sql SELECT * FROM database_name.table_name ``` ### 2. Partitioning and Time-based Queries TD tables are typically partitioned by time. Always include time filters for performance: ```sql SELECT * FROM database_name.table_name WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31') ``` Or use relative time ranges: ```sql WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME()) ``` ### 3. Performance Optimization **Use APPROX functions for large datasets:** ```sql SELECT APPROX_DISTINCT(user_id) as unique_users, APPROX_PERCENTILE(response_time, 0.95) as p95_response FROM database_name.events WHERE TD_TIME_RANGE(time, '2024-01-01') ``` **Partition pruning:** ```sql -- Good: Filters on partition column WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02') -- Avoid: Non-partition column filters without time filter WHERE event_type = 'click' -- Missing time filter! ``` **Limit data scanned:** ```sql -- Use LIMIT for exploratory queries SELECT * FROM table_name WHERE TD_TIME_RANGE(time, '2024-01-01') LIMIT 1000 ``` ### 4. Common TD Functions **TD_INTERVAL** - Simplified relative time filtering (Recommended): ```sql -- Current day WHERE TD_INTERVAL(time, '1d', 'JST') -- Yesterday WHERE TD_INTERVAL(time, '-1d', 'JST') -- Previous week WHERE TD_INTERVAL(time, '-1w', 'JST') -- Previous month WHERE TD_INTERVAL(time, '-1M', 'JST') -- 2 days ago (offset syntax) WHERE TD_INTERVAL(time, '-1d/-1d', 'JST') -- 3 months ago (combined offset) WHERE TD_INTERVAL(time, '-1M/-2M', 'JST') ``` **Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. Cannot accept TD_SCHEDULED_TIME as first argument, but including TD_SCHEDULED_TIME elsewhere in the query establishes the reference date. **TD_TIME_RANGE** - Filter by time partitions (explicit dates): ```sql TD_TIME_RANGE(time, '2024-01-01', '2024-01-31') TD_TIME_RANGE(time, '2024-01-01') -- Single day ``` **TD_SCHEDULED_TIME()** - Get scheduled execution time: ```sql TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d') -- Yesterday ``` **TD_TIME_STRING** - Format timestamps (Recommended): ```sql -- Uses simple format codes instead of full format strings TD_TIME_STRING(time, 'd!', 'JST') -- Returns: 2018-09-13 TD_TIME_STRING(time, 's!', 'UTC') -- Returns: 2018-09-13 16:45:34 TD_TIME_STRING(time, 'M!', 'JST') -- Returns: 2018-09 (year-month) TD_TIME_STRING(time, 'h!', 'UTC') -- Returns: 2018-09-13 16 (year-month-day hour) -- With timezone in output (without ! suffix) TD_TIME_STRING(time, 'd', 'JST') -- Returns: 2018-09-13 00:00:00+0900 TD_TIME_STRING(time, 's', 'UTC') -- Returns: 2018-09-13 16:45:34+0000 ``` **Format codes:** - `y!` = yyyy (year only) - `q!` = yyyy-MM (quarter start) - `M!` = yyyy-MM (month) - `w!` = yyyy-MM-dd (week start) - `d!` = yyyy-MM-dd (day) - `h!` = yyyy-MM-dd HH (hour) - `m!` = yyyy-MM-dd HH:mm (minute) - `s!` = yyyy-MM-dd HH:mm:ss (second) - Without the exclamation mark suffix, timezone offset is included **TD_TIME_FORMAT** - Format timestamps (Legacy, use TD_TIME_STRING instead): ```sql TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'UTC') ``` **TD_SESSIONIZE** - Sessionize events: ```sql SELECT TD_SESSIONIZE(time, 1800, user_id) as session_id FROM events ``` ### 5. JOIN Optimization **Put smaller table on the right side:** ```sql -- Good SELECT * FROM large_table l JOIN small_table s ON l.id = s.id -- Consider table size when joining ``` **Use appropriate JOIN types:** ```sql -- INNER JOIN for matching records only -- LEFT JOIN when you need all records from left table -- Avoid FULL OUTER JOIN when possible (expensive) ``` ### 6. Data Types and Casting Be explicit with data types: ```sql CAST(column_name AS BIGINT) CAST(column_name AS VARCHAR) CAST(column_name AS DOUBLE) TRY_CAST(column_name AS BIGINT) -- Returns NULL on failure ``` ### 7. Window Functions ```sql SELECT user_id, event_time, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq, LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event FROM events WHERE TD_TIME_RANGE(time, '2024-01-01') ``` ## Common Patterns ### User Event Analysis ```sql -- Using TD_INTERVAL for last month SELECT TD_TIME_STRING(time, 'd!', 'JST') as date, event_type, COUNT(*) as event_count, APPROX_DISTINCT(user_id) as unique_users FROM database_name.events WHERE TD_INTERVAL(time, '-1M', 'JST') AND event_type IN ('page_view', 'click', 'purchase') GROUP BY 1, 2 ORDER BY 1, 2 ``` **Alternative with explicit date range:** ```sql SELECT TD_TIME_STRING(time, 'd!', 'JST') as date, event_type, COUNT(*) as event_count, APPROX_DISTINCT(user_id) as unique_users FROM database_name.events WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31') AND event_type IN ('page_view', 'click', 'purchase') GROUP BY 1, 2 ORDER BY 1, 2 ``` ### Conversion Funnel ```sql WITH events_filtered AS ( SELECT user_id, event_type, time FROM database_name.events WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31') ) SELECT COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) as step1_users, COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) as step2_users, COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as step3_users FROM events_filtered ``` ### Daily Aggregation ```sql -- Using TD_INTERVAL for yesterday's data SELECT TD_TIME_STRING(time, 'd!', 'JST') as date, COUNT(*) as total_events, APPROX_DISTINCT(user_id) as daily_active_users, AVG(session_duration) as avg_session_duration FROM database_name.events WHERE TD_INTERVAL(time, '-1d', 'JST') AND TD_SCHEDULED_TIME() IS NOT NULL -- Establishes reference date for TD_INTERVAL GROUP BY 1 ORDER BY 1 ``` **For rolling 30-day window:** ```sql SELECT TD_TIME_STRING(time, 'd!', 'JST') as date, COUNT(*) as total_events, APPROX_DISTINCT(user_id) as daily_active_users, AVG(session_duration) as avg_session_duration FROM database_name.events WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-30d'), TD_SCHEDULED_TIME()) GROUP BY 1 ORDER BY 1 ``` ## Error Handling ### Common Errors **"Line X:Y: Column 'time' cannot be resolved"** - Ensure table name is correct - Check that column exists in table schema **"Query exceeded memory limit"** - Add time filters with TD_TIME_RANGE - Use APPROX_ functions instead of exact aggregations - Reduce JOIN complexity or data volume **"Partition not found"** - Verify time range covers existing partitions - Check TD_TIME_RANGE syntax ## Best Practices 1. **Always include time filters** using TD_INTERVAL or TD_TIME_RANGE for partition pruning - Use TD_INTERVAL for relative dates: `WHERE TD_INTERVAL(time, '-1d', 'JST')` - Use TD_TIME_RANGE for explicit dates: `WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')` - Never filter by formatted dates: ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'` 2. **Use TD_TIME_STRING for display only**, not for filtering - ✅ `SELECT TD_TIME_STRING(time, 'd!', 'JST') as date` - ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'` 3. **Use APPROX functions** for large-scale aggregations (APPROX_DISTINCT, APPROX_PERCENTILE) 4. **Limit exploratory queries** to reduce costs and scan time 5. **Test queries on small time ranges** before running on full dataset 6. **Use CTEs (WITH clauses)** for complex queries to improve readability 7. **Add comments** explaining business logic 8. **Consider materialized results** for frequently-run queries ## Example Workflow When helping users write Trino queries: 1. **Understand the requirement** - What data do they need? 2. **Identify tables** - Which TD tables contain the data? 3. **Add time filters** - What time range is needed? 4. **Write base query** - Start with simple SELECT 5. **Add aggregations** - Use appropriate functions 6. **Optimize** - Apply performance best practices 7. **Test** - Validate results on small dataset first ## Resources - Trino SQL documentation: https://trino.io/docs/current/ - TD-specific functions: Check internal TD documentation - Query performance: Use EXPLAIN for query plans