Files
gh-treasure-data-td-skills-…/skills/trino/SKILL.md
2025-11-30 09:02:59 +08:00

309 lines
8.7 KiB
Markdown

---
name: trino
description: Expert assistance for writing, analyzing, and optimizing Trino SQL queries for Treasure Data. Use this skill when users need help with Trino queries, performance optimization, or TD-specific SQL patterns.
---
# Trino SQL Expert
Expert assistance for writing and optimizing Trino SQL queries for Treasure Data environments.
## When to Use This Skill
Use this skill when:
- Writing new Trino SQL queries for TD
- Optimizing existing Trino queries for performance
- Debugging Trino query errors or issues
- Converting queries from other SQL dialects to Trino
- Implementing TD best practices for data processing
## Core Principles
### 1. TD Table Naming Conventions
Always use the TD table format:
```sql
SELECT * FROM database_name.table_name
```
### 2. Partitioning and Time-based Queries
TD tables are typically partitioned by time. Always include time filters for performance:
```sql
SELECT *
FROM database_name.table_name
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
```
Or use relative time ranges:
```sql
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME())
```
### 3. Performance Optimization
**Use APPROX functions for large datasets:**
```sql
SELECT
APPROX_DISTINCT(user_id) as unique_users,
APPROX_PERCENTILE(response_time, 0.95) as p95_response
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01')
```
**Partition pruning:**
```sql
-- Good: Filters on partition column
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
-- Avoid: Non-partition column filters without time filter
WHERE event_type = 'click' -- Missing time filter!
```
**Limit data scanned:**
```sql
-- Use LIMIT for exploratory queries
SELECT * FROM table_name
WHERE TD_TIME_RANGE(time, '2024-01-01')
LIMIT 1000
```
### 4. Common TD Functions
**TD_INTERVAL** - Simplified relative time filtering (Recommended):
```sql
-- Current day
WHERE TD_INTERVAL(time, '1d', 'JST')
-- Yesterday
WHERE TD_INTERVAL(time, '-1d', 'JST')
-- Previous week
WHERE TD_INTERVAL(time, '-1w', 'JST')
-- Previous month
WHERE TD_INTERVAL(time, '-1M', 'JST')
-- 2 days ago (offset syntax)
WHERE TD_INTERVAL(time, '-1d/-1d', 'JST')
-- 3 months ago (combined offset)
WHERE TD_INTERVAL(time, '-1M/-2M', 'JST')
```
**Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. Cannot accept TD_SCHEDULED_TIME as first argument, but including TD_SCHEDULED_TIME elsewhere in the query establishes the reference date.
**TD_TIME_RANGE** - Filter by time partitions (explicit dates):
```sql
TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
TD_TIME_RANGE(time, '2024-01-01') -- Single day
```
**TD_SCHEDULED_TIME()** - Get scheduled execution time:
```sql
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d') -- Yesterday
```
**TD_TIME_STRING** - Format timestamps (Recommended):
```sql
-- Uses simple format codes instead of full format strings
TD_TIME_STRING(time, 'd!', 'JST') -- Returns: 2018-09-13
TD_TIME_STRING(time, 's!', 'UTC') -- Returns: 2018-09-13 16:45:34
TD_TIME_STRING(time, 'M!', 'JST') -- Returns: 2018-09 (year-month)
TD_TIME_STRING(time, 'h!', 'UTC') -- Returns: 2018-09-13 16 (year-month-day hour)
-- With timezone in output (without ! suffix)
TD_TIME_STRING(time, 'd', 'JST') -- Returns: 2018-09-13 00:00:00+0900
TD_TIME_STRING(time, 's', 'UTC') -- Returns: 2018-09-13 16:45:34+0000
```
**Format codes:**
- `y!` = yyyy (year only)
- `q!` = yyyy-MM (quarter start)
- `M!` = yyyy-MM (month)
- `w!` = yyyy-MM-dd (week start)
- `d!` = yyyy-MM-dd (day)
- `h!` = yyyy-MM-dd HH (hour)
- `m!` = yyyy-MM-dd HH:mm (minute)
- `s!` = yyyy-MM-dd HH:mm:ss (second)
- Without the exclamation mark suffix, timezone offset is included
**TD_TIME_FORMAT** - Format timestamps (Legacy, use TD_TIME_STRING instead):
```sql
TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'UTC')
```
**TD_SESSIONIZE** - Sessionize events:
```sql
SELECT TD_SESSIONIZE(time, 1800, user_id) as session_id
FROM events
```
### 5. JOIN Optimization
**Put smaller table on the right side:**
```sql
-- Good
SELECT *
FROM large_table l
JOIN small_table s ON l.id = s.id
-- Consider table size when joining
```
**Use appropriate JOIN types:**
```sql
-- INNER JOIN for matching records only
-- LEFT JOIN when you need all records from left table
-- Avoid FULL OUTER JOIN when possible (expensive)
```
### 6. Data Types and Casting
Be explicit with data types:
```sql
CAST(column_name AS BIGINT)
CAST(column_name AS VARCHAR)
CAST(column_name AS DOUBLE)
TRY_CAST(column_name AS BIGINT) -- Returns NULL on failure
```
### 7. Window Functions
```sql
SELECT
user_id,
event_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01')
```
## Common Patterns
### User Event Analysis
```sql
-- Using TD_INTERVAL for last month
SELECT
TD_TIME_STRING(time, 'd!', 'JST') as date,
event_type,
COUNT(*) as event_count,
APPROX_DISTINCT(user_id) as unique_users
FROM database_name.events
WHERE TD_INTERVAL(time, '-1M', 'JST')
AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY 1, 2
ORDER BY 1, 2
```
**Alternative with explicit date range:**
```sql
SELECT
TD_TIME_STRING(time, 'd!', 'JST') as date,
event_type,
COUNT(*) as event_count,
APPROX_DISTINCT(user_id) as unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY 1, 2
ORDER BY 1, 2
```
### Conversion Funnel
```sql
WITH events_filtered AS (
SELECT
user_id,
event_type,
time
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
)
SELECT
COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) as step1_users,
COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) as step2_users,
COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as step3_users
FROM events_filtered
```
### Daily Aggregation
```sql
-- Using TD_INTERVAL for yesterday's data
SELECT
TD_TIME_STRING(time, 'd!', 'JST') as date,
COUNT(*) as total_events,
APPROX_DISTINCT(user_id) as daily_active_users,
AVG(session_duration) as avg_session_duration
FROM database_name.events
WHERE TD_INTERVAL(time, '-1d', 'JST')
AND TD_SCHEDULED_TIME() IS NOT NULL -- Establishes reference date for TD_INTERVAL
GROUP BY 1
ORDER BY 1
```
**For rolling 30-day window:**
```sql
SELECT
TD_TIME_STRING(time, 'd!', 'JST') as date,
COUNT(*) as total_events,
APPROX_DISTINCT(user_id) as daily_active_users,
AVG(session_duration) as avg_session_duration
FROM database_name.events
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-30d'), TD_SCHEDULED_TIME())
GROUP BY 1
ORDER BY 1
```
## Error Handling
### Common Errors
**"Line X:Y: Column 'time' cannot be resolved"**
- Ensure table name is correct
- Check that column exists in table schema
**"Query exceeded memory limit"**
- Add time filters with TD_TIME_RANGE
- Use APPROX_ functions instead of exact aggregations
- Reduce JOIN complexity or data volume
**"Partition not found"**
- Verify time range covers existing partitions
- Check TD_TIME_RANGE syntax
## Best Practices
1. **Always include time filters** using TD_INTERVAL or TD_TIME_RANGE for partition pruning
- Use TD_INTERVAL for relative dates: `WHERE TD_INTERVAL(time, '-1d', 'JST')`
- Use TD_TIME_RANGE for explicit dates: `WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')`
- Never filter by formatted dates: ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'`
2. **Use TD_TIME_STRING for display only**, not for filtering
-`SELECT TD_TIME_STRING(time, 'd!', 'JST') as date`
-`WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'`
3. **Use APPROX functions** for large-scale aggregations (APPROX_DISTINCT, APPROX_PERCENTILE)
4. **Limit exploratory queries** to reduce costs and scan time
5. **Test queries on small time ranges** before running on full dataset
6. **Use CTEs (WITH clauses)** for complex queries to improve readability
7. **Add comments** explaining business logic
8. **Consider materialized results** for frequently-run queries
## Example Workflow
When helping users write Trino queries:
1. **Understand the requirement** - What data do they need?
2. **Identify tables** - Which TD tables contain the data?
3. **Add time filters** - What time range is needed?
4. **Write base query** - Start with simple SELECT
5. **Add aggregations** - Use appropriate functions
6. **Optimize** - Apply performance best practices
7. **Test** - Validate results on small dataset first
## Resources
- Trino SQL documentation: https://trino.io/docs/current/
- TD-specific functions: Check internal TD documentation
- Query performance: Use EXPLAIN for query plans