Initial commit
skills/trino/SKILL.md
---
name: trino
description: Expert assistance for writing, analyzing, and optimizing Trino SQL queries for Treasure Data. Use this skill when users need help with Trino queries, performance optimization, or TD-specific SQL patterns.
---

# Trino SQL Expert

Expert assistance for writing and optimizing Trino SQL queries for Treasure Data environments.

## When to Use This Skill

Use this skill when:
- Writing new Trino SQL queries for TD
- Optimizing existing Trino queries for performance
- Debugging Trino query errors or issues
- Converting queries from other SQL dialects to Trino
- Implementing TD best practices for data processing

## Core Principles

### 1. TD Table Naming Conventions

Always use the TD table format:
```sql
SELECT * FROM database_name.table_name
```

### 2. Partitioning and Time-based Queries

TD tables are typically partitioned by time. Always include time filters for performance:

```sql
SELECT *
FROM database_name.table_name
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
```

Or use relative time ranges:
```sql
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME())
```

### 3. Performance Optimization

**Use APPROX functions for large datasets:**
```sql
SELECT
  APPROX_DISTINCT(user_id) as unique_users,
  APPROX_PERCENTILE(response_time, 0.95) as p95_response
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')  -- Single day (end is exclusive)
```

**Partition pruning:**
```sql
-- Good: Filters on partition column
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')

-- Avoid: Non-partition column filters without time filter
WHERE event_type = 'click'  -- Missing time filter!
```
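
The two fragments above can be combined in a single query. This is a minimal sketch, assuming a placeholder `database_name.events` table with an `event_type` column; the names are illustrative, not taken from a real schema.

```sql
-- Sketch: the time filter lets the engine skip partitions outside the range;
-- event_type then narrows the rows that remain.
-- Table and column names are placeholders.
SELECT
  event_type,
  COUNT(*) AS event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')  -- partition column
  AND event_type = 'click'                             -- non-partition column
GROUP BY event_type
```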

**Limit data scanned:**
```sql
-- Use LIMIT for exploratory queries
SELECT * FROM database_name.table_name
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
LIMIT 1000
```

### 4. Common TD Functions

**TD_INTERVAL** - Simplified relative time filtering (Recommended):
```sql
-- Current day
WHERE TD_INTERVAL(time, '1d', 'JST')

-- Yesterday
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Previous week
WHERE TD_INTERVAL(time, '-1w', 'JST')

-- Previous month
WHERE TD_INTERVAL(time, '-1M', 'JST')

-- 2 days ago (offset syntax)
WHERE TD_INTERVAL(time, '-1d/-1d', 'JST')

-- 3 months ago (combined offset)
WHERE TD_INTERVAL(time, '-1M/-2M', 'JST')
```

**Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. It cannot take TD_SCHEDULED_TIME as its first argument, but when TD_SCHEDULED_TIME appears elsewhere in the query, it establishes the reference date for TD_INTERVAL.

**TD_TIME_RANGE** - Filter by time partitions (explicit dates):
```sql
TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')  -- Single day (end is exclusive)
```
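
To make the range boundaries concrete, the sketch below spells out the half-open interval that TD_TIME_RANGE is understood to apply (start included, end excluded). It assumes `time` holds Unix seconds and uses TD_TIME_PARSE to convert the date strings; treat it as an illustration of the semantics rather than a replacement for TD_TIME_RANGE.

```sql
-- Sketch only: the [start, end) interval written as plain comparisons.
-- Prefer TD_TIME_RANGE in real queries.
SELECT COUNT(*) AS january_events
FROM database_name.table_name
WHERE time >= TD_TIME_PARSE('2024-01-01', 'UTC')  -- start is included
  AND time <  TD_TIME_PARSE('2024-02-01', 'UTC')  -- end is excluded
```

Under this reading, covering all of January requires an end of `'2024-02-01'`; an end of `'2024-01-31'` stops at the start of January 31.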

**TD_SCHEDULED_TIME()** - Get scheduled execution time:
```sql
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d')  -- Yesterday
```

**TD_TIME_STRING** - Format timestamps (Recommended):
```sql
-- Uses simple format codes instead of full format strings
TD_TIME_STRING(time, 'd!', 'JST')  -- Returns: 2018-09-13
TD_TIME_STRING(time, 's!', 'UTC')  -- Returns: 2018-09-13 16:45:34
TD_TIME_STRING(time, 'M!', 'JST')  -- Returns: 2018-09 (year-month)
TD_TIME_STRING(time, 'h!', 'UTC')  -- Returns: 2018-09-13 16 (year-month-day hour)

-- With timezone in output (without ! suffix)
TD_TIME_STRING(time, 'd', 'JST')   -- Returns: 2018-09-13 00:00:00+0900
TD_TIME_STRING(time, 's', 'UTC')   -- Returns: 2018-09-13 16:45:34+0000
```

**Format codes:**
- `y!` = yyyy (year only)
- `q!` = yyyy-MM (quarter start)
- `M!` = yyyy-MM (month)
- `w!` = yyyy-MM-dd (week start)
- `d!` = yyyy-MM-dd (day)
- `h!` = yyyy-MM-dd HH (hour)
- `m!` = yyyy-MM-dd HH:mm (minute)
- `s!` = yyyy-MM-dd HH:mm:ss (second)
- Without the exclamation mark suffix, timezone offset is included

**TD_TIME_FORMAT** - Format timestamps (Legacy, use TD_TIME_STRING instead):
```sql
TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'UTC')
```

**TD_SESSIONIZE** - Sessionize events:
```sql
SELECT TD_SESSIONIZE(time, 1800, user_id) as session_id
FROM events
```
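
To show what sessionization computes, here is a sketch that numbers each user's sessions with standard window functions. It is not how TD_SESSIONIZE is implemented and it yields a per-user sequence number rather than a session ID string. It assumes `time` holds Unix seconds, a 1800-second timeout, and a placeholder `database_name.events` table.

```sql
-- Sketch: a new session starts on a user's first event, or when the gap
-- since that user's previous event exceeds 1800 seconds.
WITH flagged AS (
  SELECT
    user_id,
    time,
    CASE
      WHEN LAG(time) OVER (PARTITION BY user_id ORDER BY time) IS NULL
        OR time - LAG(time) OVER (PARTITION BY user_id ORDER BY time) > 1800
      THEN 1
      ELSE 0
    END AS is_new_session
  FROM database_name.events
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
)
SELECT
  user_id,
  time,
  -- Running total of session starts gives a per-user session number
  SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY time) AS session_seq
FROM flagged
```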

### 5. JOIN Optimization

**Put smaller table on the right side:**
```sql
-- Good
SELECT *
FROM large_table l
JOIN small_table s ON l.id = s.id

-- Consider table size when joining
```
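
A slightly fuller sketch of the same idea: trim the large table with a time filter and a narrow column list before joining, and keep the smaller table on the right. The `events` and `users` tables and the `country` column are hypothetical placeholders.

```sql
-- Sketch: reduce the large side first, then join to the small side.
WITH recent_events AS (
  SELECT user_id, event_type
  FROM database_name.events            -- large fact table (placeholder)
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
)
SELECT
  u.country,
  COUNT(*) AS event_count
FROM recent_events e
JOIN database_name.users u             -- small dimension table (placeholder)
  ON e.user_id = u.user_id
GROUP BY u.country
```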

**Use appropriate JOIN types:**
```sql
-- INNER JOIN for matching records only
-- LEFT JOIN when you need all records from left table
-- Avoid FULL OUTER JOIN when possible (expensive)
```

### 6. Data Types and Casting

Be explicit with data types:
```sql
CAST(column_name AS BIGINT)
CAST(column_name AS VARCHAR)
CAST(column_name AS DOUBLE)
TRY_CAST(column_name AS BIGINT)  -- Returns NULL on failure
```
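
One way TRY_CAST might be used in practice is to measure how many values fail to parse instead of letting a plain CAST abort the query. The sketch below assumes a hypothetical `database_name.orders` table with a VARCHAR column `raw_amount`.

```sql
-- Sketch: count unparseable values and sum the parseable ones.
-- SUM ignores the NULLs that TRY_CAST returns on failure.
SELECT
  COUNT(*) AS total_rows,
  COUNT_IF(raw_amount IS NOT NULL
           AND TRY_CAST(raw_amount AS DOUBLE) IS NULL) AS unparseable_rows,
  SUM(TRY_CAST(raw_amount AS DOUBLE)) AS total_amount
FROM database_name.orders
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
```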

### 7. Window Functions

```sql
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq,
  LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')  -- Single day (end is exclusive)
```
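
A common use of ROW_NUMBER is keeping only the most recent row per key. The following sketch assumes a placeholder `database_name.events` table and orders by the `time` column.

```sql
-- Sketch: keep only the latest event per user within the time range.
SELECT user_id, event_type, time
FROM (
  SELECT
    user_id,
    event_type,
    time,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time DESC) AS rn
  FROM database_name.events
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
) ranked
WHERE rn = 1
```

If two events for the same user share the same `time`, which one gets `rn = 1` is not deterministic; add a tiebreaker column to the ORDER BY when that matters.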

## Common Patterns

### User Event Analysis
```sql
-- Using TD_INTERVAL for last month
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  event_type,
  COUNT(*) as event_count,
  APPROX_DISTINCT(user_id) as unique_users
FROM database_name.events
WHERE TD_INTERVAL(time, '-1M', 'JST')
  AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY 1, 2
ORDER BY 1, 2
```

**Alternative with explicit date range:**
```sql
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  event_type,
  COUNT(*) as event_count,
  APPROX_DISTINCT(user_id) as unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-02-01')  -- All of January (end is exclusive)
  AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY 1, 2
ORDER BY 1, 2
```

### Conversion Funnel
```sql
WITH events_filtered AS (
  SELECT
    user_id,
    event_type,
    time
  FROM database_name.events
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
)
SELECT
  COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) as step1_users,
  COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) as step2_users,
  COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as step3_users
FROM events_filtered
```
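
The funnel counts above can be extended with step-to-step conversion rates. This sketch swaps the exact counts for APPROX_DISTINCT with a FILTER clause, so the results are approximate, and it guards against division by zero with NULLIF. As in the query above, each step is counted independently; it does not check that a user performed the steps in order.

```sql
-- Sketch: approximate funnel counts plus conversion rates between steps.
WITH funnel AS (
  SELECT
    APPROX_DISTINCT(user_id) FILTER (WHERE event_type = 'page_view')   AS step1_users,
    APPROX_DISTINCT(user_id) FILTER (WHERE event_type = 'add_to_cart') AS step2_users,
    APPROX_DISTINCT(user_id) FILTER (WHERE event_type = 'purchase')    AS step3_users
  FROM database_name.events
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
)
SELECT
  step1_users,
  step2_users,
  step3_users,
  -- NULLIF turns a zero denominator into NULL instead of raising an error
  CAST(step2_users AS DOUBLE) / NULLIF(step1_users, 0) AS step1_to_step2_rate,
  CAST(step3_users AS DOUBLE) / NULLIF(step2_users, 0) AS step2_to_step3_rate
FROM funnel
```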

### Daily Aggregation
```sql
-- Using TD_INTERVAL for yesterday's data
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  COUNT(*) as total_events,
  APPROX_DISTINCT(user_id) as daily_active_users,
  AVG(session_duration) as avg_session_duration
FROM database_name.events
WHERE TD_INTERVAL(time, '-1d', 'JST')
  AND TD_SCHEDULED_TIME() IS NOT NULL  -- Establishes reference date for TD_INTERVAL
GROUP BY 1
ORDER BY 1
```

**For rolling 30-day window:**
```sql
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  COUNT(*) as total_events,
  APPROX_DISTINCT(user_id) as daily_active_users,
  AVG(session_duration) as avg_session_duration
FROM database_name.events
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-30d'), TD_SCHEDULED_TIME())
GROUP BY 1
ORDER BY 1
```

## Error Handling

### Common Errors

**"Line X:Y: Column 'time' cannot be resolved"**
- Ensure table name is correct
- Check that column exists in table schema

**"Query exceeded memory limit"**
- Add time filters with TD_TIME_RANGE
- Use APPROX_ functions instead of exact aggregations
- Reduce JOIN complexity or data volume

**"Partition not found"**
- Verify time range covers existing partitions
- Check TD_TIME_RANGE syntax
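
As an illustration of the "Query exceeded memory limit" advice, the sketch below rewrites an unbounded exact distinct count into a time-bounded approximate one. The table name is a placeholder, and the result of APPROX_DISTINCT is an estimate rather than an exact count.

```sql
-- Before: exact distinct count with no time filter can be memory-heavy.
-- SELECT COUNT(DISTINCT user_id) FROM database_name.events

-- After (sketch): bounded time range plus an approximate aggregate.
SELECT APPROX_DISTINCT(user_id) AS unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-08')
```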

## Best Practices

1. **Always include time filters** using TD_INTERVAL or TD_TIME_RANGE for partition pruning
   - Use TD_INTERVAL for relative dates: `WHERE TD_INTERVAL(time, '-1d', 'JST')`
   - Use TD_TIME_RANGE for explicit dates: `WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')`
   - Never filter by formatted dates: ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'`
2. **Use TD_TIME_STRING for display only**, not for filtering
   - ✅ `SELECT TD_TIME_STRING(time, 'd!', 'JST') as date`
   - ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'`
3. **Use APPROX functions** for large-scale aggregations (APPROX_DISTINCT, APPROX_PERCENTILE)
4. **Limit exploratory queries** to reduce costs and scan time
5. **Test queries on small time ranges** before running on full dataset
6. **Use CTEs (WITH clauses)** for complex queries to improve readability
7. **Add comments** explaining business logic
8. **Consider materialized results** for frequently-run queries
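
Practices 1 and 2 can be seen side by side in one query. This is a sketch against a placeholder `database_name.events` table: the filter is applied to `time` directly, and TD_TIME_STRING appears only in the SELECT list.

```sql
-- Avoid: wrapping time in a formatting function inside the filter.
-- WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'

-- Prefer (sketch): filter on time, format only for display.
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(*) AS event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
GROUP BY 1
```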

## Example Workflow

When helping users write Trino queries:

1. **Understand the requirement** - What data do they need?
2. **Identify tables** - Which TD tables contain the data?
3. **Add time filters** - What time range is needed?
4. **Write base query** - Start with simple SELECT
5. **Add aggregations** - Use appropriate functions
6. **Optimize** - Apply performance best practices
7. **Test** - Validate results on small dataset first

## Resources

- Trino SQL documentation: https://trino.io/docs/current/
- TD-specific functions: Check internal TD documentation
- Query performance: Use EXPLAIN for query plans
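
For example, a query plan can be requested by prefixing a statement with EXPLAIN. The query below is a placeholder; the plan output depends on the engine version and the table, so none is shown here.

```sql
-- Sketch: show the plan without running the query.
EXPLAIN
SELECT
  event_type,
  COUNT(*) AS event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
GROUP BY event_type
```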