Initial commit

.claude-plugin/plugin.json (17 lines, new file)

@@ -0,0 +1,17 @@
{
  "name": "sql-skills",
  "description": "SQL query skills for Treasure Data including Trino and Hive query optimization, Trino CLI for interactive queries, and TD MCP server for Claude Code integration with natural language data exploration",
  "version": "0.0.0-2025.11.28",
  "author": {
    "name": "Treasure Data",
    "email": "support@treasuredata.com"
  },
  "skills": [
    "./skills/trino",
    "./skills/hive",
    "./skills/trino-optimizer",
    "./skills/trino-to-hive-migration",
    "./skills/trino-cli",
    "./skills/td-mcp"
  ]
}

README.md (3 lines, new file)

@@ -0,0 +1,3 @@
# sql-skills

SQL query skills for Treasure Data including Trino and Hive query optimization, Trino CLI for interactive queries, and TD MCP server for Claude Code integration with natural language data exploration

plugin.lock.json (64 lines, new file)

@@ -0,0 +1,64 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:treasure-data/td-skills:sql-skills",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "be11a7b269c7a056be9b0b9ba2a6fc107c021794",
    "treeHash": "9a82d83f14ed93aea716dd9c7eb152d182c9a30c49a5d26d70783f8189607b1d",
    "generatedAt": "2025-11-28T10:28:45.593642Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "sql-skills",
    "description": "SQL query skills for Treasure Data including Trino and Hive query optimization, Trino CLI for interactive queries, and TD MCP server for Claude Code integration with natural language data exploration"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "a4de22a59c87f2b7162b8f73ad104b3b932dbd2e946ab8fb8d27c9398933bd06"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "a3b736c3ac880bbc43dd9c10b3ab604c35fbdbb8803a515d3d873d1ffc3ec438"
      },
      {
        "path": "skills/trino/SKILL.md",
        "sha256": "21e71c0b2eaf7df9590b0d433c03e48d91ba712efb025bf28af9c369ab8bd4a2"
      },
      {
        "path": "skills/trino-to-hive-migration/SKILL.md",
        "sha256": "9a883149fe5ccfbef88485fbb7155160cb841de84c792bd835e017f51ed3409a"
      },
      {
        "path": "skills/trino-cli/SKILL.md",
        "sha256": "5bb2313a8f4b6027ebe2da30c976a85b3d559f01650b23e7cf1b4ac3b8b2e7ea"
      },
      {
        "path": "skills/hive/SKILL.md",
        "sha256": "e7fc2b23df674dd6e59e9afe7150a12caa371142dc49fbfe0590963c0655e8e6"
      },
      {
        "path": "skills/trino-optimizer/SKILL.md",
        "sha256": "fd9facf74d2bb4bd4420a08bf7c3a1a90f03c92e403a67a150b2a328f4cd82f6"
      },
      {
        "path": "skills/td-mcp/SKILL.md",
        "sha256": "040b952adbe407626eeecbe3cff9660bde6f41bd52a5237c9164fa234c597b3c"
      }
    ],
    "dirSha256": "9a82d83f14ed93aea716dd9c7eb152d182c9a30c49a5d26d70783f8189607b1d"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}

skills/hive/SKILL.md (427 lines, new file)

@@ -0,0 +1,427 @@
---
|
||||
name: hive
|
||||
description: Expert assistance for writing, analyzing, and optimizing Hive SQL queries for Treasure Data. Use this skill when users need help with Hive queries, MapReduce optimization, or legacy TD Hive workflows.
|
||||
---
|
||||
|
||||
# Hive SQL Expert
|
||||
|
||||
Expert assistance for writing and optimizing Hive SQL queries for Treasure Data environments.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Writing Hive SQL queries for TD
|
||||
- Maintaining or updating legacy Hive workflows
|
||||
- Optimizing Hive query performance
|
||||
- Converting queries to/from Hive dialect
|
||||
- Working with Hive-specific features (SerDes, UDFs, etc.)
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. TD Table Access
|
||||
|
||||
Access TD tables using database.table notation:
|
||||
```sql
|
||||
SELECT * FROM database_name.table_name
|
||||
```
|
||||
|
||||
### 2. Time-based Partitioning
|
||||
|
||||
TD Hive tables are partitioned by time. Always use time predicates:
|
||||
|
||||
```sql
|
||||
SELECT *
|
||||
FROM database_name.table_name
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
|
||||
```
|
||||
|
||||
Unix timestamp format:
|
||||
```sql
|
||||
WHERE time >= unix_timestamp('2024-01-01 00:00:00')
|
||||
AND time < unix_timestamp('2024-01-02 00:00:00')
|
||||
```
|
||||
|
||||
### 3. Performance Optimization
|
||||
|
||||
**Use columnar formats:**
|
||||
- TD tables are typically stored in columnar format (ORC/Parquet)
|
||||
- Select only needed columns to reduce I/O
|
||||
|
||||
**Partition pruning:**
|
||||
```sql
|
||||
-- Good: Uses partition columns
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
|
||||
|
||||
-- Good: Direct time filter
|
||||
WHERE time >= 1704067200 AND time < 1704153600
|
||||
```
|
||||
|
||||
**Limit during development:**
|
||||
```sql
|
||||
SELECT * FROM table_name
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
LIMIT 1000
|
||||
```
|
||||
|
||||
### 4. Common TD Hive Functions
|
||||
|
||||
**TD_INTERVAL** - Simplified relative time filtering (Recommended):
|
||||
```sql
|
||||
-- Current day
|
||||
WHERE TD_INTERVAL(time, '1d', 'JST')
|
||||
|
||||
-- Yesterday
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Previous week
|
||||
WHERE TD_INTERVAL(time, '-1w', 'JST')
|
||||
|
||||
-- Previous month
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
|
||||
-- 2 days ago (offset syntax)
|
||||
WHERE TD_INTERVAL(time, '-1d/-1d', 'JST')
|
||||
|
||||
-- 3 months ago (combined offset)
|
||||
WHERE TD_INTERVAL(time, '-1M/-2M', 'JST')
|
||||
```
|
||||
|
||||
**Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. It cannot accept TD_SCHEDULED_TIME() as its first argument, but referencing TD_SCHEDULED_TIME() elsewhere in the query establishes the reference date that TD_INTERVAL is evaluated against.
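
For example, a minimal sketch of a scheduled daily job (placeholder table name), where referencing TD_SCHEDULED_TIME() elsewhere in the statement anchors TD_INTERVAL's relative window to the scheduled run:

```sql
-- Count events for the day before the scheduled run. Referencing
-- TD_SCHEDULED_TIME() elsewhere in the query establishes the reference date.
SELECT
  MAX(TD_TIME_FORMAT(TD_SCHEDULED_TIME(), 'yyyy-MM-dd', 'JST')) as run_date,
  COUNT(*) as event_count
FROM database_name.events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```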
|
||||
|
||||
**TD_TIME_RANGE** - Partition-aware time filtering (explicit dates):
|
||||
```sql
|
||||
TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
|
||||
TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST') -- Open-ended
|
||||
```
|
||||
|
||||
**TD_SCHEDULED_TIME()** - Get workflow execution time:
|
||||
```sql
|
||||
SELECT TD_SCHEDULED_TIME()
|
||||
-- Returns Unix timestamp of scheduled run
|
||||
```
|
||||
|
||||
**TD_TIME_FORMAT** - Format Unix timestamps:
|
||||
```sql
|
||||
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST')
|
||||
```
|
||||
|
||||
**TD_TIME_PARSE** - Parse string to Unix timestamp:
|
||||
```sql
|
||||
SELECT TD_TIME_PARSE('2024-01-01', 'JST')
|
||||
```
|
||||
|
||||
**TD_DATE_TRUNC** - Truncate timestamp to day/hour/etc:
|
||||
```sql
|
||||
SELECT TD_DATE_TRUNC('day', time, 'JST')
|
||||
SELECT TD_DATE_TRUNC('hour', time, 'UTC')
|
||||
```
|
||||
|
||||
### 5. JOIN Optimization
|
||||
|
||||
**MapReduce JOIN strategies:**
|
||||
|
||||
```sql
|
||||
-- Map-side JOIN for small tables (use /*+ MAPJOIN */ hint)
|
||||
SELECT /*+ MAPJOIN(small_table) */
|
||||
l.*,
|
||||
s.attribute
|
||||
FROM large_table l
|
||||
JOIN small_table s ON l.id = s.id
|
||||
WHERE TD_TIME_RANGE(l.time, '2024-01-01')
|
||||
```
|
||||
|
||||
**Reduce-side JOIN:**
|
||||
```sql
|
||||
-- Default for large-to-large joins
|
||||
SELECT *
|
||||
FROM table1 t1
|
||||
JOIN table2 t2 ON t1.key = t2.key
|
||||
WHERE TD_TIME_RANGE(t1.time, '2024-01-01')
|
||||
AND TD_TIME_RANGE(t2.time, '2024-01-01')
|
||||
```
|
||||
|
||||
### 6. Aggregations
|
||||
|
||||
**Standard aggregations:**
|
||||
```sql
|
||||
SELECT
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
|
||||
COUNT(*) as total_count,
|
||||
COUNT(DISTINCT user_id) as unique_users,
|
||||
AVG(value) as avg_value,
|
||||
SUM(amount) as total_amount
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')
|
||||
```
|
||||
|
||||
**Approximate aggregations for large datasets:**
|
||||
```sql
|
||||
-- Not built-in, but can use sampling
|
||||
SELECT COUNT(*) * 10 as estimated_count
|
||||
FROM table_name
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
AND rand() < 0.1 -- 10% sample
|
||||
```
|
||||
|
||||
### 7. Data Types and Casting
|
||||
|
||||
Hive type casting:
|
||||
```sql
|
||||
CAST(column_name AS BIGINT)
|
||||
CAST(column_name AS STRING)
|
||||
CAST(column_name AS DOUBLE)
|
||||
CAST(column_name AS DECIMAL(10,2))
|
||||
```
|
||||
|
||||
### 8. Window Functions
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
event_time,
|
||||
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq,
|
||||
LAG(event_time, 1) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
|
||||
FROM events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
```
|
||||
|
||||
### 9. Array and Map Operations
|
||||
|
||||
**Array functions:**
|
||||
```sql
|
||||
SELECT
|
||||
array_contains(tags, 'premium') as is_premium,
|
||||
size(tags) as tag_count,
|
||||
tags[0] as first_tag
|
||||
FROM user_profiles
|
||||
```
|
||||
|
||||
**Map functions:**
|
||||
```sql
|
||||
SELECT
|
||||
map_keys(attributes) as attribute_names,
|
||||
map_values(attributes) as attribute_values,
|
||||
attributes['country'] as country
|
||||
FROM events
|
||||
```
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Daily Event Aggregation
|
||||
```sql
|
||||
SELECT
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
|
||||
event_type,
|
||||
COUNT(*) as event_count,
|
||||
COUNT(DISTINCT user_id) as unique_users
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
|
||||
GROUP BY
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST'),
|
||||
event_type
|
||||
ORDER BY date, event_type
|
||||
```
|
||||
|
||||
### User Segmentation
|
||||
```sql
|
||||
SELECT
|
||||
CASE
|
||||
WHEN purchase_count >= 10 THEN 'high_value'
|
||||
WHEN purchase_count >= 5 THEN 'medium_value'
|
||||
ELSE 'low_value'
|
||||
END as segment,
|
||||
COUNT(*) as user_count,
|
||||
AVG(total_spend) as avg_spend
|
||||
FROM (
|
||||
SELECT
|
||||
user_id,
|
||||
COUNT(*) as purchase_count,
|
||||
SUM(amount) as total_spend
|
||||
FROM database_name.purchases
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
|
||||
GROUP BY user_id
|
||||
) user_stats
|
||||
GROUP BY
|
||||
CASE
|
||||
WHEN purchase_count >= 10 THEN 'high_value'
|
||||
WHEN purchase_count >= 5 THEN 'medium_value'
|
||||
ELSE 'low_value'
|
||||
END
|
||||
```
|
||||
|
||||
### Session Analysis
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
session_id,
|
||||
MIN(time) as session_start,
|
||||
MAX(time) as session_end,
|
||||
COUNT(*) as events_in_session
|
||||
FROM (
|
||||
SELECT
|
||||
user_id,
|
||||
time,
|
||||
SUM(is_new_session) OVER (
|
||||
PARTITION BY user_id
|
||||
ORDER BY time
|
||||
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|
||||
) as session_id
|
||||
FROM (
|
||||
SELECT
|
||||
user_id,
|
||||
time,
|
||||
CASE
|
||||
WHEN time - LAG(time) OVER (PARTITION BY user_id ORDER BY time) > 1800
|
||||
OR LAG(time) OVER (PARTITION BY user_id ORDER BY time) IS NULL
|
||||
THEN 1
|
||||
ELSE 0
|
||||
END as is_new_session
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
|
||||
) with_session_flag
|
||||
) with_session_id
|
||||
GROUP BY user_id, session_id
|
||||
```
|
||||
|
||||
### Cohort Analysis
|
||||
```sql
|
||||
WITH first_purchase AS (
|
||||
SELECT
|
||||
user_id,
|
||||
TD_TIME_FORMAT(MIN(time), 'yyyy-MM', 'JST') as cohort_month
|
||||
FROM database_name.purchases
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')
|
||||
GROUP BY user_id
|
||||
),
|
||||
monthly_purchases AS (
|
||||
SELECT
|
||||
user_id,
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM', 'JST') as purchase_month,
|
||||
SUM(amount) as monthly_spend
|
||||
FROM database_name.purchases
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')
|
||||
GROUP BY user_id, TD_TIME_FORMAT(time, 'yyyy-MM', 'JST')
|
||||
)
|
||||
SELECT
|
||||
f.cohort_month,
|
||||
m.purchase_month,
|
||||
COUNT(DISTINCT m.user_id) as active_users,
|
||||
SUM(m.monthly_spend) as total_spend
|
||||
FROM first_purchase f
|
||||
JOIN monthly_purchases m ON f.user_id = m.user_id
|
||||
GROUP BY f.cohort_month, m.purchase_month
|
||||
ORDER BY f.cohort_month, m.purchase_month
|
||||
```
|
||||
|
||||
## Hive-Specific Features
|
||||
|
||||
### SerDe (Serializer/Deserializer)
|
||||
|
||||
When working with JSON data:
|
||||
```sql
|
||||
-- Usually handled automatically in TD, but awareness is important
|
||||
-- JSON SerDe allows querying nested JSON structures
|
||||
SELECT
|
||||
get_json_object(json_column, '$.user.id') as user_id,
|
||||
get_json_object(json_column, '$.event.type') as event_type
|
||||
FROM raw_events
|
||||
```
|
||||
|
||||
### LATERAL VIEW with EXPLODE
|
||||
|
||||
Flatten arrays:
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
tag
|
||||
FROM user_profiles
|
||||
LATERAL VIEW EXPLODE(tags) tags_table AS tag
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
```
|
||||
|
||||
Multiple LATERAL VIEWs:
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
tag,
|
||||
category
|
||||
FROM user_profiles
|
||||
LATERAL VIEW EXPLODE(tags) tags_table AS tag
|
||||
LATERAL VIEW EXPLODE(categories) cat_table AS category
|
||||
```
|
||||
|
||||
### Dynamic Partitioning
|
||||
|
||||
When creating tables (less common in TD):
|
||||
```sql
|
||||
SET hive.exec.dynamic.partition = true;
|
||||
SET hive.exec.dynamic.partition.mode = nonstrict;
|
||||
|
||||
INSERT OVERWRITE TABLE target_table PARTITION(dt)
|
||||
SELECT *, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt
|
||||
FROM source_table
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Common Errors
|
||||
|
||||
**"FAILED: SemanticException Column time does not exist"**
|
||||
- Check table schema
|
||||
- Ensure table name is correct
|
||||
|
||||
**"OutOfMemoryError: Java heap space"**
|
||||
- Reduce time range in query
|
||||
- Use LIMIT for testing
|
||||
- Optimize JOINs (use MAPJOIN hint for small tables)
|
||||
|
||||
**"Too many dynamic partitions"**
|
||||
- Reduce partition count
|
||||
- Check dynamic partition settings
|
||||
|
||||
**"Expression not in GROUP BY key"**
|
||||
- All non-aggregated columns must be in GROUP BY
|
||||
- Or wrap them in aggregate functions (MAX, MIN, etc.), as shown below
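
A minimal sketch of the fix (placeholder table and column names): every selected non-aggregate expression is repeated in GROUP BY.

```sql
-- Fails: event_type is selected but not grouped
-- SELECT event_type, COUNT(*) FROM database_name.events WHERE ...

-- Works: repeat the expression in GROUP BY (or wrap it in an aggregate)
SELECT
  event_type,
  COUNT(*) as event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
GROUP BY event_type
```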
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always use time filters** with TD_TIME_RANGE or direct time comparisons
|
||||
2. **Select only needed columns** to reduce I/O
|
||||
3. **Use MAPJOIN hint** for small table joins
|
||||
4. **Test on small time ranges** before full runs
|
||||
5. **Use appropriate timezone** (JST for Japan data)
|
||||
6. **Avoid `SELECT *`** in production queries
|
||||
7. **Use CTEs (WITH clauses)** for complex queries
|
||||
8. **Consider data volume** - Hive is batch-oriented
|
||||
9. **Monitor query progress** in TD console
|
||||
10. **Add comments** explaining business logic
|
||||
|
||||
## Migration Notes: Hive to Trino
|
||||
|
||||
When migrating from Hive to Trino:
|
||||
- Most syntax is compatible
|
||||
- Trino is generally faster for interactive queries
|
||||
- Some Hive UDFs may need replacement
|
||||
- Window functions syntax is similar
|
||||
- Approximate functions (APPROX_*) are more efficient in Trino, as shown below
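
As an illustrative sketch (not an exhaustive mapping; placeholder table name), an exact distinct count in Hive can often be replaced with Trino's APPROX_DISTINCT when an estimate is acceptable:

```sql
-- Hive: exact, slower on large data
SELECT COUNT(DISTINCT user_id) as unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')

-- Trino: approximate, typically much faster
SELECT APPROX_DISTINCT(user_id) as unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
```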
|
||||
|
||||
## Example Workflow
|
||||
|
||||
When helping users write Hive queries:
|
||||
|
||||
1. **Understand requirements** - What analysis is needed?
|
||||
2. **Identify tables** - Which TD tables to query?
|
||||
3. **Add time filters** - Always include TD_TIME_RANGE
|
||||
4. **Write base query** - Start simple
|
||||
5. **Add transformations** - Aggregations, JOINs, etc.
|
||||
6. **Optimize** - Use MAPJOIN hints, select only needed columns
|
||||
7. **Test** - Run on small dataset first
|
||||
8. **Scale** - Extend to full time range
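
A short sketch of the test-then-scale steps (placeholder table name): verify the logic on a single day with a LIMIT, then remove the LIMIT and widen the time range.

```sql
-- Steps 4-7: base query on a single day, capped for quick feedback
SELECT event_type, COUNT(*) as cnt
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
GROUP BY event_type
LIMIT 100

-- Step 8: same query scaled to the full month, LIMIT removed
SELECT event_type, COUNT(*) as cnt
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-02-01', 'JST')
GROUP BY event_type
```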
|
||||
|
||||
## Resources
|
||||
|
||||
- Hive documentation: https://cwiki.apache.org/confluence/display/Hive
|
||||
- TD Hive functions: Check internal TD documentation
|
||||
- Consider migrating to Trino for better performance
|
||||

skills/td-mcp/SKILL.md (901 lines, new file)

@@ -0,0 +1,901 @@
---
|
||||
name: td-mcp
|
||||
description: Expert assistance for connecting Claude Code to Treasure Data via MCP (Model Context Protocol) server. Use this skill when users need help setting up TD MCP integration, using MCP tools to query TD, managing databases and tables through MCP, or troubleshooting MCP connections.
|
||||
---
|
||||
|
||||
# Treasure Data MCP Server
|
||||
|
||||
Expert assistance for integrating Treasure Data with Claude Code using the Model Context Protocol (MCP) server.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Setting up Claude Code to access Treasure Data via MCP
|
||||
- Configuring TD MCP server for the first time
|
||||
- Using MCP tools to query TD databases and tables
|
||||
- Exploring TD data through Claude Code's natural language interface
|
||||
- Managing TD workflows and CDP segments via MCP
|
||||
- Troubleshooting MCP connection or authentication issues
|
||||
- Understanding available MCP tools and their usage
|
||||
- Switching between TD regions (US, JP, EU, AP)
|
||||
|
||||
## What is TD MCP Server?
|
||||
|
||||
The Treasure Data MCP (Model Context Protocol) server enables Claude Code and other AI assistants to interact with Treasure Data through a secure, controlled interface. It provides:
|
||||
|
||||
- **Direct TD Access**: Query databases, tables, and execute SQL from Claude Code
|
||||
- **Read-Only by Default**: Secure access with optional write operations
|
||||
- **Multi-Region Support**: Works with all TD deployment regions
|
||||
- **Natural Language Queries**: Ask questions about your data in plain English
|
||||
- **Workflow Management**: Monitor and control TD workflows
|
||||
- **CDP Integration**: Manage customer segments and activations
|
||||
|
||||
**Status**: Public preview (free during preview, usage-based pricing planned)
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Installation
|
||||
|
||||
**Prerequisites:**
|
||||
- Node.js 18.0.0 or higher
|
||||
- TD API key with appropriate permissions
|
||||
- Claude Code installed
|
||||
|
||||
**Check Node.js Version:**
|
||||
```bash
|
||||
node --version # Should show v18.0.0 or higher
|
||||
```
|
||||
|
||||
**Install Node.js if needed:**
|
||||
```bash
|
||||
# macOS (Homebrew)
|
||||
brew install node
|
||||
|
||||
# Windows (winget)
|
||||
winget install OpenJS.NodeJS
|
||||
|
||||
# Linux (Ubuntu/Debian)
|
||||
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
|
||||
sudo apt-get install -y nodejs
|
||||
```
|
||||
|
||||
**No Installation Required:**
|
||||
The MCP server runs via npx, so no global installation is needed. However, you can optionally install globally:
|
||||
|
||||
```bash
|
||||
npm install -g @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
### 2. Setup with Claude Code
|
||||
|
||||
**Quick Setup:**
|
||||
```bash
|
||||
# Set your TD API key
|
||||
export TD_API_KEY="your_api_key_here"
|
||||
|
||||
# Add TD MCP server to Claude Code
|
||||
claude mcp add td -e TD_API_KEY=$TD_API_KEY -- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
**With Custom Configuration:**
|
||||
```bash
|
||||
# US region with default database
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_SITE=us01 \
|
||||
-e TD_DATABASE=sample_datasets \
|
||||
-- npx @treasuredata/mcp-server
|
||||
|
||||
# Tokyo region
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_SITE=jp01 \
|
||||
-- npx @treasuredata/mcp-server
|
||||
|
||||
# EU region
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_SITE=eu01 \
|
||||
-- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
**Enable Write Operations (Optional):**
|
||||
```bash
|
||||
# Enable INSERT, UPDATE, DELETE, CREATE, DROP operations
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_ENABLE_UPDATES=true \
|
||||
-- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
**Verify Installation:**
|
||||
```bash
|
||||
# List installed MCP servers
|
||||
claude mcp list
|
||||
|
||||
# Should show 'td' in the list
|
||||
```
|
||||
|
||||
### 3. Configuration Options
|
||||
|
||||
**Environment Variables:**
|
||||
|
||||
| Variable | Required | Default | Description |
|
||||
|----------|----------|---------|-------------|
|
||||
| `TD_API_KEY` | Yes | - | Your Treasure Data API key |
|
||||
| `TD_SITE` | No | `us01` | TD region: us01, jp01, eu01, ap02, ap03 |
|
||||
| `TD_DATABASE` | No | - | Default database for queries |
|
||||
| `TD_ENABLE_UPDATES` | No | `false` | Enable write operations |
|
||||
|
||||
**Regional Endpoints:**
|
||||
- `us01` - United States (default)
|
||||
- `jp01` - Tokyo, Japan
|
||||
- `eu01` - European Union
|
||||
- `ap02` - Asia Pacific (Seoul)
|
||||
- `ap03` - Asia Pacific (Tokyo alternate)
|
||||
|
||||
### 4. Using MCP Tools in Claude Code
|
||||
|
||||
Once configured, you can use natural language to interact with TD:
|
||||
|
||||
**Example Queries:**
|
||||
```
|
||||
"Show me all databases in my TD account"
|
||||
"List tables in the sample_datasets database"
|
||||
"What's the schema of the nasdaq table?"
|
||||
"Query the top 10 symbols by count from nasdaq table"
|
||||
"Show me yesterday's events from the user_events table"
|
||||
```
|
||||
|
||||
Claude Code will automatically use the appropriate MCP tools to fulfill your requests.
|
||||
|
||||
## Available MCP Tools
|
||||
|
||||
### Data Exploration Tools
|
||||
|
||||
#### 1. list_databases
|
||||
Lists all databases in your TD account.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"List all my TD databases"
|
||||
"Show me what databases I have"
|
||||
```
|
||||
|
||||
**Direct Tool Call:**
|
||||
```json
|
||||
{
|
||||
"tool": "list_databases"
|
||||
}
|
||||
```
|
||||
|
||||
#### 2. list_tables
|
||||
Lists tables in a database.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"List tables in sample_datasets"
|
||||
"What tables are in my analytics database?"
|
||||
```
|
||||
|
||||
**Direct Tool Call:**
|
||||
```json
|
||||
{
|
||||
"tool": "list_tables",
|
||||
"database": "sample_datasets"
|
||||
}
|
||||
```
|
||||
|
||||
#### 3. describe_table
|
||||
Shows schema information for a specific table.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Describe the nasdaq table"
|
||||
"What columns does user_events have?"
|
||||
"Show me the schema of the transactions table"
|
||||
```
|
||||
|
||||
**Direct Tool Call:**
|
||||
```json
|
||||
{
|
||||
"tool": "describe_table",
|
||||
"database": "sample_datasets",
|
||||
"table": "nasdaq"
|
||||
}
|
||||
```
|
||||
|
||||
#### 4. current_database
|
||||
Shows the current database context.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"What database am I currently using?"
|
||||
"Show current database"
|
||||
```
|
||||
|
||||
#### 5. use_database
|
||||
Switches the database context for subsequent queries.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Switch to the analytics database"
|
||||
"Use the sample_datasets database"
|
||||
```
|
||||
|
||||
**Direct Tool Call:**
|
||||
```json
|
||||
{
|
||||
"tool": "use_database",
|
||||
"database": "analytics"
|
||||
}
|
||||
```
|
||||
|
||||
### Query Execution Tools
|
||||
|
||||
#### 6. query (Read-Only)
|
||||
Executes read-only SQL queries (SELECT, SHOW, DESCRIBE).
|
||||
|
||||
**Features:**
|
||||
- Default limit: 40 rows (optimized for LLM context)
|
||||
- Configurable up to 10,000 rows
|
||||
- Automatic timeout handling
|
||||
- Safe for production use
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Query the nasdaq table and show me the top 10 symbols"
|
||||
"SELECT COUNT(*) FROM user_events WHERE TD_INTERVAL(time, '-1d')"
|
||||
"Show me yesterday's revenue from transactions"
|
||||
```
|
||||
|
||||
**Direct Tool Call:**
|
||||
```json
|
||||
{
|
||||
"tool": "query",
|
||||
"sql": "SELECT symbol, COUNT(*) as cnt FROM nasdaq GROUP BY symbol ORDER BY cnt DESC LIMIT 10"
|
||||
}
|
||||
```
|
||||
|
||||
**With Custom Row Limit:**
|
||||
```json
|
||||
{
|
||||
"tool": "query",
|
||||
"sql": "SELECT * FROM large_table LIMIT 100",
|
||||
"max_rows": 100
|
||||
}
|
||||
```
|
||||
|
||||
#### 7. execute (Write Operations)
|
||||
Executes write operations (INSERT, UPDATE, DELETE, CREATE, DROP).
|
||||
|
||||
**Requirements:**
|
||||
- Must set `TD_ENABLE_UPDATES=true`
|
||||
- Use with caution in production
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Create a table called test_table with columns id and name"
|
||||
"Insert a test row into test_table"
|
||||
```
|
||||
|
||||
**Direct Tool Call:**
|
||||
```json
|
||||
{
|
||||
"tool": "execute",
|
||||
"sql": "CREATE TABLE test_table (id INT, name VARCHAR)"
|
||||
}
|
||||
```
|
||||
|
||||
### CDP Tools (Experimental)
|
||||
|
||||
#### 8. list_parent_segments
|
||||
Lists all parent segments in CDP.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"List all parent segments"
|
||||
"Show me my CDP parent segments"
|
||||
```
|
||||
|
||||
#### 9. get_parent_segment
|
||||
Gets details of a specific parent segment.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Get details for parent segment 12345"
|
||||
```
|
||||
|
||||
#### 10. list_segments
|
||||
Lists segments under a parent segment.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"List segments in parent segment 12345"
|
||||
```
|
||||
|
||||
#### 11. list_activations
|
||||
Lists syndications/activations for a segment.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Show activations for segment 67890"
|
||||
```
|
||||
|
||||
#### 12. get_segment
|
||||
Returns segment details including rules.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Get segment 67890 details"
|
||||
```
|
||||
|
||||
#### 13. parent_segment_sql
|
||||
Retrieves parent segment SQL statement.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Show SQL for parent segment 12345"
|
||||
```
|
||||
|
||||
#### 14. segment_sql
|
||||
Gets segment SQL with filtering conditions.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Show SQL for segment 67890"
|
||||
```
|
||||
|
||||
### Workflow Tools (Experimental)
|
||||
|
||||
#### 15. list_projects
|
||||
Lists workflow projects with pagination.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"List all workflow projects"
|
||||
"Show me my digdag projects"
|
||||
```
|
||||
|
||||
#### 16. list_workflows
|
||||
Lists workflows, optionally filtered by project.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"List workflows in project my_project"
|
||||
"Show all workflows"
|
||||
```
|
||||
|
||||
#### 17. list_sessions
|
||||
Lists execution sessions with status and time filtering.
|
||||
|
||||
**Usage in Claude Code:**
|
||||
```
|
||||
"Show recent workflow sessions"
|
||||
"List failed workflow executions"
|
||||
```
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Pattern 1: Initial Setup and Data Exploration
|
||||
|
||||
```bash
|
||||
# Step 1: Install and configure TD MCP
|
||||
export TD_API_KEY="your_api_key"
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_SITE=us01 \
|
||||
-- npx @treasuredata/mcp-server
|
||||
|
||||
# Step 2: Start Claude Code
|
||||
claude
|
||||
|
||||
# Step 3: Explore your data (in Claude Code conversation)
|
||||
```
|
||||
|
||||
In Claude Code:
|
||||
```
|
||||
> List all my TD databases
|
||||
|
||||
> Switch to sample_datasets database
|
||||
|
||||
> List tables in this database
|
||||
|
||||
> Describe the nasdaq table
|
||||
|
||||
> Query: SELECT symbol, COUNT(*) as cnt FROM nasdaq
|
||||
GROUP BY symbol ORDER BY cnt DESC LIMIT 10
|
||||
```
|
||||
|
||||
**Explanation:** This pattern establishes MCP connection and progressively explores TD data structure using natural language.
|
||||
|
||||
### Pattern 2: Time-Series Data Analysis
|
||||
|
||||
In Claude Code:
|
||||
```
|
||||
> Switch to the analytics database
|
||||
|
||||
> Query yesterday's events:
|
||||
SELECT
|
||||
event_name,
|
||||
COUNT(*) as event_count,
|
||||
COUNT(DISTINCT user_id) as unique_users
|
||||
FROM user_events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
GROUP BY event_name
|
||||
ORDER BY event_count DESC
|
||||
|
||||
> Now show me the last 7 days trend:
|
||||
SELECT
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
|
||||
COUNT(*) as daily_events
|
||||
FROM user_events
|
||||
WHERE TD_INTERVAL(time, '-7d', 'JST')
|
||||
GROUP BY 1
|
||||
ORDER BY 1
|
||||
|
||||
> What's the hourly pattern for today?
|
||||
SELECT
|
||||
TD_TIME_FORMAT(time, 'HH', 'JST') as hour,
|
||||
COUNT(*) as event_count
|
||||
FROM user_events
|
||||
WHERE TD_INTERVAL(time, '0d', 'JST')
|
||||
GROUP BY 1
|
||||
ORDER BY 1
|
||||
```
|
||||
|
||||
**Explanation:** Uses TD_INTERVAL for efficient time-based queries. MCP automatically limits results for optimal LLM context.
|
||||
|
||||
### Pattern 3: Multi-Database Analysis
|
||||
|
||||
In Claude Code:
|
||||
```
|
||||
> Switch to sales database
|
||||
|
||||
> What's the total revenue from yesterday?
|
||||
SELECT SUM(amount) as total_revenue
|
||||
FROM transactions
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
> Now switch to marketing database
|
||||
|
||||
> How many new users signed up yesterday?
|
||||
SELECT COUNT(DISTINCT user_id) as new_users
|
||||
FROM user_signups
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
> Can you join data from both databases to calculate
|
||||
revenue per new user?
|
||||
```
|
||||
|
||||
**Explanation:** Demonstrates switching between databases and combining insights from multiple data sources.
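
One way to answer that last request with a single cross-database query (a sketch; the sales and marketing database and table names follow the pattern above and are placeholders):

```sql
SELECT
  (SELECT CAST(SUM(amount) AS DOUBLE)
   FROM sales.transactions
   WHERE TD_INTERVAL(time, '-1d', 'JST'))
  /
  (SELECT COUNT(DISTINCT user_id)
   FROM marketing.user_signups
   WHERE TD_INTERVAL(time, '-1d', 'JST')) as revenue_per_new_user
```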
|
||||
|
||||
### Pattern 4: Schema Discovery and Documentation
|
||||
|
||||
In Claude Code:
|
||||
```
|
||||
> List all tables in the production database
|
||||
|
||||
> For each table, show me:
|
||||
1. The table schema
|
||||
2. Row count
|
||||
3. Sample of first 5 rows
|
||||
|
||||
> Can you create a markdown document describing
|
||||
all tables and their relationships?
|
||||
|
||||
> Which tables have a 'user_id' column?
|
||||
```
|
||||
|
||||
**Explanation:** Uses MCP tools to automatically document database schema and relationships.
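
For the last prompt, one direct SQL approach is to query the information schema, assuming it is exposed for the td catalog (schema and column names as used above):

```sql
-- Tables in the production database that contain a user_id column
SELECT table_name
FROM information_schema.columns
WHERE table_schema = 'production'
  AND column_name = 'user_id'
ORDER BY table_name
```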
|
||||
|
||||
### Pattern 5: Workflow Monitoring
|
||||
|
||||
In Claude Code:
|
||||
```
|
||||
> List all workflow projects
|
||||
|
||||
> Show workflows in the etl_pipeline project
|
||||
|
||||
> List recent workflow sessions for the daily_aggregation workflow
|
||||
|
||||
> Are there any failed workflow executions in the last 24 hours?
|
||||
|
||||
> Show me the details of the most recent failed session
|
||||
```
|
||||
|
||||
**Explanation:** Monitors TD workflows through MCP, useful for debugging and operational awareness.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use Read-Only Mode by Default**
|
||||
- Keep `TD_ENABLE_UPDATES=false` for safety
|
||||
- Only enable write operations when necessary
|
||||
- Create separate MCP connections for read and write if needed
|
||||
|
||||
2. **Leverage TD Time Functions**
|
||||
```sql
|
||||
-- Good: Uses partition pruning
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Avoid: Scans entire table
|
||||
WHERE date = '2024-01-01'
|
||||
```
|
||||
|
||||
3. **Be Mindful of Result Sizes**
|
||||
- Default 40 rows is optimized for LLM context
|
||||
- Use aggregations instead of raw data when possible
|
||||
- Add explicit LIMIT clauses for large tables
|
||||
|
||||
4. **Set Default Database**
|
||||
```bash
|
||||
# Reduces need to specify database repeatedly
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_DATABASE=your_main_database \
|
||||
-- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
5. **Use Natural Language**
|
||||
- MCP works best with conversational queries
|
||||
- You can refine queries through dialogue
|
||||
- Claude Code understands context from previous messages
|
||||
|
||||
6. **Regional Configuration**
|
||||
- Always set `TD_SITE` to match your data location
|
||||
- Reduces latency and ensures data residency compliance
|
||||
|
||||
7. **Secure API Key Management**
|
||||
```bash
|
||||
# Store in environment variable
|
||||
export TD_API_KEY="your_api_key"
|
||||
|
||||
# Or use shell config
|
||||
echo 'export TD_API_KEY="your_api_key"' >> ~/.bashrc
|
||||
```
|
||||
|
||||
8. **Test Queries Interactively**
|
||||
- Use Claude Code to test and refine queries
|
||||
- Once optimized, save to scripts or workflows
|
||||
|
||||
9. **Use Descriptive Names**
|
||||
- When adding MCP server, use descriptive names
|
||||
- Example: `claude mcp add td-production` vs `td-staging`
|
||||
|
||||
10. **Monitor Usage**
|
||||
- MCP is free during preview
|
||||
- Track your usage patterns for future planning
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Issue: Node.js Version Too Old
|
||||
|
||||
**Symptoms:**
|
||||
- Error: "Node.js version 18.0.0 or higher required"
|
||||
- MCP server fails to start
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check current version
|
||||
node --version
|
||||
|
||||
# Update Node.js (macOS)
|
||||
brew upgrade node
|
||||
|
||||
# Update Node.js (Ubuntu/Debian)
|
||||
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
|
||||
sudo apt-get install -y nodejs
|
||||
|
||||
# Verify
|
||||
node --version # Should show v18.0.0+
|
||||
```
|
||||
|
||||
### Issue: Authentication Failed
|
||||
|
||||
**Symptoms:**
|
||||
- "Authentication error"
|
||||
- "Invalid API key"
|
||||
- "403 Forbidden"
|
||||
|
||||
**Solutions:**
|
||||
1. **Verify API Key Format**
|
||||
```bash
|
||||
echo $TD_API_KEY
|
||||
# Should display your API key
|
||||
```
|
||||
|
||||
2. **Check API Key Permissions**
|
||||
- Log in to TD console
|
||||
- Verify key has appropriate permissions
|
||||
- Regenerate if necessary
|
||||
|
||||
3. **Verify Regional Endpoint**
|
||||
```bash
|
||||
# Ensure TD_SITE matches your account region
|
||||
claude mcp remove td
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_SITE=jp01 \
|
||||
-- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
4. **Test API Key with curl**
|
||||
```bash
|
||||
curl -H "Authorization: TD1 $TD_API_KEY" \
|
||||
https://api.treasuredata.com/v3/database/list
|
||||
```
|
||||
|
||||
### Issue: MCP Server Not Found
|
||||
|
||||
**Symptoms:**
|
||||
- "td: command not found"
|
||||
- Claude Code doesn't see TD tools
|
||||
|
||||
**Solutions:**
|
||||
1. **Verify MCP Installation**
|
||||
```bash
|
||||
claude mcp list
|
||||
# Should show 'td' in output
|
||||
```
|
||||
|
||||
2. **Re-add MCP Server**
|
||||
```bash
|
||||
claude mcp remove td
|
||||
claude mcp add td -e TD_API_KEY=$TD_API_KEY -- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
3. **Check npx is Available**
|
||||
```bash
|
||||
which npx
|
||||
# Should show path to npx
|
||||
```
|
||||
|
||||
4. **Restart Claude Code**
|
||||
```bash
|
||||
# Exit and restart Claude Code
|
||||
```
|
||||
|
||||
### Issue: Query Timeout
|
||||
|
||||
**Symptoms:**
|
||||
- Query runs but never completes
|
||||
- Timeout error after several minutes
|
||||
|
||||
**Solutions:**
|
||||
1. **Add Time Filter**
|
||||
```sql
|
||||
-- Add TD_INTERVAL for partition pruning
|
||||
SELECT * FROM large_table
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
2. **Use Aggregations**
|
||||
```sql
|
||||
-- Instead of raw data
|
||||
SELECT * FROM huge_table
|
||||
|
||||
-- Use aggregations
|
||||
SELECT date, COUNT(*) as cnt
|
||||
FROM huge_table
|
||||
GROUP BY date
|
||||
```
|
||||
|
||||
3. **Reduce Result Size**
|
||||
```sql
|
||||
SELECT * FROM table LIMIT 40
|
||||
```
|
||||
|
||||
4. **Check Query Complexity**
|
||||
- Avoid complex joins on large tables
|
||||
- Use subqueries to filter data first
|
||||
|
||||
### Issue: Write Operations Blocked
|
||||
|
||||
**Symptoms:**
|
||||
- "Write operations not enabled"
|
||||
- Cannot execute CREATE, INSERT, UPDATE, DELETE
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Remove existing MCP server
|
||||
claude mcp remove td
|
||||
|
||||
# Re-add with write operations enabled
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_ENABLE_UPDATES=true \
|
||||
-- npx @treasuredata/mcp-server
|
||||
|
||||
# Restart Claude Code
|
||||
```
|
||||
|
||||
**Warning:** Only enable write operations if absolutely necessary and you understand the risks.
|
||||
|
||||
### Issue: Wrong Region/Site
|
||||
|
||||
**Symptoms:**
|
||||
- Cannot see expected databases
|
||||
- Data appears missing
|
||||
- Slow query performance
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check your account region in TD console
|
||||
# Then update MCP configuration
|
||||
|
||||
claude mcp remove td
|
||||
claude mcp add td \
|
||||
-e TD_API_KEY=$TD_API_KEY \
|
||||
-e TD_SITE=jp01 \
|
||||
-- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
### Issue: MCP Tools Not Appearing in Claude Code
|
||||
|
||||
**Symptoms:**
|
||||
- Claude Code doesn't suggest TD tools
|
||||
- Natural language queries don't trigger MCP
|
||||
|
||||
**Solutions:**
|
||||
1. **Restart Claude Code**
|
||||
```bash
|
||||
# Exit and restart
|
||||
```
|
||||
|
||||
2. **Verify MCP Configuration**
|
||||
```bash
|
||||
claude mcp list
|
||||
claude mcp status
|
||||
```
|
||||
|
||||
3. **Check MCP Server Logs**
|
||||
```bash
|
||||
# Claude Code logs location (varies by OS)
|
||||
# macOS: ~/Library/Logs/Claude/
|
||||
# Linux: ~/.config/Claude/logs/
|
||||
# Windows: %APPDATA%\Claude\logs\
|
||||
```
|
||||
|
||||
4. **Be Explicit in Queries**
|
||||
```
|
||||
Instead of: "Show data"
|
||||
Try: "List tables in my TD database"
|
||||
```
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Multiple MCP Connections
|
||||
|
||||
You can set up multiple MCP connections for different environments:
|
||||
|
||||
```bash
|
||||
# Production (read-only)
|
||||
claude mcp add td-prod \
|
||||
-e TD_API_KEY=$TD_PROD_API_KEY \
|
||||
-e TD_SITE=us01 \
|
||||
-e TD_DATABASE=production \
|
||||
-- npx @treasuredata/mcp-server
|
||||
|
||||
# Staging (with write access)
|
||||
claude mcp add td-staging \
|
||||
-e TD_API_KEY=$TD_STAGING_API_KEY \
|
||||
-e TD_SITE=us01 \
|
||||
-e TD_DATABASE=staging \
|
||||
-e TD_ENABLE_UPDATES=true \
|
||||
-- npx @treasuredata/mcp-server
|
||||
|
||||
# Development (full access)
|
||||
claude mcp add td-dev \
|
||||
-e TD_API_KEY=$TD_DEV_API_KEY \
|
||||
-e TD_SITE=us01 \
|
||||
-e TD_DATABASE=development \
|
||||
-e TD_ENABLE_UPDATES=true \
|
||||
-- npx @treasuredata/mcp-server
|
||||
```
|
||||
|
||||
In Claude Code, specify which connection:
|
||||
```
|
||||
"Using td-prod, list all databases"
|
||||
"Using td-staging, create test table"
|
||||
```
|
||||
|
||||
### Custom Configuration Files
|
||||
|
||||
Instead of command-line setup, you can manually edit Claude Code's MCP configuration:
|
||||
|
||||
**Location:**
|
||||
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
|
||||
- Linux: `~/.config/Claude/claude_desktop_config.json`
|
||||
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
|
||||
|
||||
**Example Configuration:**
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"treasuredata": {
|
||||
"command": "npx",
|
||||
"args": ["@treasuredata/mcp-server"],
|
||||
"env": {
|
||||
"TD_API_KEY": "your_api_key",
|
||||
"TD_SITE": "us01",
|
||||
"TD_DATABASE": "sample_datasets",
|
||||
"TD_ENABLE_UPDATES": "false"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Using with Other MCP Servers
|
||||
|
||||
TD MCP can work alongside other MCP servers:
|
||||
|
||||
```bash
|
||||
# Add TD MCP
|
||||
claude mcp add td -e TD_API_KEY=$TD_API_KEY -- npx @treasuredata/mcp-server
|
||||
|
||||
# Add other MCP servers
|
||||
claude mcp add github -- npx @modelcontextprotocol/server-github
|
||||
claude mcp add postgres -- npx @modelcontextprotocol/server-postgres
|
||||
|
||||
# Use in Claude Code
|
||||
"Query my TD data and compare with GitHub metrics"
|
||||
```
|
||||
|
||||
### Programmatic Usage
|
||||
|
||||
While primarily used through Claude Code, you can also interact with TD MCP programmatically using MCP SDKs.
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **API Key Storage**
|
||||
- Never commit API keys to version control
|
||||
- Use environment variables or secure vaults
|
||||
- Rotate keys regularly
|
||||
|
||||
2. **Read-Only by Default**
|
||||
- Keep `TD_ENABLE_UPDATES=false` unless necessary
|
||||
- Create separate read-only API keys for MCP use
|
||||
|
||||
3. **Network Security**
|
||||
- MCP communicates over HTTPS
|
||||
- API keys are never logged by default
|
||||
- All operations are audited in TD
|
||||
|
||||
4. **Least Privilege**
|
||||
- Use API keys with minimal required permissions
|
||||
- Create database-specific API keys if possible
|
||||
|
||||
5. **Multi-User Environments**
|
||||
- Each user should use their own API key
|
||||
- Avoid sharing MCP configurations
|
||||
|
||||
## Resources
|
||||
|
||||
- **GitHub Repository**: https://github.com/treasure-data/td-mcp-server
|
||||
- **npm Package**: https://www.npmjs.com/package/@treasuredata/mcp-server
|
||||
- **MCP Protocol**: https://modelcontextprotocol.io/
|
||||
- **TD Documentation**: https://docs.treasuredata.com/
|
||||
- **Claude Code Documentation**: https://docs.claude.com/claude-code
|
||||
|
||||
## Related Skills
|
||||
|
||||
- **trino**: Understanding SQL syntax for MCP queries
|
||||
- **hive**: Hive-specific functions available through MCP
|
||||
- **pytd**: Python alternative to MCP for programmatic access
|
||||
- **trino-cli**: Command-line alternative to MCP
|
||||
|
||||
## Comparison with Other Tools
|
||||
|
||||
| Tool | Interface | Best For |
|
||||
|------|-----------|----------|
|
||||
| **TD MCP** | Natural language in Claude Code | Conversational data exploration, quick insights |
|
||||
| **Trino CLI** | Command-line SQL | Scripting, automation, terminal workflows |
|
||||
| **pytd** | Python SDK | ETL pipelines, complex transformations, notebooks |
|
||||
| **TD Console** | Web UI | Visualization, sharing, collaboration |
|
||||
|
||||
**Recommendation:** Use TD MCP for interactive data exploration and ad-hoc analysis within Claude Code. Use other tools for production pipelines and automation.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2025-01 | TD MCP Server: Public Preview | License: Apache-2.0*
|
||||

skills/trino-cli/SKILL.md (854 lines, new file)

@@ -0,0 +1,854 @@
---
|
||||
name: trino-cli
|
||||
description: Expert assistance for using the Trino CLI to query Treasure Data interactively from the command line. Use this skill when users need help with trino command-line tool, interactive query execution, connecting to TD via CLI, or terminal-based data exploration.
|
||||
---
|
||||
|
||||
# Trino CLI for Treasure Data
|
||||
|
||||
Expert assistance for using the Trino CLI to query and explore Treasure Data interactively from the command line.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Running interactive queries against TD from the terminal
|
||||
- Exploring TD databases, tables, and schemas via command line
|
||||
- Quick ad-hoc data analysis without opening a web console
|
||||
- Writing shell scripts that execute TD queries
|
||||
- Debugging queries with immediate feedback
|
||||
- Working in terminal-based workflows (SSH, tmux, screen)
|
||||
- Executing batch queries from the command line
|
||||
- Testing queries before integrating into applications
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Installation
|
||||
|
||||
**Download Trino CLI:**
|
||||
```bash
|
||||
# Download the latest version
|
||||
curl -o trino https://repo1.maven.org/maven2/io/trino/trino-cli/477/trino-cli-477-executable.jar
|
||||
|
||||
# Make it executable
|
||||
chmod +x trino
|
||||
|
||||
# Move to PATH (optional)
|
||||
sudo mv trino /usr/local/bin/
|
||||
|
||||
# Verify installation
|
||||
trino --version
|
||||
```
|
||||
|
||||
**Requirements:**
|
||||
- Java 11 or later (Java 22+ recommended)
|
||||
- Network access to TD API endpoint
|
||||
- TD API key
|
||||
|
||||
**Alternative for Windows:**
|
||||
```powershell
|
||||
# Run with Java directly
|
||||
java -jar trino-cli-477-executable.jar --version
|
||||
```
|
||||
|
||||
### 2. Connecting to Treasure Data
|
||||
|
||||
**Basic Connection:**
|
||||
```bash
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user YOUR_TD_API_KEY \
|
||||
--schema your_database
|
||||
```
|
||||
|
||||
**Using Environment Variable:**
|
||||
```bash
|
||||
# Set TD API key as environment variable (recommended)
|
||||
export TD_API_KEY="your_api_key_here"
|
||||
|
||||
# Connect using environment variable
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema sample_datasets
|
||||
```
|
||||
|
||||
**Regional Endpoints:**
|
||||
- **US**: `https://api-presto.treasuredata.com`
|
||||
- **Tokyo**: `https://api-presto.treasuredata.co.jp`
|
||||
- **EU**: `https://api-presto.eu01.treasuredata.com`
|
||||
|
||||
### 3. Interactive Mode
|
||||
|
||||
Once connected, you enter an interactive SQL prompt:
|
||||
|
||||
```sql
|
||||
trino:sample_datasets> SELECT COUNT(*) FROM nasdaq;
|
||||
_col0
|
||||
---------
|
||||
8807790
|
||||
(1 row)
|
||||
|
||||
Query 20250123_123456_00001_abcde, FINISHED, 1 node
|
||||
Splits: 17 total, 17 done (100.00%)
|
||||
0.45 [8.81M rows, 0B] [19.6M rows/s, 0B/s]
|
||||
|
||||
trino:sample_datasets> SHOW TABLES;
|
||||
Table
|
||||
-----------
|
||||
nasdaq
|
||||
www_access
|
||||
(2 rows)
|
||||
```
|
||||
|
||||
**Interactive Commands:**
|
||||
- `QUIT` or `EXIT` - Exit the CLI
|
||||
- `CLEAR` - Clear the screen
|
||||
- `HELP` - Show help information
|
||||
- `HISTORY` - Show command history
|
||||
- `USE schema_name` - Switch to different database
|
||||
- `SHOW CATALOGS` - List available catalogs
|
||||
- `SHOW SCHEMAS` - List databases
|
||||
- `SHOW TABLES` - List tables in current schema
|
||||
- `DESCRIBE table_name` - Show table structure
|
||||
- `EXPLAIN query` - Show query execution plan
|
||||
|
||||
### 4. Batch Mode (Non-Interactive)
|
||||
|
||||
Execute queries from command line without entering interactive mode:
|
||||
|
||||
**Single Query:**
|
||||
```bash
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema sample_datasets \
|
||||
--execute "SELECT COUNT(*) FROM nasdaq"
|
||||
```
|
||||
|
||||
**From File:**
|
||||
```bash
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema sample_datasets \
|
||||
--file queries.sql
|
||||
```
|
||||
|
||||
**From stdin (pipe):**
|
||||
```bash
|
||||
echo "SELECT symbol, COUNT(*) as cnt FROM nasdaq GROUP BY symbol LIMIT 10" | \
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema sample_datasets
|
||||
```
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Pattern 1: Interactive Data Exploration
|
||||
|
||||
```bash
|
||||
# Connect to TD
|
||||
export TD_API_KEY="your_api_key"
|
||||
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema sample_datasets
|
||||
|
||||
# Then in the interactive prompt:
|
||||
```
|
||||
|
||||
```sql
|
||||
-- List all databases
|
||||
trino:sample_datasets> SHOW SCHEMAS;
|
||||
|
||||
-- Switch to a different database
|
||||
trino:sample_datasets> USE analytics;
|
||||
|
||||
-- List tables
|
||||
trino:analytics> SHOW TABLES;
|
||||
|
||||
-- Describe table structure
|
||||
trino:analytics> DESCRIBE user_events;
|
||||
|
||||
-- Preview data
|
||||
trino:analytics> SELECT * FROM user_events LIMIT 10;
|
||||
|
||||
-- Quick aggregation
|
||||
trino:analytics> SELECT
|
||||
event_name,
|
||||
COUNT(*) as cnt
|
||||
FROM user_events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
GROUP BY event_name
|
||||
ORDER BY cnt DESC
|
||||
LIMIT 10;
|
||||
|
||||
-- Exit
|
||||
trino:analytics> EXIT;
|
||||
```
|
||||
|
||||
**Explanation:** Interactive mode is perfect for exploring data, testing queries, and understanding table structures with immediate feedback.
|
||||
|
||||
### Pattern 2: Scripted Query Execution
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# daily_report.sh - Generate daily report from TD
|
||||
|
||||
export TD_API_KEY="your_api_key"
|
||||
TD_SERVER="https://api-presto.treasuredata.com"
|
||||
DATABASE="analytics"
|
||||
|
||||
# Create SQL file
|
||||
cat > /tmp/daily_report.sql <<'EOF'
|
||||
SELECT
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
|
||||
COUNT(*) as total_events,
|
||||
COUNT(DISTINCT user_id) as unique_users,
|
||||
APPROX_PERCENTILE(session_duration, 0.5) as median_duration
|
||||
FROM user_events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
GROUP BY 1;
|
||||
EOF
|
||||
|
||||
# Execute query and save results
|
||||
trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema $DATABASE \
|
||||
--file /tmp/daily_report.sql \
|
||||
--output-format CSV_HEADER > daily_report_$(date +%Y%m%d).csv
|
||||
|
||||
echo "Report saved to daily_report_$(date +%Y%m%d).csv"
|
||||
|
||||
# Clean up
|
||||
rm /tmp/daily_report.sql
|
||||
```
|
||||
|
||||
**Explanation:** Batch mode is ideal for automation, scheduled reports, and integrating TD queries into shell scripts.
|
||||
|
||||
### Pattern 3: Multiple Queries with Error Handling
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# etl_pipeline.sh - Run multiple queries in sequence
|
||||
|
||||
export TD_API_KEY="your_api_key"
|
||||
TD_SERVER="https://api-presto.treasuredata.com"
|
||||
|
||||
run_query() {
|
||||
local query="$1"
|
||||
local description="$2"
|
||||
|
||||
echo "Running: $description"
|
||||
|
||||
if trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema analytics \
|
||||
--execute "$query"; then
|
||||
echo "✓ Success: $description"
|
||||
return 0
|
||||
else
|
||||
echo "✗ Failed: $description"
|
||||
return 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Step 1: Create aggregated table
|
||||
run_query "
|
||||
CREATE TABLE IF NOT EXISTS daily_summary AS
|
||||
SELECT
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
|
||||
user_id,
|
||||
COUNT(*) as event_count
|
||||
FROM raw_events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
GROUP BY 1, 2
|
||||
" "Create daily summary table" || exit 1
|
||||
|
||||
# Step 2: Validate row count
|
||||
COUNT=$(trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema analytics \
|
||||
--execute "SELECT COUNT(*) FROM daily_summary" \
|
||||
--output-format CSV_UNQUOTED)
|
||||
|
||||
echo "Processed $COUNT rows"
|
||||
|
||||
if [ "$COUNT" -gt 0 ]; then
|
||||
echo "✓ Pipeline completed successfully"
|
||||
else
|
||||
echo "✗ Warning: No data processed"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
**Explanation:** Demonstrates error handling, sequential query execution, and validation in shell scripts using Trino CLI.
|
||||
|
||||
### Pattern 4: Configuration File for Easy Access
|
||||
|
||||
```bash
|
||||
# Create ~/.trino_config
|
||||
cat > ~/.trino_config <<EOF
|
||||
server=https://api-presto.treasuredata.com
|
||||
catalog=td
|
||||
user=$TD_API_KEY
|
||||
schema=sample_datasets
|
||||
output-format-interactive=ALIGNED
|
||||
EOF
|
||||
|
||||
# Now you can simply run:
|
||||
trino
|
||||
|
||||
# No need to specify server, user, etc. every time
|
||||
```
|
||||
|
||||
**Alternative - Create a wrapper script:**
|
||||
```bash
|
||||
# Create ~/bin/td-trino
|
||||
cat > ~/bin/td-trino <<'EOF'
|
||||
#!/bin/bash
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user ${TD_API_KEY} \
|
||||
--schema ${1:-sample_datasets}
|
||||
EOF
|
||||
|
||||
chmod +x ~/bin/td-trino
|
||||
|
||||
# Usage:
|
||||
td-trino # connects to sample_datasets
|
||||
td-trino analytics # connects to analytics database
|
||||
```
|
||||
|
||||
**Explanation:** Configuration files and wrapper scripts simplify repeated connections and reduce typing.
|
||||
|
||||
### Pattern 5: Formatted Output for Different Use Cases
|
||||
|
||||
```bash
|
||||
export TD_API_KEY="your_api_key"
|
||||
TD_SERVER="https://api-presto.treasuredata.com"
|
||||
DATABASE="sample_datasets"
|
||||
QUERY="SELECT symbol, COUNT(*) as cnt FROM nasdaq GROUP BY symbol ORDER BY cnt DESC LIMIT 10"
|
||||
|
||||
# CSV for spreadsheets
|
||||
trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema $DATABASE \
|
||||
--execute "$QUERY" \
|
||||
--output-format CSV_HEADER > results.csv
|
||||
|
||||
# JSON for APIs/applications
|
||||
trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema $DATABASE \
|
||||
--execute "$QUERY" \
|
||||
--output-format JSON > results.json
|
||||
|
||||
# TSV for data processing
|
||||
trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema $DATABASE \
|
||||
--execute "$QUERY" \
|
||||
--output-format TSV_HEADER > results.tsv
|
||||
|
||||
# Markdown for documentation
|
||||
trino \
|
||||
--server $TD_SERVER \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema $DATABASE \
|
||||
--execute "$QUERY" \
|
||||
--output-format MARKDOWN > results.md
|
||||
```
|
||||
|
||||
**Explanation:** Different output formats enable integration with various downstream tools and workflows.
|
||||
|
||||
## Command-Line Options Reference
|
||||
|
||||
### Connection Options
|
||||
|
||||
| Option | Description | Example |
|
||||
|--------|-------------|---------|
|
||||
| `--server` | TD Presto endpoint | `https://api-presto.treasuredata.com` |
|
||||
| `--catalog` | Catalog name | `td` |
|
||||
| `--user` | TD API key | `$TD_API_KEY` |
|
||||
| `--schema` | Default database | `sample_datasets` |
|
||||
| `--password` | Enable password prompt | Not used for TD |
|
||||
|
||||
### Execution Options
|
||||
|
||||
| Option | Description |
|
||||
|--------|-------------|
|
||||
| `--execute "SQL"` | Execute single query and exit |
|
||||
| `--file queries.sql` | Execute queries from file |
|
||||
| `--ignore-errors` | Continue on error (batch mode) |
|
||||
| `--client-request-timeout` | Query timeout (default: 2m) |
|
||||
|
||||
### Output Options
|
||||
|
||||
| Option | Description | Values |
|
||||
|--------|-------------|--------|
|
||||
| `--output-format` | Batch mode output format | CSV, JSON, TSV, MARKDOWN, etc. |
|
||||
| `--output-format-interactive` | Interactive mode format | ALIGNED, VERTICAL, AUTO |
|
||||
| `--no-progress` | Disable progress indicator | |
|
||||
| `--pager` | Custom pager program | `less`, `more`, etc. |
|
||||
|
||||
### Display Options
|
||||
|
||||
| Option | Description |
|
||||
|--------|-------------|
|
||||
| `--debug` | Enable debug output |
|
||||
| `--log-levels-file` | Custom logging configuration |
|
||||
| `--disable-auto-suggestion` | Turn off autocomplete |
|
||||
|
||||
### Configuration
|
||||
|
||||
| Option | Description |
|
||||
|--------|-------------|
|
||||
| `--config` | Configuration file path (alternative to `~/.trino_config`) |
|
||||
| `--session property=value` | Set session property |
|
||||
| `--timezone` | Session timezone |
|
||||
| `--client-tags` | Add metadata tags |
|
||||
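As a sketch of how these fit together (the tag names are arbitrary examples):

```bash
# Pin the session timezone and tag the query for later identification
trino \
  --server https://api-presto.treasuredata.com \
  --catalog td \
  --user $TD_API_KEY \
  --schema sample_datasets \
  --timezone Asia/Tokyo \
  --client-tags nightly,reporting \
  --session query_max_run_time=1h \
  --execute "SELECT COUNT(*) FROM nasdaq"
```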
|
||||
## Output Formats
|
||||
|
||||
Available output formats:
|
||||
|
||||
### Batch Mode Formats
|
||||
|
||||
- **CSV** - Comma-separated, quoted strings (default for batch)
|
||||
- **CSV_HEADER** - CSV with header row
|
||||
- **CSV_UNQUOTED** - CSV without quotes
|
||||
- **CSV_HEADER_UNQUOTED** - CSV with header, no quotes
|
||||
- **TSV** - Tab-separated values
|
||||
- **TSV_HEADER** - TSV with header row
|
||||
- **JSON** - One JSON object per row (newline-delimited)
|
||||
- **MARKDOWN** - Markdown table format
|
||||
- **NULL** - Execute but discard output
|
||||
|
||||
### Interactive Mode Formats
|
||||
|
||||
- **ALIGNED** - Pretty-printed table (default)
|
||||
- **VERTICAL** - One column per line
|
||||
- **AUTO** - Automatic format selection
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
# CSV with header for Excel
|
||||
trino --execute "SELECT * FROM table" --output-format CSV_HEADER
|
||||
|
||||
# JSON for jq processing
|
||||
trino --execute "SELECT * FROM table" --output-format JSON | jq '.user_id'
|
||||
|
||||
# Aligned for terminal viewing
|
||||
trino --output-format-interactive ALIGNED
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always Use Environment Variables for API Keys**
|
||||
```bash
|
||||
# In ~/.bashrc or ~/.zshrc
|
||||
export TD_API_KEY="your_api_key"
|
||||
```
|
||||
Never hardcode API keys in scripts or commands
|
||||
|
||||
2. **Create Configuration File for Frequent Use**
|
||||
```bash
|
||||
# ~/.trino_config
|
||||
server=https://api-presto.treasuredata.com
|
||||
catalog=td
|
||||
user=$TD_API_KEY
|
||||
```
|
||||
|
||||
3. **Use TD Time Functions for Partition Pruning**
|
||||
```sql
|
||||
-- Good: Uses partition pruning
|
||||
SELECT * FROM events WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Bad: Scans entire table
|
||||
SELECT * FROM events WHERE date = '2024-01-01'
|
||||
```
|
||||
|
||||
4. **Add LIMIT for Exploratory Queries**
|
||||
```sql
|
||||
-- Safe exploratory query
|
||||
SELECT * FROM large_table LIMIT 100;
|
||||
```
|
||||
|
||||
5. **Use Batch Mode for Automation**
|
||||
```bash
|
||||
# Don't use interactive mode in cron jobs
|
||||
trino --execute "SELECT ..." --output-format CSV > output.csv
|
||||
```
|
||||
|
||||
6. **Enable Debug Mode for Troubleshooting**
|
||||
```bash
|
||||
trino --debug --execute "SELECT ..."
|
||||
```
|
||||
|
||||
7. **Set Reasonable Timeouts**
|
||||
```bash
|
||||
# For long-running queries
|
||||
trino --client-request-timeout 30m --execute "SELECT ..."
|
||||
```
|
||||
|
||||
8. **Use Appropriate Output Format**
|
||||
- CSV/TSV for data processing
|
||||
- JSON for programmatic parsing
|
||||
- ALIGNED for human viewing
|
||||
- MARKDOWN for documentation
|
||||
|
||||
9. **Leverage History in Interactive Mode**
|
||||
- Use ↑/↓ arrow keys to navigate history
|
||||
- Use Ctrl+R for reverse search
|
||||
- History saved in `~/.trino_history`
|
||||
|
||||
10. **Test Queries Interactively First**
|
||||
Test complex queries in interactive mode before adding to scripts
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Issue: Connection Refused or Timeout
|
||||
|
||||
**Symptoms:**
|
||||
- `Connection refused`
|
||||
- `Read timed out`
|
||||
- Cannot connect to server
|
||||
|
||||
**Solutions:**
|
||||
1. **Verify Endpoint URL**
|
||||
```bash
|
||||
# Check you're using the correct regional endpoint
|
||||
# US: https://api-presto.treasuredata.com
|
||||
# Tokyo: https://api-presto.treasuredata.co.jp
|
||||
# EU: https://api-presto.eu01.treasuredata.com
|
||||
```
|
||||
|
||||
2. **Check Network Connectivity**
|
||||
```bash
|
||||
curl -I https://api-presto.treasuredata.com
|
||||
```
|
||||
|
||||
3. **Verify Firewall/Proxy Settings**
|
||||
```bash
|
||||
# If behind proxy
|
||||
trino --http-proxy proxy.example.com:8080 --server ...
|
||||
```
|
||||
|
||||
4. **Increase Timeout**
|
||||
```bash
|
||||
trino --client-request-timeout 10m --server ...
|
||||
```
|
||||
|
||||
### Issue: Authentication Errors
|
||||
|
||||
**Symptoms:**
|
||||
- `Authentication failed`
|
||||
- `Unauthorized`
|
||||
- `403 Forbidden`
|
||||
|
||||
**Solutions:**
|
||||
1. **Check API Key Format**
|
||||
```bash
|
||||
# Verify API key is set
|
||||
echo $TD_API_KEY # Should display your API key
|
||||
```
|
||||
|
||||
2. **Verify API Key is Set**
|
||||
```bash
|
||||
if [ -z "$TD_API_KEY" ]; then
|
||||
echo "TD_API_KEY is not set"
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Test API Key with curl**
|
||||
```bash
|
||||
curl -H "Authorization: TD1 $TD_API_KEY" \
|
||||
https://api.treasuredata.com/v3/database/list
|
||||
```
|
||||
|
||||
4. **Regenerate API Key**
|
||||
- Log in to TD console
|
||||
- Generate new API key
|
||||
- Update environment variable
|
||||
|
||||
### Issue: Query Timeout
|
||||
|
||||
**Symptoms:**
|
||||
- Query runs but never completes
|
||||
- `Query exceeded maximum time limit`
|
||||
|
||||
**Solutions:**
|
||||
1. **Add Time Filter**
|
||||
```sql
|
||||
-- Add partition pruning
|
||||
SELECT * FROM table
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
2. **Increase Timeout**
|
||||
```bash
|
||||
trino --client-request-timeout 30m --execute "..."
|
||||
```
|
||||
|
||||
3. **Use Aggregations Instead**
|
||||
```sql
|
||||
-- Instead of fetching all rows
|
||||
SELECT * FROM huge_table
|
||||
|
||||
-- Aggregate first
|
||||
SELECT date, COUNT(*) FROM huge_table GROUP BY date
|
||||
```
|
||||
|
||||
4. **Add LIMIT Clause**
|
||||
```sql
|
||||
SELECT * FROM large_table LIMIT 10000
|
||||
```
|
||||
|
||||
### Issue: Java Not Found
|
||||
|
||||
**Symptoms:**
|
||||
- `java: command not found`
|
||||
- `JAVA_HOME not set`
|
||||
|
||||
**Solutions:**
|
||||
1. **Install Java**
|
||||
```bash
|
||||
# macOS
|
||||
brew install openjdk@17
|
||||
|
||||
# Ubuntu/Debian
|
||||
sudo apt-get install openjdk-17-jdk
|
||||
|
||||
# RHEL/CentOS
|
||||
sudo yum install java-17-openjdk
|
||||
```
|
||||
|
||||
2. **Set JAVA_HOME**
|
||||
```bash
|
||||
# Add to ~/.bashrc or ~/.zshrc
|
||||
export JAVA_HOME=$(/usr/libexec/java_home -v 17) # macOS
|
||||
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk # Linux
|
||||
```
|
||||
|
||||
3. **Verify Java Version**
|
||||
```bash
|
||||
java -version # Should show 11 or higher
|
||||
```
|
||||
|
||||
### Issue: Output Not Formatted Correctly
|
||||
|
||||
**Symptoms:**
|
||||
- Broken table alignment
|
||||
- Missing columns
|
||||
- Garbled characters
|
||||
|
||||
**Solutions:**
|
||||
1. **Specify Output Format Explicitly**
|
||||
```bash
|
||||
# For batch mode
|
||||
trino --execute "..." --output-format CSV_HEADER
|
||||
|
||||
# For interactive mode
|
||||
trino --output-format-interactive ALIGNED
|
||||
```
|
||||
|
||||
2. **Check Terminal Width**
|
||||
```bash
|
||||
# Wider terminal for better formatting
|
||||
stty size # Check current size
|
||||
```
|
||||
|
||||
3. **Use VERTICAL Format for Wide Tables**
|
||||
```bash
# Start the CLI with vertical output for wide result sets
trino --output-format-interactive VERTICAL
```
|
||||
|
||||
4. **Disable Pager if Issues**
|
||||
```bash
|
||||
trino --pager='' # Disable pager
|
||||
```
|
||||
|
||||
### Issue: History Not Working
|
||||
|
||||
**Symptoms:**
|
||||
- Arrow keys don't show previous commands
|
||||
- History not saved between sessions
|
||||
|
||||
**Solutions:**
|
||||
1. **Check History File Permissions**
|
||||
```bash
|
||||
ls -la ~/.trino_history
|
||||
chmod 600 ~/.trino_history
|
||||
```
|
||||
|
||||
2. **Specify Custom History File**
|
||||
```bash
|
||||
trino --history-file ~/my_trino_history
|
||||
```
|
||||
|
||||
3. **Check Disk Space**
|
||||
```bash
|
||||
df -h ~ # Ensure home directory has space
|
||||
```
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Session Properties
|
||||
|
||||
Set query-specific properties:
|
||||
|
||||
```bash
|
||||
# Set query priority
|
||||
trino \
|
||||
--session query_priority=1 \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--execute "SELECT * FROM large_table"
|
||||
|
||||
# Set multiple properties
|
||||
trino \
|
||||
--session query_max_run_time=1h \
|
||||
--session query_priority=2 \
|
||||
--execute "SELECT ..."
|
||||
```
|
||||
|
||||
### Using with jq for JSON Processing
|
||||
|
||||
```bash
|
||||
# Query and process with jq
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema sample_datasets \
|
||||
--execute "SELECT symbol, COUNT(*) as cnt FROM nasdaq GROUP BY symbol LIMIT 10" \
|
||||
--output-format JSON | \
|
||||
  jq 'select(.cnt > 1000) | .symbol'
|
||||
```
|
||||
|
||||
### Parallel Query Execution
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Run multiple queries in parallel
|
||||
|
||||
export TD_API_KEY="your_api_key"
|
||||
|
||||
run_query() {
|
||||
local database=$1
|
||||
local output=$2
|
||||
trino \
|
||||
--server https://api-presto.treasuredata.com \
|
||||
--catalog td \
|
||||
--user $TD_API_KEY \
|
||||
--schema $database \
|
||||
--execute "SELECT COUNT(*) FROM events WHERE TD_INTERVAL(time, '-1d', 'JST')" \
|
||||
--output-format CSV > $output
|
||||
}
|
||||
|
||||
# Run in parallel using background jobs
|
||||
run_query "database1" "count1.csv" &
|
||||
run_query "database2" "count2.csv" &
|
||||
run_query "database3" "count3.csv" &
|
||||
|
||||
# Wait for all to complete
|
||||
wait
|
||||
|
||||
echo "All queries completed"
|
||||
```
|
||||
|
||||
### Integration with Other Tools
|
||||
|
||||
**With csvkit:**
|
||||
```bash
|
||||
trino --execute "SELECT * FROM table" --output-format CSV | \
|
||||
csvstat
|
||||
```
|
||||
|
||||
**With awk:**
|
||||
```bash
|
||||
trino --execute "SELECT symbol, cnt FROM nasdaq_summary" --output-format TSV | \
|
||||
awk '$2 > 1000 { print $1 }'
|
||||
```
|
||||
|
||||
**With Python:**
|
||||
```bash
|
||||
trino --execute "SELECT * FROM table" --output-format JSON | \
|
||||
  python -c "import sys, json; rows = [json.loads(line) for line in sys.stdin]; print(len(rows))"
|
||||
```
|
||||
|
||||
## Interactive Commands Reference
|
||||
|
||||
Commands available in interactive mode:
|
||||
|
||||
| Command | Description |
|
||||
|---------|-------------|
|
||||
| `QUIT` or `EXIT` | Exit the CLI |
|
||||
| `CLEAR` | Clear the screen |
|
||||
| `HELP` | Show help information |
|
||||
| `HISTORY` | Display command history |
|
||||
| `USE schema` | Switch to different database |
|
||||
| `SHOW CATALOGS` | List available catalogs |
|
||||
| `SHOW SCHEMAS` | List all databases |
|
||||
| `SHOW TABLES` | List tables in current schema |
|
||||
| `SHOW COLUMNS FROM table` | Show table structure |
|
||||
| `DESCRIBE table` | Show detailed table info |
|
||||
| `EXPLAIN query` | Show query execution plan |
|
||||
| `SHOW FUNCTIONS` | List available functions |
|
||||
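A short interactive session illustrating a few of these (database and table names are only examples):

```sql
trino:sample_datasets> SHOW TABLES;
trino:sample_datasets> DESCRIBE nasdaq;
trino:sample_datasets> USE analytics;
trino:analytics> SHOW COLUMNS FROM events;
trino:analytics> EXPLAIN SELECT COUNT(*) FROM events WHERE TD_INTERVAL(time, '-1d', 'JST');
```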
|
||||
## Resources
|
||||
|
||||
- **Trino CLI Documentation**: https://trino.io/docs/current/client/cli.html
|
||||
- **TD Presto Endpoints**:
|
||||
- US: https://api-presto.treasuredata.com
|
||||
- Tokyo: https://api-presto.treasuredata.co.jp
|
||||
- EU: https://api-presto.eu01.treasuredata.com
|
||||
- **TD Documentation**: https://docs.treasuredata.com/
|
||||
- **Trino SQL Reference**: https://trino.io/docs/current/sql.html
|
||||
|
||||
## Related Skills
|
||||
|
||||
- **trino**: SQL query syntax and optimization for Trino
|
||||
- **hive**: Understanding Hive SQL differences
|
||||
- **pytd**: Python-based querying (alternative to CLI)
|
||||
- **td-javascript-sdk**: Browser-based data collection
|
||||
|
||||
## Comparison with Other Tools
|
||||
|
||||
| Tool | Purpose | When to Use |
|
||||
|------|---------|-------------|
|
||||
| **Trino CLI** | Interactive command-line queries | Ad-hoc queries, exploration, shell scripts |
|
||||
| **TD Console** | Web-based query interface | GUI preference, visualization, sharing |
|
||||
| **pytd** | Python SDK | Complex ETL, pandas integration, notebooks |
|
||||
| **TD Toolbelt** | TD-specific CLI | Bulk import, job management, administration |
|
||||
|
||||
**Recommendation:** Use Trino CLI for quick interactive queries and terminal-based workflows. Use TD Console for visualization and sharing. Use pytd for complex data pipelines.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2025-01 | Trino CLI version: 477+*
|
||||
567
skills/trino-optimizer/SKILL.md
Normal file
567
skills/trino-optimizer/SKILL.md
Normal file
@@ -0,0 +1,567 @@
|
||||
---
|
||||
name: trino-optimizer
|
||||
description: Expert assistance for optimizing Trino query performance in Treasure Data. Use this skill when users need help with slow queries, memory issues, timeouts, or performance tuning. Focuses on partition pruning, column selection, output optimization, and common performance bottlenecks.
|
||||
---
|
||||
|
||||
# Trino Query Optimizer
|
||||
|
||||
Expert assistance for optimizing Trino query performance and troubleshooting performance issues in Treasure Data environments.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Queries are running too slowly or timing out
|
||||
- Encountering memory limit errors
|
||||
- Optimizing existing queries for better performance
|
||||
- Debugging PAGE_TRANSPORT_TIMEOUT errors
|
||||
- Reducing query costs
|
||||
- Analyzing query execution plans
|
||||
- Joins between large and small tables need optimization (use Magic Comments)
|
||||
- Frequent ID lookups on large tables (>100M rows) need optimization (use UDP)
|
||||
|
||||
## Core Optimization Principles
|
||||
|
||||
### 1. Time-Based Partition Pruning (Most Important)
|
||||
|
||||
TD tables are partitioned by 1-hour buckets. **Always** filter on time for optimal performance.
|
||||
|
||||
**Use TD_INTERVAL or TD_TIME_RANGE:**
|
||||
```sql
|
||||
-- Good: Uses partition pruning
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Good: Explicit time range
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
|
||||
-- Bad: No time filter - scans entire table!
|
||||
WHERE user_id = 123 -- Missing time filter
|
||||
```
|
||||
|
||||
**Impact:** Without time filters, queries scan the entire table instead of just relevant partitions, dramatically increasing execution time and cost.
|
||||
|
||||
### 2. Column Selection
|
||||
|
||||
TD uses columnar storage format. **Select only needed columns.**
|
||||
|
||||
```sql
|
||||
-- Good: Select specific columns
|
||||
SELECT user_id, event_type, time
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Bad: SELECT * reads all columns
|
||||
SELECT * -- Slower and more expensive
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
**Impact:** Each additional column increases I/O. Reading 10 columns vs 50 columns can make a significant performance difference.
|
||||
|
||||
### 3. Output Optimization
|
||||
|
||||
Use appropriate output methods for your use case.
|
||||
|
||||
**CREATE TABLE AS (CTAS) - 5x faster than SELECT:**
|
||||
```sql
|
||||
-- Best for large result sets
|
||||
CREATE TABLE analysis_results AS
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
COUNT(*) as events
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY 1
|
||||
```
|
||||
|
||||
**INSERT INTO - For appending to existing tables:**
|
||||
```sql
|
||||
INSERT INTO daily_summary
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
COUNT(*) as events
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
GROUP BY 1
|
||||
```
|
||||
|
||||
**SELECT with LIMIT - For exploratory queries:**
|
||||
```sql
|
||||
-- Good for testing/exploration
|
||||
SELECT *
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
LIMIT 1000
|
||||
```
|
||||
|
||||
**Impact:** CTAS skips JSON serialization, directly writing to partitioned tables, providing 5x better performance.
|
||||
|
||||
### 4. REGEXP_LIKE vs Multiple LIKE Clauses
|
||||
|
||||
Replace chained LIKE clauses with REGEXP_LIKE for better performance.
|
||||
|
||||
```sql
|
||||
-- Bad: Multiple LIKE clauses (slow)
|
||||
WHERE (
|
||||
column LIKE '%android%'
|
||||
OR column LIKE '%ios%'
|
||||
OR column LIKE '%mobile%'
|
||||
)
|
||||
|
||||
-- Good: Single REGEXP_LIKE (fast)
|
||||
WHERE REGEXP_LIKE(column, 'android|ios|mobile')
|
||||
```
|
||||
|
||||
**Impact:** Trino's optimizer cannot efficiently handle multiple LIKE clauses chained with OR, but can optimize REGEXP_LIKE.
|
||||
|
||||
### 5. Approximate Functions
|
||||
|
||||
Use APPROX_* functions for large-scale aggregations.
|
||||
|
||||
```sql
|
||||
-- Fast: Approximate distinct count
|
||||
SELECT APPROX_DISTINCT(user_id) as unique_users
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
|
||||
-- Slow: Exact distinct count (memory intensive)
|
||||
SELECT COUNT(DISTINCT user_id) as unique_users
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
```
|
||||
|
||||
**Available approximate functions:**
|
||||
- `APPROX_DISTINCT(column)` - Approximate unique count
|
||||
- `APPROX_PERCENTILE(column, percentile)` - Approximate percentile (e.g., 0.95 for p95)
|
||||
- `APPROX_SET(column)` - Returns HyperLogLog sketch for set operations
|
||||
|
||||
**Impact:** Approximate functions use HyperLogLog algorithm, dramatically reducing memory usage with ~2% error rate.
|
||||
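For instance, a daily latency and reach report might combine them (a sketch; the `requests` table and its columns are illustrative):

```sql
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  APPROX_DISTINCT(user_id) as unique_users,
  APPROX_PERCENTILE(response_time, 0.95) as p95_response_time
FROM requests
WHERE TD_INTERVAL(time, '-7d', 'JST')
GROUP BY 1
ORDER BY 1
```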
|
||||
### 6. JOIN Optimization
|
||||
|
||||
Keep the larger table on the left and the smaller table on the right side of the join when possible.
|
||||
|
||||
```sql
|
||||
-- Good: Small table joined to large table
|
||||
SELECT l.*, s.attribute
|
||||
FROM large_table l
|
||||
JOIN small_lookup s ON l.id = s.id
|
||||
WHERE TD_INTERVAL(l.time, '-1d', 'JST')
|
||||
|
||||
-- Consider: If one table is very small, use a subquery to reduce it first
|
||||
SELECT e.*
|
||||
FROM events e
|
||||
JOIN (
|
||||
SELECT user_id
|
||||
FROM premium_users
|
||||
WHERE subscription_status = 'active'
|
||||
) p ON e.user_id = p.user_id
|
||||
WHERE TD_INTERVAL(e.time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
**Magic Comments for Join Distribution:**
|
||||
|
||||
When joins fail with memory errors or run slowly, use magic comments to control join algorithm.
|
||||
|
||||
```sql
|
||||
-- BROADCAST: Small right table fits in memory
|
||||
-- set session join_distribution_type = 'BROADCAST'
|
||||
SELECT *
|
||||
FROM large_table, small_lookup
|
||||
WHERE large_table.id = small_lookup.id
|
||||
```
|
||||
|
||||
**Use BROADCAST when:** Right table is very small and fits in memory. Avoids partitioning overhead but uses more memory per node.
|
||||
|
||||
```sql
|
||||
-- PARTITIONED: Both tables large or memory issues
|
||||
-- set session join_distribution_type = 'PARTITIONED'
|
||||
SELECT *
|
||||
FROM large_table_a, large_table_b
|
||||
WHERE large_table_a.id = large_table_b.id
|
||||
```
|
||||
|
||||
**Use PARTITIONED when:** Both tables are large or right table doesn't fit in memory. Default algorithm that reduces memory per node.
|
||||
|
||||
**Tips:**
|
||||
- Always include time filters on all tables in JOIN
|
||||
- Consider using IN or EXISTS for single-column lookups (see the sketch after this list)
|
||||
- Avoid FULL OUTER JOIN when possible (expensive)
|
||||
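A sketch of the IN form for a single-column lookup (reusing the illustrative `premium_users` table from above):

```sql
-- Filter events to active premium users without materializing a full join
SELECT e.user_id, e.event_type, e.time
FROM events e
WHERE TD_INTERVAL(e.time, '-1d', 'JST')
  AND e.user_id IN (
    SELECT user_id
    FROM premium_users
    WHERE subscription_status = 'active'
  )
```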
|
||||
### 7. GROUP BY Optimization
|
||||
|
||||
Use column positions in GROUP BY for complex expressions.
|
||||
|
||||
```sql
|
||||
-- Good: Use column positions
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
event_type,
|
||||
COUNT(*) as cnt
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY 1, 2 -- Cleaner and avoids re-evaluation
|
||||
|
||||
-- Works but verbose:
|
||||
GROUP BY TD_TIME_STRING(time, 'd!', 'JST'), event_type
|
||||
```
|
||||
|
||||
### 8. Window Functions Optimization
|
||||
|
||||
Partition window functions appropriately to reduce memory usage.
|
||||
|
||||
```sql
|
||||
-- Good: Partition by high-cardinality column
|
||||
SELECT
|
||||
user_id,
|
||||
event_time,
|
||||
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as seq
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Be careful: Window over entire dataset (memory intensive)
|
||||
SELECT
|
||||
event_time,
|
||||
ROW_NUMBER() OVER (ORDER BY event_time) as global_seq
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
### 9. User-Defined Partitioning (UDP)
|
||||
|
||||
Hash partition tables on frequently queried columns for fast ID lookups and large joins.
|
||||
|
||||
```sql
|
||||
-- Create UDP table with bucketing on customer_id
|
||||
CREATE TABLE customer_events WITH (
|
||||
bucketed_on = array['customer_id'],
|
||||
bucket_count = 512
|
||||
) AS
|
||||
SELECT * FROM raw_events
|
||||
WHERE TD_INTERVAL(time, '-30d', 'JST')
|
||||
```
|
||||
|
||||
**When to use UDP:**
|
||||
- Fast lookups by specific IDs (needle-in-a-haystack queries)
|
||||
- Aggregations on specific columns
|
||||
- Very large joins on same keys
|
||||
- Tables with >100M rows
|
||||
|
||||
**Choosing bucketing columns:**
|
||||
- High cardinality: `customer_id`, `user_id`, `email`, `account_number`
|
||||
- Frequently used with equality predicates (`WHERE customer_id = 12345`)
|
||||
- Supported types: int, long, string
|
||||
- Maximum 3 columns
|
||||
|
||||
```sql
|
||||
-- Accelerated: Equality on all bucketing columns
|
||||
SELECT * FROM customer_events
|
||||
WHERE customer_id = 12345
|
||||
AND TD_INTERVAL(time, '-7d', 'JST')
|
||||
|
||||
-- NOT accelerated: Missing bucketing column
|
||||
SELECT * FROM customer_events
|
||||
WHERE TD_INTERVAL(time, '-7d', 'JST')
|
||||
```
|
||||
|
||||
**UDP for large joins:**
|
||||
```sql
|
||||
-- Both tables bucketed on same key enables colocated join
|
||||
-- set session join_distribution_type = 'PARTITIONED'
|
||||
-- set session colocated_join = 'true'
|
||||
SELECT a.*, b.*
|
||||
FROM customer_events_a a
|
||||
JOIN customer_events_b b ON a.customer_id = b.customer_id
|
||||
WHERE TD_INTERVAL(a.time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
**Impact:** UDP scans only relevant buckets, dramatically improving performance for ID lookups and reducing memory for large joins.
|
||||
|
||||
## Common Performance Issues
|
||||
|
||||
### Issue: Query Timeout
|
||||
|
||||
**Symptoms:**
|
||||
- Query exceeds maximum execution time
|
||||
- "Query exceeded maximum time" error
|
||||
|
||||
**Solutions:**
|
||||
1. Add or narrow time filters with TD_INTERVAL/TD_TIME_RANGE
|
||||
2. Reduce selected columns
|
||||
3. Use LIMIT for testing before full run
|
||||
4. Break into smaller queries with intermediate tables
|
||||
5. Use approximate functions instead of exact aggregations
|
||||
|
||||
**Example fix:**
|
||||
```sql
|
||||
-- Before: Times out
|
||||
SELECT COUNT(DISTINCT user_id)
|
||||
FROM events
|
||||
WHERE event_type = 'click'
|
||||
|
||||
-- After: Much faster
|
||||
SELECT APPROX_DISTINCT(user_id)
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
AND event_type = 'click'
|
||||
```
|
||||
|
||||
### Issue: Memory Limit Exceeded
|
||||
|
||||
**Symptoms:**
|
||||
- "Query exceeded per-node memory limit"
|
||||
- "Query exceeded distributed memory limit"
|
||||
|
||||
**Solutions:**
|
||||
1. Reduce data volume with time filters
|
||||
2. Use APPROX_DISTINCT instead of COUNT(DISTINCT)
|
||||
3. Reduce number of columns in SELECT
|
||||
4. Limit GROUP BY cardinality
|
||||
5. Optimize JOIN operations
|
||||
6. Process data in smaller time chunks
|
||||
|
||||
**Example fix:**
|
||||
```sql
|
||||
-- Before: Memory exceeded
|
||||
SELECT
|
||||
user_id,
|
||||
COUNT(DISTINCT session_id),
|
||||
COUNT(DISTINCT page_url),
|
||||
COUNT(DISTINCT referrer)
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY user_id
|
||||
|
||||
-- After: Uses approximate functions
|
||||
SELECT
|
||||
user_id,
|
||||
APPROX_DISTINCT(session_id) as sessions,
|
||||
APPROX_DISTINCT(page_url) as pages,
|
||||
APPROX_DISTINCT(referrer) as referrers
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY user_id
|
||||
```
|
||||
|
||||
### Issue: PAGE_TRANSPORT_TIMEOUT
|
||||
|
||||
**Symptoms:**
|
||||
- Frequent PAGE_TRANSPORT_TIMEOUT errors
|
||||
- Network-related query failures
|
||||
|
||||
**Solutions:**
|
||||
1. Narrow TD_TIME_RANGE to reduce data volume
|
||||
2. Reduce number of columns in SELECT
|
||||
3. Break large queries into smaller time ranges
|
||||
4. Use CTAS instead of SELECT for large results
|
||||
|
||||
**Example fix:**
|
||||
```sql
|
||||
-- Before: Transporting too much data
|
||||
SELECT *
|
||||
FROM events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31')
|
||||
|
||||
-- After: Process in chunks
|
||||
SELECT user_id, event_type, time
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
### Issue: Slow Aggregations
|
||||
|
||||
**Symptoms:**
|
||||
- COUNT(DISTINCT) taking very long
|
||||
- Large GROUP BY queries slow
|
||||
|
||||
**Solutions:**
|
||||
1. Use APPROX_DISTINCT instead of COUNT(DISTINCT)
|
||||
2. Filter data with time range first
|
||||
3. Consider pre-aggregating with intermediate tables
|
||||
4. Reduce GROUP BY dimensions
|
||||
|
||||
**Example fix:**
|
||||
```sql
|
||||
-- Before: Slow exact count
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
COUNT(DISTINCT user_id) as dau
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY 1
|
||||
|
||||
-- After: Fast approximate count
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
APPROX_DISTINCT(user_id) as dau
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY 1
|
||||
```
|
||||
|
||||
## Query Analysis Workflow
|
||||
|
||||
When optimizing a slow query, follow this workflow:
|
||||
|
||||
### Step 1: Add EXPLAIN
|
||||
|
||||
```sql
|
||||
EXPLAIN
|
||||
SELECT ...
|
||||
FROM ...
|
||||
WHERE ...
|
||||
```
|
||||
|
||||
Look for:
|
||||
- Full table scans (missing time filters)
|
||||
- High cardinality GROUP BY
|
||||
- Expensive JOINs
|
||||
- Missing filters
|
||||
|
||||
### Step 2: Check Time Filters
|
||||
|
||||
```sql
|
||||
-- Ensure a time filter exists and is specific, for example:
WHERE TD_INTERVAL(time, '-1d', 'JST')
-- or, with explicit dates:
-- WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
```
|
||||
|
||||
### Step 3: Reduce Column Count
|
||||
|
||||
```sql
|
||||
-- Select only needed columns
|
||||
SELECT user_id, event_type, time
|
||||
-- Not SELECT *
|
||||
```
|
||||
|
||||
### Step 4: Use Approximate Functions
|
||||
|
||||
```sql
|
||||
-- Replace COUNT(DISTINCT) with APPROX_DISTINCT
|
||||
-- Replace PERCENTILE with APPROX_PERCENTILE
|
||||
```
|
||||
|
||||
### Step 5: Test with LIMIT
|
||||
|
||||
```sql
|
||||
-- Test logic on small subset first
|
||||
SELECT ...
|
||||
FROM ...
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
LIMIT 1000
|
||||
```
|
||||
|
||||
### Step 6: Use CTAS for Large Results
|
||||
|
||||
```sql
|
||||
-- For queries returning many rows
|
||||
CREATE TABLE results AS
|
||||
SELECT ...
|
||||
```
|
||||
|
||||
## Advanced Optimization Techniques
|
||||
|
||||
### Pre-aggregation Strategy
|
||||
|
||||
For frequently-run queries, create pre-aggregated tables.
|
||||
|
||||
```sql
|
||||
-- Daily aggregation job
|
||||
CREATE TABLE daily_user_events AS
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
user_id,
|
||||
COUNT(*) as event_count,
|
||||
APPROX_DISTINCT(session_id) as sessions
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
GROUP BY 1, 2
|
||||
|
||||
-- Fast queries on pre-aggregated data
|
||||
SELECT
|
||||
date,
|
||||
SUM(event_count) as total_events,
|
||||
SUM(sessions) as total_sessions
|
||||
FROM daily_user_events
|
||||
WHERE date >= '2024-01-01'
|
||||
GROUP BY 1
|
||||
```
|
||||
|
||||
### Incremental Processing
|
||||
|
||||
Process large time ranges incrementally.
|
||||
|
||||
```sql
|
||||
-- Instead of processing 1 year at once
|
||||
-- Process day by day and INSERT INTO result table
|
||||
|
||||
-- Day 1
|
||||
INSERT INTO monthly_summary
|
||||
SELECT ...
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Day 2
|
||||
INSERT INTO monthly_summary
|
||||
SELECT ...
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-2d/-1d', 'JST')
|
||||
|
||||
-- etc.
|
||||
```
|
||||
|
||||
### Materialized Views Pattern
|
||||
|
||||
Create and maintain summary tables for common queries.
|
||||
|
||||
```sql
|
||||
-- Create summary table once
|
||||
CREATE TABLE user_daily_summary AS
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
user_id,
|
||||
COUNT(*) as events,
|
||||
APPROX_DISTINCT(page_url) as pages_visited
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-30d', 'JST')
|
||||
GROUP BY 1, 2
|
||||
|
||||
-- Query the summary (much faster)
|
||||
SELECT
|
||||
date,
|
||||
AVG(events) as avg_events_per_user,
|
||||
AVG(pages_visited) as avg_pages_per_user
|
||||
FROM user_daily_summary
|
||||
GROUP BY 1
|
||||
ORDER BY 1
|
||||
```
|
||||
|
||||
## Best Practices Checklist
|
||||
|
||||
Before running a query, verify:
|
||||
|
||||
- [ ] Time filter added (TD_INTERVAL or TD_TIME_RANGE)
|
||||
- [ ] Selecting specific columns (not SELECT *)
|
||||
- [ ] Using APPROX_DISTINCT for unique counts
|
||||
- [ ] Using REGEXP_LIKE instead of multiple LIKEs
|
||||
- [ ] Tested with LIMIT on small dataset first
|
||||
- [ ] Using CTAS for large result sets
|
||||
- [ ] All joined tables have time filters
|
||||
- [ ] GROUP BY uses reasonable cardinality
|
||||
- [ ] Window functions partition appropriately
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
Reducing query execution time also reduces cost:
|
||||
|
||||
1. **Time filters** - Scan less data
|
||||
2. **Column selection** - Read less data
|
||||
3. **Approximate functions** - Use less memory
|
||||
4. **CTAS** - Avoid expensive JSON serialization
|
||||
5. **Pre-aggregation** - Query summary tables instead of raw data
|
||||
|
||||
## Resources
|
||||
|
||||
- Use EXPLAIN to analyze query plans
|
||||
- Monitor query execution in TD Console
|
||||
- Check query statistics for memory and time usage
|
||||
- Consider Trino 423+ features for better performance
|
||||
502
skills/trino-to-hive-migration/SKILL.md
Normal file
502
skills/trino-to-hive-migration/SKILL.md
Normal file
@@ -0,0 +1,502 @@
|
||||
---
|
||||
name: trino-to-hive-migration
|
||||
description: Expert guidance for migrating queries from Trino to Hive when encountering memory errors, timeouts, or performance issues. Use this skill when Trino queries fail with memory limits or when batch processing requirements make Hive a better choice.
|
||||
---
|
||||
|
||||
# Trino to Hive Migration Guide
|
||||
|
||||
Expert assistance for converting Trino queries to Hive to resolve memory errors, timeouts, and resource constraints.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Trino queries fail with memory limit exceeded errors
|
||||
- Queries time out in Trino despite optimization
|
||||
- Processing very large datasets (months/years of data)
|
||||
- Running batch ETL jobs that don't require interactive speed
|
||||
- Encountering "Query exceeded per-node memory limit" errors
|
||||
- Need more stable execution for complex aggregations
|
||||
- Cost optimization for long-running queries
|
||||
|
||||
## Why Hive Often Solves Memory Problems
|
||||
|
||||
### Key Differences
|
||||
|
||||
**Trino (Fast but Memory-Intensive):**
|
||||
- In-memory processing for speed
|
||||
- Limited by node memory (per-node limits)
|
||||
- Best for interactive queries on moderate data volumes
|
||||
- Fast failures when memory exceeded
|
||||
|
||||
**Hive with Tez (Slower but More Scalable):**
|
||||
- TD's Hive uses Apache Tez execution engine (not traditional MapReduce)
|
||||
- Tez provides faster performance than classic MapReduce
|
||||
- Disk-based processing with memory spilling
|
||||
- Can handle much larger datasets
|
||||
- Spills to disk when memory insufficient
|
||||
- More fault-tolerant for large jobs
|
||||
- Better for batch processing
|
||||
|
||||
### When Hive is Better
|
||||
|
||||
✅ Use Hive when:
|
||||
- Processing > 1 month of data at once
|
||||
- Memory errors in Trino
|
||||
- Complex multi-way JOINs
|
||||
- High cardinality GROUP BY operations
|
||||
- Batch ETL (scheduled daily/hourly jobs)
|
||||
- Query doesn't need to be interactive
|
||||
|
||||
❌ Use Trino when:
|
||||
- Interactive/ad-hoc queries
|
||||
- Need results quickly (< 5 minutes)
|
||||
- Small to moderate data volumes
|
||||
- Real-time dashboards
|
||||
|
||||
## Migration Workflow
|
||||
|
||||
### Step 1: Identify the Problem
|
||||
|
||||
**Trino memory error example:**
|
||||
```
|
||||
Query exceeded per-node memory limit of 10GB
|
||||
Query exceeded distributed memory limit of 100GB
|
||||
```
|
||||
|
||||
**When you see this:**
|
||||
1. First try Trino optimization (see trino-optimizer skill)
|
||||
2. If optimization doesn't help, migrate to Hive
|
||||
|
||||
### Step 2: Convert Query Syntax
|
||||
|
||||
Most Trino queries work in Hive with minor changes.
|
||||
|
||||
## Syntax Conversion Guide
|
||||
|
||||
### Basic SELECT (Usually Compatible)
|
||||
|
||||
```sql
|
||||
-- Trino
|
||||
SELECT
|
||||
user_id,
|
||||
COUNT(*) as event_count
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY user_id
|
||||
|
||||
-- Hive (same syntax)
|
||||
SELECT
|
||||
user_id,
|
||||
COUNT(*) as event_count
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
GROUP BY user_id
|
||||
```
|
||||
|
||||
### Time Functions
|
||||
|
||||
**TD_TIME_STRING (Trino only):**
|
||||
```sql
|
||||
-- Trino
|
||||
SELECT TD_TIME_STRING(time, 'd!', 'JST') as date
|
||||
|
||||
-- Hive: Use TD_TIME_FORMAT instead
|
||||
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date
|
||||
```
|
||||
|
||||
**TD_TIME_RANGE (Compatible):**
|
||||
```sql
|
||||
-- Both Trino and Hive
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
|
||||
```
|
||||
|
||||
**TD_INTERVAL (Compatible):**
|
||||
```sql
|
||||
-- Both Trino and Hive
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
```
|
||||
|
||||
### Approximate Functions
|
||||
|
||||
**APPROX_DISTINCT (Compatible via Hivemall!):**
|
||||
```sql
|
||||
-- Trino
|
||||
SELECT APPROX_DISTINCT(user_id) as unique_users
|
||||
|
||||
-- Hive: SAME SYNTAX! approx_distinct is available via Hivemall
|
||||
SELECT approx_distinct(user_id) as unique_users
|
||||
-- Also available as: approx_count_distinct(user_id)
|
||||
-- Uses HyperLogLog algorithm, same as Trino
|
||||
|
||||
-- Hive Option 2: Exact count (slower but accurate)
|
||||
SELECT COUNT(DISTINCT user_id) as unique_users
|
||||
|
||||
-- Hive Option 3: Sampling (crude estimate only; distinct counts do not scale linearly with row sampling)
|
||||
SELECT COUNT(DISTINCT user_id) * 10 as estimated_users
|
||||
FROM table_name
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
AND rand() < 0.1 -- 10% sample
|
||||
```
|
||||
|
||||
**Good news:** `approx_distinct()` works in both Trino and Hive! Hivemall provides this function (along with `approx_count_distinct()` as an alias) using the same HyperLogLog algorithm, so you often don't need to change this function when migrating.
|
||||
|
||||
**APPROX_PERCENTILE (Trino) → PERCENTILE_APPROX (Hive):**
|
||||
```sql
|
||||
-- Trino
|
||||
SELECT APPROX_PERCENTILE(response_time, 0.95) as p95
|
||||
|
||||
-- Hive: use PERCENTILE_APPROX (PERCENTILE is exact but accepts integer columns only)
SELECT PERCENTILE_APPROX(response_time, 0.95) as p95
|
||||
```
|
||||
|
||||
### Array and String Functions
|
||||
|
||||
**ARRAY_AGG (Trino) → COLLECT_LIST (Hive):**
|
||||
```sql
|
||||
-- Trino
|
||||
SELECT ARRAY_AGG(product_id) as products
|
||||
|
||||
-- Hive
|
||||
SELECT COLLECT_LIST(product_id) as products
|
||||
```
|
||||
|
||||
**STRING_AGG (Trino) → CONCAT_WS + COLLECT_LIST (Hive):**
|
||||
```sql
|
||||
-- Trino
|
||||
SELECT STRING_AGG(name, ', ') as names
|
||||
|
||||
-- Hive
|
||||
SELECT CONCAT_WS(', ', COLLECT_LIST(name)) as names
|
||||
```
|
||||
|
||||
**SPLIT (Compatible but different behavior):**
|
||||
```sql
|
||||
-- Trino: Returns array (1-indexed)
SELECT SPLIT(text, ',')[1]  -- First element

-- Hive: Returns array (0-indexed)
SELECT SPLIT(text, ',')[0]  -- First element
|
||||
```
|
||||
|
||||
### REGEXP Functions
|
||||
|
||||
**REGEXP_EXTRACT (Compatible):**
|
||||
```sql
|
||||
-- Both Trino and Hive
|
||||
SELECT REGEXP_EXTRACT(url, 'product_id=([0-9]+)', 1)
|
||||
```
|
||||
|
||||
**REGEXP_LIKE (Trino) → RLIKE (Hive):**
|
||||
```sql
|
||||
-- Trino
|
||||
WHERE REGEXP_LIKE(column, 'pattern')
|
||||
|
||||
-- Hive
|
||||
WHERE column RLIKE 'pattern'
|
||||
```
|
||||
|
||||
### Window Functions (Mostly Compatible)
|
||||
|
||||
```sql
|
||||
-- Both support similar syntax
|
||||
SELECT
|
||||
user_id,
|
||||
event_time,
|
||||
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as seq,
|
||||
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
```
|
||||
|
||||
### JOINs (Compatible)
|
||||
|
||||
**MAPJOIN hint in Hive for small tables:**
|
||||
```sql
|
||||
-- Trino (automatic small table optimization)
|
||||
SELECT *
|
||||
FROM large_table l
|
||||
JOIN small_table s ON l.id = s.id
|
||||
|
||||
-- Hive (explicit hint for better performance)
|
||||
SELECT /*+ MAPJOIN(small_table) */ *
|
||||
FROM large_table l
|
||||
JOIN small_table s ON l.id = s.id
|
||||
```
|
||||
|
||||
### CAST Functions (Mostly Compatible)
|
||||
|
||||
```sql
|
||||
-- Both Trino and Hive
|
||||
CAST(column AS BIGINT)
|
||||
CAST(column AS VARCHAR)
|
||||
CAST(column AS DOUBLE)
|
||||
|
||||
-- Trino: TRY_CAST (returns NULL on failure)
|
||||
TRY_CAST(column AS BIGINT)
|
||||
|
||||
-- Hive: No TRY_CAST, use CAST with NULL handling
|
||||
CAST(column AS BIGINT) -- Returns NULL on failure in Hive
|
||||
```
|
||||
|
||||
## Complete Migration Examples
|
||||
|
||||
### Example 1: Memory Error with Large Aggregation
|
||||
|
||||
**Original Trino query (fails with memory error):**
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
APPROX_DISTINCT(session_id) as sessions,
|
||||
COUNT(*) as events
|
||||
FROM events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31')
|
||||
GROUP BY user_id, TD_TIME_STRING(time, 'd!', 'JST')
|
||||
```
|
||||
|
||||
**Converted to Hive (works):**
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
|
||||
approx_distinct(session_id) as sessions, -- Hivemall - same as Trino!
|
||||
COUNT(*) as events
|
||||
FROM events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31', 'JST')
|
||||
GROUP BY user_id, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')
|
||||
```
|
||||
|
||||
### Example 2: Complex Multi-Way JOIN
|
||||
|
||||
**Original Trino query (times out):**
|
||||
```sql
|
||||
SELECT
|
||||
e.user_id,
|
||||
u.user_name,
|
||||
p.product_name,
|
||||
o.order_total
|
||||
FROM events e
|
||||
JOIN users u ON e.user_id = u.user_id
|
||||
JOIN products p ON e.product_id = p.product_id
|
||||
JOIN orders o ON e.order_id = o.order_id
|
||||
WHERE TD_INTERVAL(e.time, '-3M', 'JST')
|
||||
AND REGEXP_LIKE(e.event_type, 'purchase|checkout')
|
||||
```
|
||||
|
||||
**Converted to Hive (works):**
|
||||
```sql
|
||||
SELECT
|
||||
e.user_id,
|
||||
u.user_name,
|
||||
p.product_name,
|
||||
o.order_total
|
||||
FROM events e
|
||||
JOIN users u ON e.user_id = u.user_id
|
||||
JOIN products p ON e.product_id = p.product_id
|
||||
JOIN orders o ON e.order_id = o.order_id
|
||||
WHERE TD_INTERVAL(e.time, '-3M', 'JST')
|
||||
AND e.event_type RLIKE 'purchase|checkout'
|
||||
```
|
||||
|
||||
### Example 3: High Cardinality GROUP BY
|
||||
|
||||
**Original Trino query (memory exceeded):**
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
session_id,
|
||||
page_url,
|
||||
referrer,
|
||||
COUNT(*) as page_views
|
||||
FROM page_views
|
||||
WHERE TD_INTERVAL(time, '-6M', 'JST')
|
||||
GROUP BY user_id, session_id, page_url, referrer
|
||||
```
|
||||
|
||||
**Converted to Hive (works):**
|
||||
```sql
|
||||
-- Same syntax, but Hive handles high cardinality better
|
||||
SELECT
|
||||
user_id,
|
||||
session_id,
|
||||
page_url,
|
||||
referrer,
|
||||
COUNT(*) as page_views
|
||||
FROM page_views
|
||||
WHERE TD_INTERVAL(time, '-6M', 'JST')
|
||||
GROUP BY user_id, session_id, page_url, referrer
|
||||
```
|
||||
|
||||
## Common Syntax Differences Summary
|
||||
|
||||
| Feature | Trino | Hive |
|
||||
|---------|-------|------|
|
||||
| Execution engine | In-memory | Tez (optimized MapReduce) |
|
||||
| Time formatting | `TD_TIME_STRING(time, 'd!', 'JST')` | `TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')` |
|
||||
| Approximate distinct | `APPROX_DISTINCT(col)` | `approx_distinct(col)` (Hivemall - compatible!) |
|
||||
| Approximate percentile | `APPROX_PERCENTILE(col, 0.95)` | `PERCENTILE_APPROX(col, 0.95)` |
|
||||
| Array aggregation | `ARRAY_AGG(col)` | `COLLECT_LIST(col)` |
|
||||
| String aggregation | `STRING_AGG(col, ',')` | `CONCAT_WS(',', COLLECT_LIST(col))` |
|
||||
| Regex matching | `REGEXP_LIKE(col, 'pattern')` | `col RLIKE 'pattern'` |
|
||||
| Try cast | `TRY_CAST(col AS type)` | `CAST(col AS type)` (returns NULL on failure) |
|
||||
| Small table join hint | Automatic | `/*+ MAPJOIN(table) */` |
|
||||
|
||||
## Migration Checklist
|
||||
|
||||
Before converting from Trino to Hive:
|
||||
|
||||
- [ ] Confirm memory error or timeout in Trino
|
||||
- [ ] Try Trino optimization first (time filters, column reduction, APPROX functions)
|
||||
- [ ] Replace `TD_TIME_STRING` with `TD_TIME_FORMAT`
|
||||
- [ ] Keep `approx_distinct` as-is (compatible via Hivemall!) or use `COUNT(DISTINCT)` for exact counts
|
||||
- [ ] Replace `REGEXP_LIKE` with `RLIKE`
|
||||
- [ ] Replace `ARRAY_AGG` with `COLLECT_LIST`
|
||||
- [ ] Replace `STRING_AGG` with `CONCAT_WS` + `COLLECT_LIST`
|
||||
- [ ] Add `/*+ MAPJOIN */` hints for small lookup tables
|
||||
- [ ] Test query on small time range first
|
||||
- [ ] Verify results match expected output
|
||||
|
||||
**Note:** TD's Hive uses Apache Tez for execution (faster than classic MapReduce) and includes Hivemall library for machine learning and approximate functions.
|
||||
|
||||
## Performance Tips for Hive
|
||||
|
||||
Once migrated to Hive, optimize with these techniques:
|
||||
|
||||
### 1. Use MAPJOIN for Small Tables
|
||||
|
||||
```sql
|
||||
SELECT /*+ MAPJOIN(small_table) */ *
|
||||
FROM large_table l
|
||||
JOIN small_table s ON l.id = s.id
|
||||
WHERE TD_INTERVAL(l.time, '-1M', 'JST')
|
||||
```
|
||||
|
||||
### 2. Enable Dynamic Partitioning
|
||||
|
||||
```sql
|
||||
SET hive.exec.dynamic.partition = true;
|
||||
SET hive.exec.dynamic.partition.mode = nonstrict;
|
||||
|
||||
INSERT OVERWRITE TABLE target_table PARTITION(dt)
|
||||
SELECT *, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt
|
||||
FROM source_table
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
```
|
||||
|
||||
### 3. Process in Chunks for Very Large Datasets
|
||||
|
||||
```sql
|
||||
-- Process 1 day at a time instead of entire year
|
||||
INSERT INTO summary_table
|
||||
SELECT ...
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
-- Run daily, much more stable than processing full year
|
||||
```
|
||||
|
||||
### 4. Use LIMIT During Development
|
||||
|
||||
```sql
|
||||
-- Test logic first
|
||||
SELECT ...
|
||||
FROM events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
LIMIT 1000
|
||||
```
|
||||
|
||||
## When Migration Doesn't Help
|
||||
|
||||
If Hive also fails or is too slow:
|
||||
|
||||
1. **Break into smaller time ranges**: Process monthly instead of yearly
|
||||
2. **Pre-aggregate data**: Create intermediate summary tables
|
||||
3. **Reduce dimensions**: Fewer GROUP BY columns
|
||||
4. **Sample data**: Use `rand() < 0.1` for 10% sample
|
||||
5. **Incremental processing**: Process new data only, merge with historical
|
||||
|
||||
## Switching Between Engines in TD
|
||||
|
||||
### In TD Console:
|
||||
- Change "Engine" dropdown from "Presto" to "Hive"
|
||||
|
||||
### In Digdag Workflows:
|
||||
```yaml
|
||||
# Trino
+query_trino:
  td>: queries/analysis.sql
  engine: presto  # or trino
|
||||
|
||||
# Hive
+query_hive:
  td>: queries/analysis.sql
  engine: hive
|
||||
```
|
||||
|
||||
### In TD Toolbelt:
|
||||
```bash
|
||||
# Trino
|
||||
td query -d database_name -T presto "SELECT ..."
|
||||
|
||||
# Hive
|
||||
td query -d database_name -T hive "SELECT ..."
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Start with Trino**: Try optimization first before switching to Hive
|
||||
2. **Use Hive for batch jobs**: Schedule heavy processing in Hive
|
||||
3. **Keep interactive queries in Trino**: Better for ad-hoc analysis
|
||||
4. **Test conversions**: Verify results match between engines
|
||||
5. **Document engine choice**: Note why Hive was needed (for future reference)
|
||||
6. **Monitor execution**: Check Hive job progress in TD Console
|
||||
7. **Set expectations**: Hive takes longer but handles larger data
|
||||
|
||||
## Common Issues After Migration
|
||||
|
||||
### Issue: Query Still Fails in Hive
|
||||
|
||||
**Solutions:**
|
||||
- Reduce time range further
|
||||
- Process incrementally (day by day)
|
||||
- Reduce GROUP BY cardinality
|
||||
- Use sampling for estimation
|
||||
|
||||
### Issue: Hive Much Slower Than Expected
|
||||
|
||||
**Solutions:**
|
||||
- Add MAPJOIN hints for small tables
|
||||
- Ensure time filters are present
|
||||
- Select only needed columns
|
||||
- Check if Trino optimization would work after all
|
||||
|
||||
### Issue: Results Different Between Trino and Hive
|
||||
|
||||
**Check:**
|
||||
- Approximate vs exact functions
|
||||
- NULL handling differences
|
||||
- Timezone in time functions
|
||||
- Array/string function behavior
|
||||
|
||||
## Examples: Memory Error Resolution
|
||||
|
||||
### Before (Trino - Memory Error):
|
||||
```
|
||||
Query exceeded per-node memory limit
|
||||
```
|
||||
|
||||
### After (Hive - Success):
|
||||
```sql
|
||||
-- Same query structure, just use Hive engine
|
||||
-- Typically 2-5x slower but completes successfully
|
||||
```
|
||||
|
||||
### Typical Timeline:
|
||||
- Trino: Fails after 5 minutes with memory error
|
||||
- Hive: Completes in 20-30 minutes successfully
|
||||
|
||||
## Resources
|
||||
|
||||
- TD Console: Switch between engines easily
|
||||
- Check query logs to see memory usage patterns
|
||||
- Use EXPLAIN in both engines to understand execution plans
|
||||
- Monitor long-running Hive jobs in TD Console
|
||||
308
skills/trino/SKILL.md
Normal file
308
skills/trino/SKILL.md
Normal file
@@ -0,0 +1,308 @@
|
||||
---
|
||||
name: trino
|
||||
description: Expert assistance for writing, analyzing, and optimizing Trino SQL queries for Treasure Data. Use this skill when users need help with Trino queries, performance optimization, or TD-specific SQL patterns.
|
||||
---
|
||||
|
||||
# Trino SQL Expert
|
||||
|
||||
Expert assistance for writing and optimizing Trino SQL queries for Treasure Data environments.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Writing new Trino SQL queries for TD
|
||||
- Optimizing existing Trino queries for performance
|
||||
- Debugging Trino query errors or issues
|
||||
- Converting queries from other SQL dialects to Trino
|
||||
- Implementing TD best practices for data processing
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. TD Table Naming Conventions
|
||||
|
||||
Always use the TD table format:
|
||||
```sql
|
||||
SELECT * FROM database_name.table_name
|
||||
```
|
||||
|
||||
### 2. Partitioning and Time-based Queries
|
||||
|
||||
TD tables are typically partitioned by time. Always include time filters for performance:
|
||||
|
||||
```sql
|
||||
SELECT *
|
||||
FROM database_name.table_name
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
```
|
||||
|
||||
Or use relative time ranges:
|
||||
```sql
|
||||
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME())
|
||||
```
|
||||
|
||||
### 3. Performance Optimization
|
||||
|
||||
**Use APPROX functions for large datasets:**
|
||||
```sql
|
||||
SELECT
|
||||
APPROX_DISTINCT(user_id) as unique_users,
|
||||
APPROX_PERCENTILE(response_time, 0.95) as p95_response
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
```
|
||||
|
||||
**Partition pruning:**
|
||||
```sql
|
||||
-- Good: Filters on partition column
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
|
||||
|
||||
-- Avoid: Non-partition column filters without time filter
|
||||
WHERE event_type = 'click' -- Missing time filter!
|
||||
```
|
||||
|
||||
**Limit data scanned:**
|
||||
```sql
|
||||
-- Use LIMIT for exploratory queries
|
||||
SELECT * FROM table_name
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
LIMIT 1000
|
||||
```
|
||||
|
||||
### 4. Common TD Functions
|
||||
|
||||
**TD_INTERVAL** - Simplified relative time filtering (Recommended):
|
||||
```sql
|
||||
-- Current day
|
||||
WHERE TD_INTERVAL(time, '1d', 'JST')
|
||||
|
||||
-- Yesterday
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
|
||||
-- Previous week
|
||||
WHERE TD_INTERVAL(time, '-1w', 'JST')
|
||||
|
||||
-- Previous month
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
|
||||
-- 2 days ago (offset syntax)
|
||||
WHERE TD_INTERVAL(time, '-1d/-1d', 'JST')
|
||||
|
||||
-- 3 months ago (combined offset)
|
||||
WHERE TD_INTERVAL(time, '-1M/-2M', 'JST')
|
||||
```
|
||||
|
||||
**Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. It cannot accept TD_SCHEDULED_TIME as its first argument, but including TD_SCHEDULED_TIME elsewhere in the query establishes the reference date.
|
||||
|
||||
**TD_TIME_RANGE** - Filter by time partitions (explicit dates):
|
||||
```sql
|
||||
TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
TD_TIME_RANGE(time, '2024-01-01') -- Single day
|
||||
```
|
||||
|
||||
**TD_SCHEDULED_TIME()** - Get scheduled execution time:
|
||||
```sql
|
||||
TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d') -- Yesterday
|
||||
```
|
||||
|
||||
**TD_TIME_STRING** - Format timestamps (Recommended):
|
||||
```sql
|
||||
-- Uses simple format codes instead of full format strings
|
||||
TD_TIME_STRING(time, 'd!', 'JST') -- Returns: 2018-09-13
|
||||
TD_TIME_STRING(time, 's!', 'UTC') -- Returns: 2018-09-13 16:45:34
|
||||
TD_TIME_STRING(time, 'M!', 'JST') -- Returns: 2018-09 (year-month)
|
||||
TD_TIME_STRING(time, 'h!', 'UTC') -- Returns: 2018-09-13 16 (year-month-day hour)
|
||||
|
||||
-- With timezone in output (without ! suffix)
|
||||
TD_TIME_STRING(time, 'd', 'JST') -- Returns: 2018-09-13 00:00:00+0900
|
||||
TD_TIME_STRING(time, 's', 'UTC') -- Returns: 2018-09-13 16:45:34+0000
|
||||
```
|
||||
|
||||
**Format codes:**
|
||||
- `y!` = yyyy (year only)
|
||||
- `q!` = yyyy-MM (quarter start)
|
||||
- `M!` = yyyy-MM (month)
|
||||
- `w!` = yyyy-MM-dd (week start)
|
||||
- `d!` = yyyy-MM-dd (day)
|
||||
- `h!` = yyyy-MM-dd HH (hour)
|
||||
- `m!` = yyyy-MM-dd HH:mm (minute)
|
||||
- `s!` = yyyy-MM-dd HH:mm:ss (second)
|
||||
- Without the exclamation mark suffix, timezone offset is included
|
||||
|
||||
**TD_TIME_FORMAT** - Format timestamps (Legacy, use TD_TIME_STRING instead):
|
||||
```sql
|
||||
TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'UTC')
|
||||
```
|
||||
|
||||
**TD_SESSIONIZE** - Sessionize events:
|
||||
```sql
|
||||
SELECT TD_SESSIONIZE(time, 1800, user_id) as session_id
|
||||
FROM events
|
||||
```
|
||||
|
||||
### 5. JOIN Optimization
|
||||
|
||||
**Put smaller table on the right side:**
|
||||
```sql
|
||||
-- Good
|
||||
SELECT *
|
||||
FROM large_table l
|
||||
JOIN small_table s ON l.id = s.id
|
||||
|
||||
-- Consider table size when joining
|
||||
```
|
||||
|
||||
**Use appropriate JOIN types:**
|
||||
```sql
|
||||
-- INNER JOIN for matching records only
|
||||
-- LEFT JOIN when you need all records from left table
|
||||
-- Avoid FULL OUTER JOIN when possible (expensive)
|
||||
```
|
||||
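A sketch of a LEFT JOIN that keeps the time filter on the joined table without discarding unmatched rows (table names are illustrative):

```sql
-- Keep every user; count January orders (users with none get 0)
SELECT
  u.user_id,
  COUNT(o.order_id) as january_orders
FROM database_name.users u
LEFT JOIN (
  SELECT user_id, order_id
  FROM database_name.orders
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
) o ON u.user_id = o.user_id
GROUP BY 1
```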
|
||||
### 6. Data Types and Casting
|
||||
|
||||
Be explicit with data types:
|
||||
```sql
|
||||
CAST(column_name AS BIGINT)
|
||||
CAST(column_name AS VARCHAR)
|
||||
CAST(column_name AS DOUBLE)
|
||||
TRY_CAST(column_name AS BIGINT) -- Returns NULL on failure
|
||||
```
|
||||
|
||||
### 7. Window Functions
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
user_id,
|
||||
event_time,
|
||||
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq,
|
||||
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
|
||||
FROM events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01')
|
||||
```
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### User Event Analysis
|
||||
```sql
|
||||
-- Using TD_INTERVAL for last month
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
event_type,
|
||||
COUNT(*) as event_count,
|
||||
APPROX_DISTINCT(user_id) as unique_users
|
||||
FROM database_name.events
|
||||
WHERE TD_INTERVAL(time, '-1M', 'JST')
|
||||
AND event_type IN ('page_view', 'click', 'purchase')
|
||||
GROUP BY 1, 2
|
||||
ORDER BY 1, 2
|
||||
```
|
||||
|
||||
**Alternative with explicit date range:**
|
||||
```sql
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
event_type,
|
||||
COUNT(*) as event_count,
|
||||
APPROX_DISTINCT(user_id) as unique_users
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
AND event_type IN ('page_view', 'click', 'purchase')
|
||||
GROUP BY 1, 2
|
||||
ORDER BY 1, 2
|
||||
```
|
||||
|
||||
### Conversion Funnel
|
||||
```sql
|
||||
WITH events_filtered AS (
|
||||
SELECT
|
||||
user_id,
|
||||
event_type,
|
||||
time
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
|
||||
)
|
||||
SELECT
|
||||
COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) as step1_users,
|
||||
COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) as step2_users,
|
||||
COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as step3_users
|
||||
FROM events_filtered
|
||||
```
|
||||
|
||||
### Daily Aggregation
|
||||
```sql
|
||||
-- Using TD_INTERVAL for yesterday's data
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
COUNT(*) as total_events,
|
||||
APPROX_DISTINCT(user_id) as daily_active_users,
|
||||
AVG(session_duration) as avg_session_duration
|
||||
FROM database_name.events
|
||||
WHERE TD_INTERVAL(time, '-1d', 'JST')
|
||||
AND TD_SCHEDULED_TIME() IS NOT NULL -- Establishes reference date for TD_INTERVAL
|
||||
GROUP BY 1
|
||||
ORDER BY 1
|
||||
```
|
||||
|
||||
**For rolling 30-day window:**
|
||||
```sql
|
||||
SELECT
|
||||
TD_TIME_STRING(time, 'd!', 'JST') as date,
|
||||
COUNT(*) as total_events,
|
||||
APPROX_DISTINCT(user_id) as daily_active_users,
|
||||
AVG(session_duration) as avg_session_duration
|
||||
FROM database_name.events
|
||||
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-30d'), TD_SCHEDULED_TIME())
|
||||
GROUP BY 1
|
||||
ORDER BY 1
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Common Errors
|
||||
|
||||
**"Line X:Y: Column 'time' cannot be resolved"**
|
||||
- Ensure table name is correct
|
||||
- Check that column exists in table schema
|
||||
|
||||
**"Query exceeded memory limit"**
|
||||
- Add time filters with TD_TIME_RANGE
|
||||
- Use APPROX_ functions instead of exact aggregations (see the sketch at the end of this section)
|
||||
- Reduce JOIN complexity or data volume
|
||||
|
||||
**"Partition not found"**
|
||||
- Verify time range covers existing partitions
|
||||
- Check TD_TIME_RANGE syntax
|
||||
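A minimal before/after sketch for the memory-limit case above:

```sql
-- Before: scans every partition and keeps exact distinct state in memory
SELECT COUNT(DISTINCT user_id) FROM database_name.events;

-- After: partition pruning plus approximate aggregation
SELECT APPROX_DISTINCT(user_id)
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31');
```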
|
||||
## Best Practices
|
||||
|
||||
1. **Always include time filters** using TD_INTERVAL or TD_TIME_RANGE for partition pruning
|
||||
- Use TD_INTERVAL for relative dates: `WHERE TD_INTERVAL(time, '-1d', 'JST')`
|
||||
- Use TD_TIME_RANGE for explicit dates: `WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')`
|
||||
- Never filter by formatted dates: ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'`
|
||||
2. **Use TD_TIME_STRING for display only**, not for filtering
|
||||
- ✅ `SELECT TD_TIME_STRING(time, 'd!', 'JST') as date`
|
||||
- ❌ `WHERE TD_TIME_STRING(time, 'd!', 'JST') = '2024-01-01'`
|
||||
3. **Use APPROX functions** for large-scale aggregations (APPROX_DISTINCT, APPROX_PERCENTILE)
|
||||
4. **Limit exploratory queries** to reduce costs and scan time
|
||||
5. **Test queries on small time ranges** before running on full dataset
|
||||
6. **Use CTEs (WITH clauses)** for complex queries to improve readability
|
||||
7. **Add comments** explaining business logic
|
||||
8. **Consider materialized results** for frequently-run queries
|
||||
|
||||
## Example Workflow
|
||||
|
||||
When helping users write Trino queries (a worked sketch follows these steps):
|
||||
|
||||
1. **Understand the requirement** - What data do they need?
|
||||
2. **Identify tables** - Which TD tables contain the data?
|
||||
3. **Add time filters** - What time range is needed?
|
||||
4. **Write base query** - Start with simple SELECT
|
||||
5. **Add aggregations** - Use appropriate functions
|
||||
6. **Optimize** - Apply performance best practices
|
||||
7. **Test** - Validate results on small dataset first
|
||||
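A compact query that reflects these steps (database, table, and column names are illustrative):

```sql
-- Time filter, specific columns, approximate aggregation, limited output
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  event_type,
  APPROX_DISTINCT(user_id) as unique_users
FROM database_name.events
WHERE TD_INTERVAL(time, '-7d', 'JST')
GROUP BY 1, 2
ORDER BY 1, 2
LIMIT 100
```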
|
||||
## Resources
|
||||
|
||||
- Trino SQL documentation: https://trino.io/docs/current/
|
||||
- TD-specific functions: Check internal TD documentation
|
||||
- Query performance: Use EXPLAIN for query plans
|
||||