Initial commit
427
skills/hive/SKILL.md
Normal file
@@ -0,0 +1,427 @@
---
name: hive
description: Expert assistance for writing, analyzing, and optimizing Hive SQL queries for Treasure Data. Use this skill when users need help with Hive queries, MapReduce optimization, or legacy TD Hive workflows.
---

# Hive SQL Expert

Expert assistance for writing and optimizing Hive SQL queries for Treasure Data environments.

## When to Use This Skill

Use this skill when:
- Writing Hive SQL queries for TD
- Maintaining or updating legacy Hive workflows
- Optimizing Hive query performance
- Converting queries to/from Hive dialect
- Working with Hive-specific features (SerDes, UDFs, etc.)

## Core Principles

### 1. TD Table Access

Access TD tables using database.table notation:
```sql
SELECT * FROM database_name.table_name
```

### 2. Time-based Partitioning

TD Hive tables are partitioned by time. Always include a time predicate so the query scans only the partitions it needs:

```sql
SELECT *
FROM database_name.table_name
-- start_time is inclusive, end_time is exclusive
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
```

Unix timestamp format:
```sql
WHERE time >= unix_timestamp('2024-01-01 00:00:00')
  AND time < unix_timestamp('2024-01-02 00:00:00')
```

### 3. Performance Optimization

**Use columnar formats:**
- TD stores table data in a columnar format, so a query reads only the columns it references
- Select only needed columns to reduce I/O
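
As a rough illustration of column projection, the sketch below reads three columns instead of using `SELECT *`. The table `database_name.events` and the columns `user_id` and `event_type` are assumed names for illustration, not a real schema.

```sql
-- Hypothetical table and columns: only the listed columns are read.
SELECT
  user_id,
  event_type,
  time
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
```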

**Partition pruning:**
```sql
-- Good: Uses partition columns
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')

-- Good: Direct time filter
WHERE time >= 1704067200 AND time < 1704153600
```

**Limit during development:**
```sql
SELECT * FROM table_name
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
LIMIT 1000
```

### 4. Common TD Hive Functions

**TD_INTERVAL** - Simplified relative time filtering (Recommended):
```sql
-- Current day
WHERE TD_INTERVAL(time, '1d', 'JST')

-- Yesterday
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Previous week
WHERE TD_INTERVAL(time, '-1w', 'JST')

-- Previous month
WHERE TD_INTERVAL(time, '-1M', 'JST')

-- 2 days ago (offset syntax)
WHERE TD_INTERVAL(time, '-1d/-1d', 'JST')

-- 3 months ago (combined offset)
WHERE TD_INTERVAL(time, '-1M/-2M', 'JST')
```

**Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. It cannot take TD_SCHEDULED_TIME as its first argument, but including TD_SCHEDULED_TIME elsewhere in the query establishes the reference date.
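
A minimal sketch of the note above: TD_SCHEDULED_TIME appears in the SELECT list rather than inside TD_INTERVAL, so the interval is resolved against the scheduled run. The table `database_name.events` and the column `user_id` are assumed names.

```sql
-- Hypothetical table and column names.
-- TD_SCHEDULED_TIME() is used elsewhere in the query, not as an
-- argument to TD_INTERVAL.
SELECT
  time,
  user_id,
  TD_SCHEDULED_TIME() AS scheduled_at
FROM database_name.events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```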

**TD_TIME_RANGE** - Partition-aware time filtering (explicit dates):
```sql
TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')  -- Open-ended
```

**TD_SCHEDULED_TIME()** - Get workflow execution time:
```sql
SELECT TD_SCHEDULED_TIME()
-- Returns Unix timestamp of scheduled run
```

**TD_TIME_FORMAT** - Format Unix timestamps:
```sql
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST')
```

**TD_TIME_PARSE** - Parse string to Unix timestamp:
```sql
SELECT TD_TIME_PARSE('2024-01-01', 'JST')
```

**TD_DATE_TRUNC** - Truncate timestamp to day/hour/etc:
```sql
SELECT TD_DATE_TRUNC('day', time, 'JST')
SELECT TD_DATE_TRUNC('hour', time, 'UTC')
```
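
To show where truncation is typically useful, here is a sketch that buckets events by hour. The table name is an assumption, and `hour_start` is the truncated Unix timestamp, not a formatted string.

```sql
-- Hypothetical table name.
SELECT
  TD_DATE_TRUNC('hour', time, 'JST') AS hour_start,
  COUNT(*) AS event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
GROUP BY TD_DATE_TRUNC('hour', time, 'JST')
ORDER BY hour_start
```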

### 5. JOIN Optimization

**MapReduce JOIN strategies:**

```sql
-- Map-side JOIN for small tables (use /*+ MAPJOIN */ hint)
SELECT /*+ MAPJOIN(small_table) */
  l.*,
  s.attribute
FROM large_table l
JOIN small_table s ON l.id = s.id
WHERE TD_TIME_RANGE(l.time, '2024-01-01', '2024-01-02')
```

**Reduce-side JOIN:**
```sql
-- Default for large-to-large joins
SELECT *
FROM table1 t1
JOIN table2 t2 ON t1.key = t2.key
WHERE TD_TIME_RANGE(t1.time, '2024-01-01', '2024-01-02')
  AND TD_TIME_RANGE(t2.time, '2024-01-01', '2024-01-02')
```

### 6. Aggregations

**Standard aggregations:**
```sql
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as event_date,
  COUNT(*) as total_count,
  COUNT(DISTINCT user_id) as unique_users,
  AVG(value) as avg_value,
  SUM(amount) as total_amount
FROM database_name.events
-- Same timezone as TD_TIME_FORMAT so the range and the day buckets line up
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')
```

**Approximate aggregations for large datasets:**
```sql
-- Not built-in, but a random sample can estimate the count:
-- count the sampled rows and scale by 1 / sample rate
SELECT COUNT(*) * 10 as estimated_count
FROM table_name
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
  AND rand() < 0.1  -- 10% sample
```

### 7. Data Types and Casting

Hive type casting:
```sql
CAST(column_name AS BIGINT)
CAST(column_name AS STRING)
CAST(column_name AS DOUBLE)
CAST(column_name AS DECIMAL(10,2))
```
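
In Hive, a CAST that cannot convert a value (for example, a non-numeric string cast to BIGINT) yields NULL rather than an error, so it is often paired with COALESCE. The sketch below uses hypothetical string columns `amount_text` and `quantity_text`.

```sql
-- Hypothetical columns holding numbers as strings.
SELECT
  CAST(amount_text AS DECIMAL(10,2)) AS amount,            -- NULL when not numeric
  COALESCE(CAST(quantity_text AS BIGINT), 0) AS quantity   -- default to 0 instead of NULL
FROM database_name.purchases
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
```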

### 8. Window Functions

```sql
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq,
  LAG(event_time, 1) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
```

### 9. Array and Map Operations

**Array functions:**
```sql
SELECT
  array_contains(tags, 'premium') as is_premium,
  size(tags) as tag_count,
  tags[0] as first_tag
FROM user_profiles
```

**Map functions:**
```sql
SELECT
  map_keys(attributes) as attribute_names,
  map_values(attributes) as attribute_values,
  attributes['country'] as country
FROM events
```
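
When every key/value pair is needed as its own row, a map can be expanded with LATERAL VIEW and EXPLODE, which emits two columns per entry. This is a sketch under the assumption that `attributes` is a MAP column on a hypothetical `database_name.events` table.

```sql
-- Hypothetical table; attributes is assumed to be a MAP<STRING, STRING>.
SELECT
  user_id,
  attr_key,
  attr_value
FROM database_name.events
LATERAL VIEW EXPLODE(attributes) attr_table AS attr_key, attr_value
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
```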

## Common Patterns

### Daily Event Aggregation
```sql
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as event_date,
  event_type,
  COUNT(*) as event_count,
  COUNT(DISTINCT user_id) as unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
GROUP BY
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST'),
  event_type
ORDER BY event_date, event_type
```

### User Segmentation
```sql
SELECT
  CASE
    WHEN purchase_count >= 10 THEN 'high_value'
    WHEN purchase_count >= 5 THEN 'medium_value'
    ELSE 'low_value'
  END as segment,
  COUNT(*) as user_count,
  AVG(total_spend) as avg_spend
FROM (
  SELECT
    user_id,
    COUNT(*) as purchase_count,
    SUM(amount) as total_spend
  FROM database_name.purchases
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
  GROUP BY user_id
) user_stats
GROUP BY
  CASE
    WHEN purchase_count >= 10 THEN 'high_value'
    WHEN purchase_count >= 5 THEN 'medium_value'
    ELSE 'low_value'
  END
```

### Session Analysis
```sql
SELECT
  user_id,
  session_id,
  MIN(time) as session_start,
  MAX(time) as session_end,
  COUNT(*) as events_in_session
FROM (
  SELECT
    user_id,
    time,
    SUM(is_new_session) OVER (
      PARTITION BY user_id
      ORDER BY time
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as session_id
  FROM (
    SELECT
      user_id,
      time,
      CASE
        WHEN time - LAG(time) OVER (PARTITION BY user_id ORDER BY time) > 1800
          OR LAG(time) OVER (PARTITION BY user_id ORDER BY time) IS NULL
        THEN 1
        ELSE 0
      END as is_new_session
    FROM database_name.events
    WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
  ) with_session_flag
) with_session_id
GROUP BY user_id, session_id
```

### Cohort Analysis
```sql
WITH first_purchase AS (
  SELECT
    user_id,
    TD_TIME_FORMAT(MIN(time), 'yyyy-MM', 'JST') as cohort_month
  FROM database_name.purchases
  WHERE TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')
  GROUP BY user_id
),
monthly_purchases AS (
  SELECT
    user_id,
    TD_TIME_FORMAT(time, 'yyyy-MM', 'JST') as purchase_month,
    SUM(amount) as monthly_spend
  FROM database_name.purchases
  WHERE TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')
  GROUP BY user_id, TD_TIME_FORMAT(time, 'yyyy-MM', 'JST')
)
SELECT
  f.cohort_month,
  m.purchase_month,
  COUNT(DISTINCT m.user_id) as active_users,
  SUM(m.monthly_spend) as total_spend
FROM first_purchase f
JOIN monthly_purchases m ON f.user_id = m.user_id
GROUP BY f.cohort_month, m.purchase_month
ORDER BY f.cohort_month, m.purchase_month
```

## Hive-Specific Features

### SerDe (Serializer/Deserializer)

When working with JSON data:
```sql
-- Deserialization is usually handled automatically in TD, but awareness is important
-- For JSON stored as a string column, get_json_object extracts nested values by path
SELECT
  get_json_object(json_column, '$.user.id') as user_id,
  get_json_object(json_column, '$.event.type') as event_type
FROM raw_events
```

### LATERAL VIEW with EXPLODE

Flatten arrays:
```sql
SELECT
  user_id,
  tag
FROM user_profiles
LATERAL VIEW EXPLODE(tags) tags_table AS tag
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
```

Multiple LATERAL VIEWs:
```sql
SELECT
  user_id,
  tag,
  category
FROM user_profiles
LATERAL VIEW EXPLODE(tags) tags_table AS tag
LATERAL VIEW EXPLODE(categories) cat_table AS category
```

### Dynamic Partitioning

When creating tables (less common in TD):
```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION(dt)
SELECT *, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt
FROM source_table
-- Same timezone as the dt expression so partitions match the filtered range
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
```

## Error Handling

### Common Errors

**"FAILED: SemanticException Column time does not exist"**
- Check table schema
- Ensure table name is correct

**"OutOfMemoryError: Java heap space"**
- Reduce time range in query
- Use LIMIT for testing
- Optimize JOINs (use MAPJOIN hint for small tables)

**"Too many dynamic partitions"**
- Reduce partition count
- Check dynamic partition settings
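
The settings in question are, in Apache Hive, the two limits shown below. The values are illustrative only, and whether a TD environment allows overriding them is an assumption to verify before relying on this.

```sql
-- Apache Hive properties that cap dynamic partition creation.
-- Values are illustrative; overriding them may not be permitted everywhere.
SET hive.exec.max.dynamic.partitions = 2000;
SET hive.exec.max.dynamic.partitions.pernode = 500;
```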

**"Expression not in GROUP BY key"**
- All non-aggregated columns must be in GROUP BY
- Or use aggregate functions (MAX, MIN, etc.)
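
The two fixes can be sketched side by side. The table `database_name.events` and its columns are hypothetical, and the failing query is left as a comment so the block contains only valid statements.

```sql
-- Fails: event_type is neither aggregated nor listed in GROUP BY
--   SELECT user_id, event_type, COUNT(*) FROM database_name.events GROUP BY user_id

-- Fix 1: add the column to GROUP BY
SELECT user_id, event_type, COUNT(*) AS event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
GROUP BY user_id, event_type;

-- Fix 2: wrap the column in an aggregate
SELECT user_id, MAX(event_type) AS sample_event_type, COUNT(*) AS event_count
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
GROUP BY user_id;
```

Fix 1 changes the grain of the result to one row per user and event type, while Fix 2 keeps one row per user, so the choice depends on the output that is wanted.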

## Best Practices

1. **Always use time filters** with TD_TIME_RANGE or direct time comparisons
2. **Select only needed columns** to reduce I/O
3. **Use MAPJOIN hint** for small table joins
4. **Test on small time ranges** before full runs
5. **Use appropriate timezone** (JST for Japan data)
6. **Avoid `SELECT *`** in production queries
7. **Use CTEs (WITH clauses)** for complex queries
8. **Consider data volume** - Hive is batch-oriented
9. **Monitor query progress** in TD console
10. **Add comments** explaining business logic

## Migration Notes: Hive to Trino

When migrating from Hive to Trino:
- Most syntax is compatible
- Trino is generally faster for interactive queries
- Some Hive UDFs may need replacement
- Window function syntax is similar
- Approximate functions (APPROX_*) are more efficient in Trino
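
Two rewrites that commonly come up in such a migration are sketched below: array flattening and distinct counts. Table and column names are taken from the earlier examples and remain hypothetical. The Trino statements use standard Trino syntax (`CROSS JOIN UNNEST`, `approx_distinct`).

```sql
-- Hive: flatten an array with LATERAL VIEW
SELECT user_id, tag
FROM user_profiles
LATERAL VIEW EXPLODE(tags) tags_table AS tag;

-- Trino: the same result with UNNEST
SELECT user_id, tag
FROM user_profiles
CROSS JOIN UNNEST(tags) AS t(tag);

-- Hive: exact distinct count
SELECT COUNT(DISTINCT user_id) AS unique_users
FROM events;

-- Trino: approximate distinct count, usually cheaper on large inputs
SELECT approx_distinct(user_id) AS unique_users
FROM events;
```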

## Example Workflow

When helping users write Hive queries:

1. **Understand requirements** - What analysis is needed?
2. **Identify tables** - Which TD tables to query?
3. **Add time filters** - Always include TD_TIME_RANGE
4. **Write base query** - Start simple
5. **Add transformations** - Aggregations, JOINs, etc.
6. **Optimize** - Use MAPJOIN hints, select only needed columns
7. **Test** - Run on small dataset first
8. **Scale** - Extend to full time range
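
The steps above can be sketched for a hypothetical request, "total spend per user in January 2024", reusing the `database_name.purchases` table from the earlier patterns. The column names are assumptions.

```sql
-- Steps 3-4 and 7: simple base query with a time filter, one day, small LIMIT
SELECT user_id, amount
FROM database_name.purchases
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
LIMIT 100;

-- Steps 5-6 and 8: add the aggregation, keep only needed columns,
-- then widen the range to the full month (end_time is exclusive)
SELECT
  user_id,
  COUNT(*) AS purchase_count,
  SUM(amount) AS total_spend
FROM database_name.purchases
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-02-01', 'JST')
GROUP BY user_id;
```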

## Resources

- Hive documentation: https://cwiki.apache.org/confluence/display/Hive
- TD Hive functions: Check internal TD documentation
- Consider migrating to Trino for better performance