gh-treasure-data-td-skills-…/skills/hive/SKILL.md

---
name: hive
description: Expert assistance for writing, analyzing, and optimizing Hive SQL queries for Treasure Data. Use this skill when users need help with Hive queries, MapReduce optimization, or legacy TD Hive workflows.
---

# Hive SQL Expert

Expert assistance for writing and optimizing Hive SQL queries for Treasure Data environments.

## When to Use This Skill

Use this skill when:
- Writing Hive SQL queries for TD
- Maintaining or updating legacy Hive workflows
- Optimizing Hive query performance
- Converting queries to/from Hive dialect
- Working with Hive-specific features (SerDes, UDFs, etc.)

## Core Principles

### 1. TD Table Access

Access TD tables using database.table notation:
```sql
SELECT * FROM database_name.table_name
```

### 2. Time-based Partitioning

TD Hive tables are partitioned by time. Always use time predicates:

```sql
SELECT *
FROM database_name.table_name
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
```

Unix timestamp format:
```sql
WHERE time >= unix_timestamp('2024-01-01 00:00:00')
  AND time < unix_timestamp('2024-01-02 00:00:00')
```

### 3. Performance Optimization

**Use columnar formats:**
- TD tables are typically stored in columnar format (ORC/Parquet)
- Select only needed columns to reduce I/O

**Partition pruning:**
```sql
-- Good: Uses partition columns
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')

-- Good: Direct time filter
WHERE time >= 1704067200 AND time < 1704153600
```

**Limit during development:**
```sql
SELECT * FROM table_name
WHERE TD_TIME_RANGE(time, '2024-01-01')
LIMIT 1000
```

### 4. Common TD Hive Functions

**TD_INTERVAL** - Simplified relative time filtering (Recommended):
```sql
-- Current day
WHERE TD_INTERVAL(time, '1d', 'JST')

-- Yesterday
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Previous week
WHERE TD_INTERVAL(time, '-1w', 'JST')

-- Previous month
WHERE TD_INTERVAL(time, '-1M', 'JST')

-- 2 days ago (offset syntax)
WHERE TD_INTERVAL(time, '-1d/-1d', 'JST')

-- 3 months ago (combined offset)
WHERE TD_INTERVAL(time, '-1M/-2M', 'JST')
```

**Note:** TD_INTERVAL simplifies relative time queries and is preferred over combining TD_TIME_RANGE with TD_DATE_TRUNC. Cannot accept TD_SCHEDULED_TIME as first argument, but including TD_SCHEDULED_TIME elsewhere in the query establishes the reference date.

**TD_TIME_RANGE** - Partition-aware time filtering (explicit dates):
```sql
TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')  -- Open-ended
```

**TD_SCHEDULED_TIME()** - Get workflow execution time:
```sql
SELECT TD_SCHEDULED_TIME()
-- Returns Unix timestamp of scheduled run
```

**TD_TIME_FORMAT** - Format Unix timestamps:
```sql
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST')
```

**TD_TIME_PARSE** - Parse string to Unix timestamp:
```sql
SELECT TD_TIME_PARSE('2024-01-01', 'JST')
```

**TD_DATE_TRUNC** - Truncate timestamp to day/hour/etc:
```sql
SELECT TD_DATE_TRUNC('day', time, 'JST')
SELECT TD_DATE_TRUNC('hour', time, 'UTC')
```

### 5. JOIN Optimization

**MapReduce JOIN strategies:**

```sql
-- Map-side JOIN for small tables (use /*+ MAPJOIN */ hint)
SELECT /*+ MAPJOIN(small_table) */
  l.*,
  s.attribute
FROM large_table l
JOIN small_table s ON l.id = s.id
WHERE TD_TIME_RANGE(l.time, '2024-01-01')
```

**Reduce-side JOIN:**
```sql
-- Default for large-to-large joins
SELECT *
FROM table1 t1
JOIN table2 t2 ON t1.key = t2.key
WHERE TD_TIME_RANGE(t1.time, '2024-01-01')
  AND TD_TIME_RANGE(t2.time, '2024-01-01')
```

### 6. Aggregations

**Standard aggregations:**
```sql
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
  COUNT(*) as total_count,
  COUNT(DISTINCT user_id) as unique_users,
  AVG(value) as avg_value,
  SUM(amount) as total_amount
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')
```

**Approximate aggregations for large datasets:**
```sql
-- Not built-in, but can use sampling
SELECT COUNT(*) * 10 as estimated_count
FROM table_name
WHERE TD_TIME_RANGE(time, '2024-01-01')
  AND rand() < 0.1  -- 10% sample
```

### 7. Data Types and Casting

Hive type casting:
```sql
CAST(column_name AS BIGINT)
CAST(column_name AS STRING)
CAST(column_name AS DOUBLE)
CAST(column_name AS DECIMAL(10,2))
```

### 8. Window Functions

```sql
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as event_seq,
  LAG(event_time, 1) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01')
```

### 9. Array and Map Operations

**Array functions:**
```sql
SELECT
  array_contains(tags, 'premium') as is_premium,
  size(tags) as tag_count,
  tags[0] as first_tag
FROM user_profiles
```

**Map functions:**
```sql
SELECT
  map_keys(attributes) as attribute_names,
  map_values(attributes) as attribute_values,
  attributes['country'] as country
FROM events
```

## Common Patterns

### Daily Event Aggregation
```sql
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
  event_type,
  COUNT(*) as event_count,
  COUNT(DISTINCT user_id) as unique_users
FROM database_name.events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
GROUP BY
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST'),
  event_type
ORDER BY date, event_type
```

### User Segmentation
```sql
SELECT
  CASE
    WHEN purchase_count >= 10 THEN 'high_value'
    WHEN purchase_count >= 5 THEN 'medium_value'
    ELSE 'low_value'
  END as segment,
  COUNT(*) as user_count,
  AVG(total_spend) as avg_spend
FROM (
  SELECT
    user_id,
    COUNT(*) as purchase_count,
    SUM(amount) as total_spend
  FROM database_name.purchases
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
  GROUP BY user_id
) user_stats
GROUP BY
  CASE
    WHEN purchase_count >= 10 THEN 'high_value'
    WHEN purchase_count >= 5 THEN 'medium_value'
    ELSE 'low_value'
  END
```

### Session Analysis
```sql
SELECT
  user_id,
  session_id,
  MIN(time) as session_start,
  MAX(time) as session_end,
  COUNT(*) as events_in_session
FROM (
  SELECT
    user_id,
    time,
    SUM(is_new_session) OVER (
      PARTITION BY user_id
      ORDER BY time
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as session_id
  FROM (
    SELECT
      user_id,
      time,
      CASE
        WHEN time - LAG(time) OVER (PARTITION BY user_id ORDER BY time) > 1800
          OR LAG(time) OVER (PARTITION BY user_id ORDER BY time) IS NULL
        THEN 1
        ELSE 0
      END as is_new_session
    FROM database_name.events
    WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
  ) with_session_flag
) with_session_id
GROUP BY user_id, session_id
```

### Cohort Analysis
```sql
WITH first_purchase AS (
  SELECT
    user_id,
    TD_TIME_FORMAT(MIN(time), 'yyyy-MM', 'JST') as cohort_month
  FROM database_name.purchases
  WHERE TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')
  GROUP BY user_id
),
monthly_purchases AS (
  SELECT
    user_id,
    TD_TIME_FORMAT(time, 'yyyy-MM', 'JST') as purchase_month,
    SUM(amount) as monthly_spend
  FROM database_name.purchases
  WHERE TD_TIME_RANGE(time, '2024-01-01', NULL, 'JST')
  GROUP BY user_id, TD_TIME_FORMAT(time, 'yyyy-MM', 'JST')
)
SELECT
  f.cohort_month,
  m.purchase_month,
  COUNT(DISTINCT m.user_id) as active_users,
  SUM(m.monthly_spend) as total_spend
FROM first_purchase f
JOIN monthly_purchases m ON f.user_id = m.user_id
GROUP BY f.cohort_month, m.purchase_month
ORDER BY f.cohort_month, m.purchase_month
```

## Hive-Specific Features

### SerDe (Serializer/Deserializer)

When working with JSON data:
```sql
-- Usually handled automatically in TD, but awareness is important
-- JSON SerDe allows querying nested JSON structures
SELECT
  get_json_object(json_column, '$.user.id') as user_id,
  get_json_object(json_column, '$.event.type') as event_type
FROM raw_events
```

### LATERAL VIEW with EXPLODE

Flatten arrays:
```sql
SELECT
  user_id,
  tag
FROM user_profiles
LATERAL VIEW EXPLODE(tags) tags_table AS tag
WHERE TD_TIME_RANGE(time, '2024-01-01')
```

Multiple LATERAL VIEWs:
```sql
SELECT
  user_id,
  tag,
  category
FROM user_profiles
LATERAL VIEW EXPLODE(tags) tags_table AS tag
LATERAL VIEW EXPLODE(categories) cat_table AS category
```

### Dynamic Partitioning

When creating tables (less common in TD):
```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION(dt)
SELECT *, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt
FROM source_table
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
```

## Error Handling

### Common Errors

**"FAILED: SemanticException Column time does not exist"**
- Check table schema
- Ensure table name is correct

**"OutOfMemoryError: Java heap space"**
- Reduce time range in query
- Use LIMIT for testing
- Optimize JOINs (use MAPJOIN hint for small tables)

**"Too many dynamic partitions"**
- Reduce partition count
- Check dynamic partition settings

**"Expression not in GROUP BY key"**
- All non-aggregated columns must be in GROUP BY
- Or use aggregate functions (MAX, MIN, etc.)

## Best Practices

1. **Always use time filters** with TD_TIME_RANGE or direct time comparisons
2. **Select only needed columns** to reduce I/O
3. **Use MAPJOIN hint** for small table joins
4. **Test on small time ranges** before full runs
5. **Use appropriate timezone** (JST for Japan data)
6. **Avoid SELECT *** in production queries
7. **Use CTEs (WITH clauses)** for complex queries
8. **Consider data volume** - Hive is batch-oriented
9. **Monitor query progress** in TD console
10. **Add comments** explaining business logic

## Migration Notes: Hive to Trino

When migrating from Hive to Trino:
- Most syntax is compatible
- Trino is generally faster for interactive queries
- Some Hive UDFs may need replacement
- Window functions syntax is similar
- Approximate functions (APPROX_*) are more efficient in Trino

## Example Workflow

When helping users write Hive queries:

1. **Understand requirements** - What analysis is needed?
2. **Identify tables** - Which TD tables to query?
3. **Add time filters** - Always include TD_TIME_RANGE
4. **Write base query** - Start simple
5. **Add transformations** - Aggregations, JOINs, etc.
6. **Optimize** - Use MAPJOIN hints, select only needed columns
7. **Test** - Run on small dataset first
8. **Scale** - Extend to full time range

## Resources

- Hive documentation: https://cwiki.apache.org/confluence/display/Hive
- TD Hive functions: Check internal TD documentation
- Consider migrating to Trino for better performance