---
name: confluent-ksqldb
description: ksqlDB stream processing expert. Covers SQL-like queries on Kafka topics, stream and table concepts, joins, aggregations, windowing, materialized views, and real-time data transformations. Activates for ksqldb, ksql, stream processing, kafka sql, real-time analytics, windowing, stream joins, table joins, materialized views.
---

# Confluent ksqlDB Skill

Expert knowledge of ksqlDB - Confluent's event streaming database for building real-time applications with SQL-like queries on Kafka topics.

## What I Know

### Core Concepts

**Streams** (Unbounded, Append-Only):
- Represents immutable event sequences
- Every row is a new event
- Cannot be updated or deleted
- Example: Click events, sensor readings, transactions

**Tables** (Mutable, Latest State):
- Represents current state
- Updates override previous values (by key)
- Compacted topic under the hood
- Example: User profiles, product inventory, account balances

**Key Difference**:
```sql
-- STREAM: Every event is independent
INSERT INTO clicks_stream (user_id, page, timestamp)
VALUES (1, 'homepage', CURRENT_TIMESTAMP());
-- Creates NEW row

-- TABLE: Latest value wins (by key)
INSERT INTO users_table (user_id, name, email)
VALUES (1, 'John', 'john@example.com');
-- UPDATES existing row with user_id=1
```
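Both objects are declared over Kafka topics. A minimal DDL sketch for the two examples above (topic names, formats, and partition counts here are illustrative assumptions, not part of the original snippets):

```sql
-- STREAM over an append-only topic: every record is a new event
CREATE STREAM clicks_stream (
  user_id BIGINT KEY,
  page VARCHAR,
  `timestamp` BIGINT        -- epoch millis; quoted because TIMESTAMP is also a type keyword
) WITH (
  kafka_topic='clicks',     -- illustrative topic name
  value_format='JSON',
  partitions=6
);

-- TABLE over a changelog topic: the PRIMARY KEY defines "latest value wins"
CREATE TABLE users_table (
  user_id BIGINT PRIMARY KEY,
  name VARCHAR,
  email VARCHAR
) WITH (
  kafka_topic='users',      -- illustrative topic name
  value_format='JSON',
  partitions=6
);
```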
### Query Types

**1. Streaming Queries** (Continuous, Real-Time):
```sql
-- Filter events in real-time
SELECT user_id, page, timestamp
FROM clicks_stream
WHERE page = 'checkout'
EMIT CHANGES;

-- Transform on the fly
SELECT
  user_id,
  UPPER(page) AS page_upper,
  TIMESTAMPTOSTRING(timestamp, 'yyyy-MM-dd') AS date
FROM clicks_stream
EMIT CHANGES;
```

**2. Materialized Views** (Pre-Computed Tables):
```sql
-- Aggregate clicks per user (updates continuously)
CREATE TABLE user_click_counts AS
SELECT
  user_id,
  COUNT(*) AS click_count
FROM clicks_stream
GROUP BY user_id
EMIT CHANGES;

-- Query the table (instant results!)
SELECT * FROM user_click_counts WHERE user_id = 123;
```

**3. Pull Queries** (Point-in-Time Reads):
```sql
-- Query current state (like traditional SQL)
SELECT * FROM users_table WHERE user_id = 123;

-- No EMIT CHANGES = pull query (returns once)
```
## When to Use This Skill

Activate me when you need help with:
- ksqlDB syntax ("How to create ksqlDB stream?")
- Stream vs table concepts ("When to use stream vs table?")
- Joins ("Join stream with table")
- Aggregations ("Count events per user")
- Windowing ("Tumbling window aggregation")
- Real-time transformations ("Filter and enrich events")
- Materialized views ("Create pre-computed aggregates")

## Common Patterns

### Pattern 1: Filter Events

**Use Case**: Drop irrelevant events early

```sql
-- Create filtered stream
CREATE STREAM important_clicks AS
SELECT *
FROM clicks_stream
WHERE page IN ('checkout', 'payment', 'confirmation')
EMIT CHANGES;
```
### Pattern 2: Enrich Events (Stream-Table Join)

**Use Case**: Add user details to click events

```sql
-- Users table (current state)
CREATE TABLE users (
  user_id BIGINT PRIMARY KEY,
  name VARCHAR,
  email VARCHAR
) WITH (
  kafka_topic='users',
  value_format='AVRO'
);

-- Enrich clicks with user data
CREATE STREAM enriched_clicks AS
SELECT
  c.user_id,
  c.page,
  c.timestamp,
  u.name,
  u.email
FROM clicks_stream c
LEFT JOIN users u ON c.user_id = u.user_id
EMIT CHANGES;
```
### Pattern 3: Real-Time Aggregation

**Use Case**: Count events per user, per 5-minute window

```sql
CREATE TABLE user_clicks_per_5min AS
SELECT
  user_id,
  WINDOWSTART AS window_start,
  WINDOWEND AS window_end,
  COUNT(*) AS click_count
FROM clicks_stream
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY user_id
EMIT CHANGES;

-- Pull query on the windowed table: include the key and bound WINDOWSTART
-- with an ISO timestamp or epoch millis
SELECT * FROM user_clicks_per_5min
WHERE user_id = 123
  AND WINDOWSTART >= '2024-01-01T00:00:00';
```
### Pattern 4: Detect Anomalies

**Use Case**: Alert when user clicks >100 times in 1 minute

```sql
-- Aggregations produce a TABLE (one row per user per window), not a STREAM
CREATE TABLE high_click_alerts AS
SELECT
  user_id,
  COUNT(*) AS click_count
FROM clicks_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY user_id
HAVING COUNT(*) > 100
EMIT CHANGES;
```
### Pattern 5: Change Data Capture (CDC)

**Use Case**: Track changes to user table

```sql
-- Model the CDC topic (Debezium) as a stream of change events;
-- a derived STREAM must be built from a STREAM, not a TABLE
CREATE STREAM users_cdc (
  user_id BIGINT KEY,
  name VARCHAR,
  email VARCHAR,
  op VARCHAR -- INSERT, UPDATE, DELETE
) WITH (
  kafka_topic='mysql.users.cdc',
  value_format='AVRO'
);

-- Stream of changes only
CREATE STREAM user_changes AS
SELECT * FROM users_cdc
WHERE op IN ('UPDATE', 'DELETE')
EMIT CHANGES;
```
## Join Types

### 1. Stream-Stream Join

**Use Case**: Correlate related events within a time window

```sql
-- Join page views with clicks within 10 minutes
CREATE STREAM page_view_with_clicks AS
SELECT
  v.user_id,
  v.page AS viewed_page,
  c.page AS clicked_page
FROM page_views v
INNER JOIN clicks c WITHIN 10 MINUTES
  ON v.user_id = c.user_id
EMIT CHANGES;
```

**Window Types**:
- `WITHIN 10 MINUTES` - Events must be within 10 minutes of each other
- `GRACE PERIOD 5 MINUTES` - Late-arriving events accepted for 5 more minutes

Both clauses can be combined on a single join;
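a minimal sketch follows (the derived stream name is illustrative, and GRACE PERIOD on join windows assumes a reasonably recent ksqlDB version):

```sql
-- Stream-stream join with an explicit grace period for late events
CREATE STREAM page_view_with_clicks_graced AS
SELECT
  v.user_id,
  v.page AS viewed_page,
  c.page AS clicked_page
FROM page_views v
INNER JOIN clicks c
  WITHIN 10 MINUTES GRACE PERIOD 5 MINUTES
  ON v.user_id = c.user_id
EMIT CHANGES;
```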
### 2. Stream-Table Join

**Use Case**: Enrich events with current state

```sql
-- Add product details to order events
CREATE STREAM enriched_orders AS
SELECT
  o.order_id,
  o.product_id,
  p.product_name,
  p.price
FROM orders_stream o
LEFT JOIN products_table p ON o.product_id = p.product_id
EMIT CHANGES;
```
### 3. Table-Table Join

**Use Case**: Combine two tables (latest state)

```sql
-- Join users with their current cart
CREATE TABLE user_with_cart AS
SELECT
  u.user_id,
  u.name,
  c.cart_total
FROM users u
LEFT JOIN shopping_carts c ON u.user_id = c.user_id
EMIT CHANGES;
```
## Windowing Types

### Tumbling Window (Non-Overlapping)

**Use Case**: Aggregate per fixed time period

```sql
-- Count events every 5 minutes
SELECT
  user_id,
  COUNT(*) AS event_count
FROM events
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY user_id
EMIT CHANGES;

-- Windows: [0:00-0:05), [0:05-0:10), [0:10-0:15)
```

### Hopping Window (Overlapping)

**Use Case**: Moving average over time

```sql
-- Count events in 10-minute windows, advancing every 5 minutes
SELECT
  user_id,
  COUNT(*) AS event_count
FROM events
WINDOW HOPPING (SIZE 10 MINUTES, ADVANCE BY 5 MINUTES)
GROUP BY user_id
EMIT CHANGES;

-- Windows: [0:00-0:10), [0:05-0:15), [0:10-0:20)
```

### Session Window (Event-Based)

**Use Case**: Group events by user session (gap-based)

```sql
-- Session ends after 30 minutes of inactivity
SELECT
  user_id,
  COUNT(*) AS session_events
FROM events
WINDOW SESSION (30 MINUTES)
GROUP BY user_id
EMIT CHANGES;
```
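As a follow-up, session results can be materialized with their bounds, using the same WINDOWSTART/WINDOWEND columns shown in Pattern 3. A minimal sketch (the table name is illustrative):

```sql
-- Materialize per-session stats; WINDOWSTART/WINDOWEND give the session bounds
CREATE TABLE user_sessions AS
SELECT
  user_id,
  WINDOWSTART AS session_start,
  WINDOWEND AS session_end,
  COUNT(*) AS session_events
FROM events
WINDOW SESSION (30 MINUTES)
GROUP BY user_id
EMIT CHANGES;
```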
## Best Practices

### 1. Use Appropriate Data Types

✅ **DO**:
```sql
CREATE STREAM orders (
  order_id BIGINT,
  user_id BIGINT,
  total DECIMAL(10, 2), -- Precise currency
  timestamp TIMESTAMP
) WITH (
  kafka_topic='orders',   -- source topic and format are required (name illustrative)
  value_format='AVRO'
);
```

❌ **DON'T**:
```sql
-- WRONG: Using DOUBLE for currency (precision loss!)
total DOUBLE
```
### 2. Always Specify Keys

✅ **DO**:
```sql
CREATE TABLE users (
  user_id BIGINT PRIMARY KEY, -- Explicit key
  name VARCHAR
) WITH (kafka_topic='users', value_format='AVRO');
```

❌ **DON'T**:
```sql
-- WRONG: No key specified (a ksqlDB table requires a PRIMARY KEY, and joins need it)
CREATE TABLE users (
  user_id BIGINT,
  name VARCHAR
);
```
### 3. Use Windowing for Aggregations

✅ **DO**:
```sql
-- Windowed aggregation (bounded memory)
SELECT COUNT(*) FROM events
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY user_id
EMIT CHANGES;
```

❌ **DON'T**:
```sql
-- WRONG: Non-windowed aggregation (unbounded memory!)
SELECT COUNT(*) FROM events GROUP BY user_id;
```
### 4. Set Retention Policies

```sql
-- Limit table size (keep last 7 days)
CREATE TABLE user_stats (
  user_id BIGINT PRIMARY KEY,
  click_count BIGINT
) WITH (
  kafka_topic='user_stats',
  value_format='AVRO',       -- required unless a default format is configured
  retention_ms=604800000     -- 7 days
);
```
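For windowed aggregations, retention can also be set on the window itself. A minimal sketch, assuming the RETENTION and GRACE PERIOD window options (names and sizes are illustrative):

```sql
-- Keep only 7 days of windowed state for this aggregation
CREATE TABLE user_clicks_hourly AS
SELECT user_id, COUNT(*) AS click_count
FROM clicks_stream
WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 7 DAYS, GRACE PERIOD 10 MINUTES)
GROUP BY user_id
EMIT CHANGES;
```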
## Performance Optimization

### 1. Partition Alignment

**Ensure joined streams/tables have the same partition key**:

```sql
-- GOOD: Both keyed by user_id (co-partitioned)
CREATE STREAM clicks (user_id BIGINT KEY, ...)
CREATE TABLE users (user_id BIGINT PRIMARY KEY, ...)

-- Join works efficiently (no repartitioning)
SELECT * FROM clicks c
JOIN users u ON c.user_id = u.user_id
EMIT CHANGES;
```
### 2. Use Materialized Views

**Pre-compute expensive queries**:

```sql
-- BAD: Compute on every request
SELECT COUNT(*) FROM orders WHERE user_id = 123;

-- GOOD: Materialized table (instant lookup)
CREATE TABLE user_order_counts AS
SELECT user_id, COUNT(*) AS order_count
FROM orders GROUP BY user_id;

-- Query is now instant
SELECT order_count FROM user_order_counts WHERE user_id = 123;
```
### 3. Filter Early

```sql
-- GOOD: Filter before join
CREATE STREAM important_events AS
SELECT * FROM events WHERE event_type = 'purchase';

SELECT * FROM important_events e
JOIN users u ON e.user_id = u.user_id
EMIT CHANGES;

-- BAD: Join first, filter later (processes all events!)
SELECT * FROM events e
JOIN users u ON e.user_id = u.user_id
WHERE e.event_type = 'purchase'
EMIT CHANGES;
```
## Common Issues & Solutions
|
||||
|
||||
### Issue 1: Query Timing Out
|
||||
|
||||
**Error**: Query timed out
|
||||
|
||||
**Root Cause**: Non-windowed aggregation on large stream
|
||||
|
||||
**Solution**: Add time window:
|
||||
```sql
|
||||
-- WRONG
|
||||
SELECT COUNT(*) FROM events GROUP BY user_id;
|
||||
|
||||
-- RIGHT
|
||||
SELECT COUNT(*) FROM events
|
||||
WINDOW TUMBLING (SIZE 1 HOUR)
|
||||
GROUP BY user_id;
|
||||
```
|
||||
|
||||
### Issue 2: Partition Mismatch

**Error**: Cannot join streams (different partition keys)

**Solution**: Repartition the stream:
```sql
-- Repartition stream by user_id
CREATE STREAM clicks_by_user AS
SELECT * FROM clicks PARTITION BY user_id;

-- Now the join works
SELECT * FROM clicks_by_user c
JOIN users u ON c.user_id = u.user_id
EMIT CHANGES;
```
### Issue 3: Late-Arriving Events

**Solution**: Use a grace period:
```sql
SELECT COUNT(*) FROM events
WINDOW TUMBLING (SIZE 5 MINUTES, GRACE PERIOD 1 MINUTE)
GROUP BY user_id
EMIT CHANGES;
-- Accepts events up to 1 minute late
```
## References

- ksqlDB Documentation: https://docs.ksqldb.io/
- ksqlDB Tutorials: https://kafka-tutorials.confluent.io/
- Windowing Guide: https://docs.ksqldb.io/en/latest/concepts/time-and-windows-in-ksqldb-queries/
- Join Types: https://docs.ksqldb.io/en/latest/developer-guide/joins/

---

**Invoke me when you need stream processing, real-time analytics, or SQL-like queries on Kafka!**