---
name: pytd
description: Expert assistance for using pytd (Python SDK) to query and import data with Treasure Data. Use this skill when users need help with Python-based data analysis, querying Presto/Hive, importing pandas DataFrames, bulk data uploads, or integrating TD with Python analytical workflows.
---

# pytd - Treasure Data Python SDK

Expert assistance for querying and importing data to Treasure Data using pytd, the official Python driver for analytical workflows.

## When to Use This Skill

Use this skill when:
- Querying Treasure Data from Python scripts or Jupyter notebooks
- Importing pandas DataFrames to TD tables
- Running Presto or Hive queries from Python
- Building data pipelines with Python and TD
- Performing bulk data imports or exports
- Migrating from the deprecated pandas-td library
- Integrating TD with Python data science workflows
- Handling large result sets with iterative retrieval

## Core Principles

### 1. Installation

**Standard Installation:**
```bash
pip install pytd
```

**Requirements:**
- Python 3.9 or later
- pandas 2.0 or later
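
A quick way to confirm an environment meets both requirements (standard pip/Python commands, shown here as a convenience):

```bash
pip show pytd                                          # installed pytd version
python --version                                       # should report 3.9 or later
python -c "import pandas; print(pandas.__version__)"   # should report 2.x
```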
### 2. Authentication & Configuration

**Environment Variables (Recommended):**
```bash
export TD_API_KEY="your_api_key_here"
export TD_API_SERVER="https://api.treasuredata.com/"
```

**Client Initialization:**
```python
import pytd

# Using environment variables
client = pytd.Client(database='sample_datasets')

# Explicit credentials (not recommended for production)
client = pytd.Client(
    apikey='your_api_key',
    endpoint='https://api.treasuredata.com/',
    database='your_database',
    default_engine='presto'  # or 'hive'
)
```

**Configuration Options:**
- `apikey`: TD API key (read from `TD_API_KEY` env var if not specified)
- `endpoint`: TD API server URL (read from `TD_API_SERVER` env var)
- `database`: Default database name for queries
- `default_engine`: Query engine - `'presto'` (default) or `'hive'`

**Regional Endpoints:**
- US: `https://api.treasuredata.com/`
- Tokyo: `https://api.treasuredata.co.jp/`
- EU: `https://api.eu01.treasuredata.com/`
### 3. Querying Data

#### Basic Query Execution

```python
import pytd

client = pytd.Client(database='sample_datasets')

# Execute Presto query
result = client.query('SELECT symbol, COUNT(1) as cnt FROM nasdaq GROUP BY symbol LIMIT 10')

# Result format: {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ...]}
print(result['columns'])  # ['symbol', 'cnt']
print(result['data'])     # [['AAIT', 590], ['AAL', 82], ...]
```

#### Query with Hive Engine

```python
# Create Hive client
client_hive = pytd.Client(
    database='sample_datasets',
    default_engine='hive'
)

# Execute Hive query
result = client_hive.query('SELECT hivemall_version()')
```

#### Convert Results to pandas DataFrame

```python
import pandas as pd

result = client.query('SELECT * FROM nasdaq LIMIT 100')

# Convert to DataFrame
df = pd.DataFrame(result['data'], columns=result['columns'])
print(df.head())
```
### 4. Writing Data to TD

#### Load DataFrame to Table

```python
import pandas as pd
import pytd

# Create sample DataFrame
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'event_name': ['login', 'purchase', 'logout', 'login'],
    'amount': [None, 99.99, None, None],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'])
})

# Initialize client
client = pytd.Client(database='your_database')

# Upload DataFrame
client.load_table_from_dataframe(
    df,
    'events',                # table name
    writer='bulk_import',    # writer type
    if_exists='overwrite'    # or 'append', 'error', 'ignore'
)
```

**Parameters:**
- `df`: pandas DataFrame to upload
- `table`: Target table name (can be `'database.table'` or just `'table'`)
- `writer`: Import method - `'bulk_import'` (default), `'insert_into'`, or `'spark'`
- `if_exists`: What to do if table exists - `'error'` (default), `'overwrite'`, `'append'`, or `'ignore'`
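
Because `if_exists` defaults to `'error'`, loading into a table that already exists raises unless another mode is chosen. A minimal sketch of a common guard (the exact exception class varies by pytd version, so a broad catch is shown):

```python
try:
    # Default if_exists='error': raises if your_database.events already exists
    client.load_table_from_dataframe(df, 'events')
except Exception:
    # Fall back to appending to the existing table
    client.load_table_from_dataframe(df, 'events', if_exists='append')
```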
## Common Patterns

### Pattern 1: ETL Pipeline - Query, Transform, Load

```python
import numpy as np
import pandas as pd
import pytd

# Initialize client
client = pytd.Client(database='analytics')

# Step 1: Extract - Query data from TD
query = """
SELECT
    user_id,
    event_name,
    event_date,
    COUNT(*) as event_count
FROM raw_events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY user_id, event_name, event_date
"""

result = client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])

# Step 2: Transform - Process data with pandas
df['event_date'] = pd.to_datetime(df['event_date'])
df['is_weekend'] = df['event_date'].dt.dayofweek >= 5
df['event_count_log'] = np.log1p(df['event_count'])

# Add metadata
df['processed_at'] = pd.Timestamp.now()
df['pipeline_version'] = '1.0'

# Step 3: Load - Write back to TD
client.load_table_from_dataframe(
    df,
    'analytics.user_daily_events',
    writer='bulk_import',
    if_exists='append'
)

print(f"Loaded {len(df)} rows to user_daily_events")
```

**Explanation:** Complete ETL workflow that extracts yesterday's data, performs pandas transformations, and loads results back to TD. Uses bulk_import for efficient loading.
### Pattern 2: Incremental Data Loading

```python
import pytd
import pandas as pd
from datetime import datetime, timedelta

client = pytd.Client(database='sales')

def load_incremental_data(source_file, table_name, date_column='import_date'):
    """Load new data incrementally, avoiding duplicates"""

    # Read new data from source
    new_data = pd.read_csv(source_file)
    new_data[date_column] = datetime.now()

    # Get max date from existing table
    try:
        result = client.query(f"""
            SELECT MAX({date_column}) as max_date
            FROM {table_name}
        """)

        max_date = result['data'][0][0] if result['data'][0][0] else None

        if max_date:
            # The query result comes back as a string, so parse it before comparing
            max_date = pd.to_datetime(max_date)
            # Filter only new records
            new_data = new_data[new_data[date_column] > max_date]
            print(f"Loading {len(new_data)} new records after {max_date}")
        else:
            print(f"Table empty, loading all {len(new_data)} records")

    except Exception as e:
        # Table doesn't exist yet
        print(f"Creating new table with {len(new_data)} records")

    if len(new_data) > 0:
        client.load_table_from_dataframe(
            new_data,
            table_name,
            writer='bulk_import',
            if_exists='append'
        )
        print("Load complete")
    else:
        print("No new data to load")

# Usage
load_incremental_data('daily_sales.csv', 'sales.transactions')
```

**Explanation:** Implements incremental loading by checking the latest timestamp in the target table and only loading newer records. Handles first-time loads gracefully.
### Pattern 3: Large Result Set Processing with DB-API

```python
import pytd
from pytd.dbapi import connect

client = pytd.Client(database='large_dataset')

# Create DB-API connection
conn = connect(client)
cursor = conn.cursor()

# Execute query that might timeout with standard query()
cursor.execute("""
    SELECT user_id, event_name, event_time, properties
    FROM events
    WHERE TD_INTERVAL(time, '-7d', 'JST')
""")

# Process results iteratively (memory efficient)
batch_size = 10000
processed_count = 0

while True:
    rows = cursor.fetchmany(batch_size)
    if not rows:
        break

    # Process batch
    for row in rows:
        user_id, event_name, event_time, properties = row
        # Process each row
        process_event(user_id, event_name, event_time, properties)

    processed_count += len(rows)
    print(f"Processed {processed_count} rows...")

cursor.close()
conn.close()

print(f"Total processed: {processed_count} rows")
```

**Explanation:** Uses DB-API for iterative retrieval of large result sets. Prevents memory issues and query timeouts by fetching data in batches. Essential for processing millions of rows.
### Pattern 4: Multi-Database Operations

```python
import pytd
import pandas as pd

# Connect to different databases
source_client = pytd.Client(database='raw_data')
target_client = pytd.Client(database='analytics')

# Query from source database
query = """
SELECT
    customer_id,
    product_id,
    purchase_date,
    amount
FROM purchases
WHERE TD_INTERVAL(time, '-1d', 'JST')
"""

result = source_client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])

# Enrich data by querying another source
product_query = "SELECT product_id, product_name, category FROM products"
products_result = source_client.query(product_query)
products_df = pd.DataFrame(products_result['data'], columns=products_result['columns'])

# Join data
enriched_df = df.merge(products_df, on='product_id', how='left')

# Calculate metrics
daily_summary = enriched_df.groupby(['category', 'purchase_date']).agg({
    'amount': ['sum', 'mean', 'count'],
    'customer_id': 'nunique'
}).reset_index()

daily_summary.columns = ['category', 'date', 'total_sales', 'avg_sale', 'transaction_count', 'unique_customers']

# Write to analytics database
target_client.load_table_from_dataframe(
    daily_summary,
    'daily_category_sales',
    writer='bulk_import',
    if_exists='append'
)

print(f"Loaded {len(daily_summary)} rows to analytics.daily_category_sales")
```

**Explanation:** Demonstrates working with multiple databases, joining data, performing aggregations, and writing to a different target database.
### Pattern 5: Handling Time-based Data with TD Functions

```python
import pytd
import pandas as pd
from datetime import datetime

client = pytd.Client(database='events')

# Query with TD time functions
query = """
SELECT
    TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date_jst,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users,
    APPROX_PERCENTILE(session_duration, 0.5) as median_duration,
    APPROX_PERCENTILE(session_duration, 0.95) as p95_duration
FROM user_sessions
WHERE TD_INTERVAL(time, '-7d', 'JST')
GROUP BY 1
ORDER BY 1 DESC
"""

result = client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])

# Convert date strings to datetime
df['date_jst'] = pd.to_datetime(df['date_jst'])

# Add derived metrics
df['events_per_user'] = df['event_count'] / df['unique_users']

# Write summary back
client.load_table_from_dataframe(
    df,
    'weekly_session_summary',
    writer='bulk_import',
    if_exists='overwrite'
)
```

**Explanation:** Shows proper use of TD time functions (TD_INTERVAL, TD_TIME_FORMAT) in queries and how to handle the results in pandas.
## Writer Types Comparison

pytd supports three writer methods for loading data:

### 1. bulk_import (Default - Recommended)

**Best for:** Most use cases, especially large datasets

```python
client.load_table_from_dataframe(
    df,
    'table_name',
    writer='bulk_import',
    if_exists='append'
)
```

**Characteristics:**
- ✓ Scalable to large datasets
- ✓ Memory efficient (streams data)
- ✓ No special permissions required
- ✓ Best balance of performance and simplicity
- ✗ Slower than Spark for very large datasets
- Uses CSV format internally

**When to use:** Default choice for most data loads (100s of MB to GBs)

### 2. insert_into

**Best for:** Small datasets, real-time updates

```python
client.load_table_from_dataframe(
    df,
    'table_name',
    writer='insert_into',
    if_exists='append'
)
```

**Characteristics:**
- ✓ Simple, no dependencies
- ✓ Good for small datasets (<1000 rows)
- ✗ Not scalable (issues individual INSERT queries)
- ✗ Slow for large datasets
- ✗ Uses Presto query capacity
- Uses Presto INSERT INTO statements

**When to use:** Only for small datasets or when you need immediate writes without bulk import delay
### 3. spark (High Performance)

**Best for:** Very large datasets, high-performance pipelines

```python
from pytd.writer import SparkWriter

writer = SparkWriter(
    td_spark_path='/path/to/td-spark-assembly.jar'  # Optional
)

client.load_table_from_dataframe(
    df,
    'table_name',
    writer=writer,
    if_exists='append'
)
```

**Characteristics:**
- ✓ Highest performance
- ✓ Direct writes to Plazma storage
- ✓ Best for very large datasets (10s of GBs+)
- ✗ Requires `pytd[spark]` installation
- ✗ Requires Plazma Public API access (contact support)
- ✗ Additional dependencies

**When to use:** Large-scale data pipelines requiring maximum throughput

**Enabling Spark Writer:**
1. Install: `pip install pytd[spark]`
2. Contact `support@treasuredata.com` to enable Plazma Public API access
3. (Optional) Download td-spark JAR for custom versions
## Best Practices

1. **Use Environment Variables for Credentials**
   ```bash
   export TD_API_KEY="your_api_key"
   export TD_API_SERVER="https://api.treasuredata.com/"
   ```
   Never hardcode API keys in scripts.

2. **Choose the Right Writer**
   - `bulk_import`: Default choice for most scenarios
   - `insert_into`: Only for small datasets (<1000 rows)
   - `spark`: For very large datasets with proper setup

3. **Use TD Time Functions in Queries**
   ```python
   # Good: Uses partition pruning
   query = "SELECT * FROM table WHERE TD_INTERVAL(time, '-1d', 'JST')"

   # Avoid: Scans entire table
   query = "SELECT * FROM table WHERE date = '2024-01-01'"
   ```

4. **Handle Large Results with DB-API**
   Use `pytd.dbapi` for queries returning millions of rows to avoid memory issues.
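
   A condensed, hedged sketch of the batched retrieval shown in Pattern 3 (`handle_rows` is a placeholder for your own processing):
   ```python
   from pytd.dbapi import connect

   conn = connect(client)
   cursor = conn.cursor()
   cursor.execute("SELECT user_id, event_name FROM events WHERE TD_INTERVAL(time, '-7d', 'JST')")

   while True:
       rows = cursor.fetchmany(50000)   # pull a bounded batch at a time
       if not rows:
           break
       handle_rows(rows)                # placeholder: your own processing
   conn.close()
   ```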
5. **Specify Database in Table Name**
   ```python
   # Explicit database (recommended)
   client.load_table_from_dataframe(df, 'database.table')

   # Uses client's default database
   client.load_table_from_dataframe(df, 'table')
   ```

6. **Add Time Column for Partitioning**
   ```python
   df['time'] = pd.to_datetime(df['timestamp']).astype('int64') // 10**9
   client.load_table_from_dataframe(df, 'table')
   ```

7. **Use Presto for Analytics, Hive for Special Functions**
   - Presto: Faster for most analytical queries
   - Hive: Required for Hivemall, UDFs, some advanced features (see the sketch below)
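
   If a single script needs both engines, the per-query `engine` option shown under Custom Query Options below avoids building a second client. A small sketch, assuming that option accepts `'hive'` as well as `'presto'`:
   ```python
   daily = client.query("SELECT COUNT(1) AS cnt FROM events", engine='presto')
   hive_info = client.query("SELECT hivemall_version()", engine='hive')
   ```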
8. **Batch Processing for Large ETL**
   Process data in chunks to avoid memory issues:
   ```python
   for chunk in pd.read_csv('large_file.csv', chunksize=100000):
       # Process chunk
       client.load_table_from_dataframe(chunk, 'table', if_exists='append')
   ```

9. **Error Handling**
   ```python
   try:
       result = client.query(query)
   except Exception as e:
       print(f"Query failed: {e}")
       # Handle error appropriately
   ```

10. **Close Connections in Long-Running Scripts**
    ```python
    from pytd.dbapi import connect

    conn = connect(client)
    try:
        # Use connection
        cursor = conn.cursor()
        cursor.execute(query)
        # Process results
    finally:
        conn.close()
    ```
## Common Issues and Solutions

### Issue: Import Errors or Module Not Found

**Symptoms:**
- `ModuleNotFoundError: No module named 'pytd'`
- `ImportError: cannot import name 'SparkWriter'`

**Solutions:**
1. **Verify Installation**
   ```bash
   pip list | grep pytd
   ```

2. **Install/Upgrade pytd**
   ```bash
   pip install --upgrade pytd
   ```

3. **For Spark Support**
   ```bash
   pip install pytd[spark]
   ```

4. **Check Python Version**
   ```bash
   python --version  # Should be 3.9+
   ```
### Issue: Authentication Errors

**Symptoms:**
- `Unauthorized: Invalid API key`
- `403 Forbidden`

**Solutions:**
1. **Verify Environment Variables**
   ```bash
   echo $TD_API_KEY
   echo $TD_API_SERVER
   ```

2. **Check API Key Format**
   ```python
   # Verify API key is set correctly
   import os
   print(os.getenv('TD_API_KEY'))
   ```

3. **Verify Regional Endpoint**
   ```python
   # US
   endpoint = 'https://api.treasuredata.com/'
   # Tokyo
   endpoint = 'https://api.treasuredata.co.jp/'
   # EU
   endpoint = 'https://api.eu01.treasuredata.com/'
   ```

4. **Check API Key Permissions**
   - Ensure key has appropriate read/write permissions
   - Regenerate key if necessary from TD console
### Issue: Query Timeout or Memory Errors

**Symptoms:**
- Query times out after several minutes
- `MemoryError` when fetching large results
- Connection drops during query execution

**Solutions:**
1. **Use DB-API for Large Results**
   ```python
   from pytd.dbapi import connect

   conn = connect(client)
   cursor = conn.cursor()
   cursor.execute(query)

   # Fetch in batches until the cursor is exhausted
   while True:
       rows = cursor.fetchmany(10000)
       if not rows:
           break
       for row in rows:
           process(row)
   ```

2. **Add Time Filters for Partition Pruning**
   ```python
   query = """
   SELECT * FROM large_table
   WHERE TD_INTERVAL(time, '-1d', 'JST')  -- Add this!
   """
   ```

3. **Limit Result Size**
   ```python
   query = "SELECT * FROM table WHERE ... LIMIT 100000"
   ```

4. **Use Aggregations Instead of Raw Data**
   ```python
   # Instead of fetching all rows
   query = "SELECT * FROM table"

   # Aggregate first
   query = """
   SELECT date, user_id, COUNT(*) as cnt
   FROM table
   GROUP BY 1, 2
   """
   ```
### Issue: DataFrame Upload Fails

**Symptoms:**
- `ValueError: DataFrame is empty`
- Type errors during upload
- Data corruption in uploaded table

**Solutions:**
1. **Check DataFrame is Not Empty**
   ```python
   if df.empty:
       print("DataFrame is empty, skipping upload")
   else:
       client.load_table_from_dataframe(df, 'table')
   ```

2. **Handle Data Types Properly**
   ```python
   # Convert timestamps to Unix epoch
   df['time'] = pd.to_datetime(df['timestamp']).astype('int64') // 10**9

   # Handle NaN values
   df['amount'] = df['amount'].fillna(0)

   # Convert to appropriate types
   df['user_id'] = df['user_id'].astype(str)
   df['count'] = df['count'].astype(int)
   ```

3. **Check Column Names**
   ```python
   # TD column names should be lowercase and use underscores
   df.columns = df.columns.str.lower().str.replace(' ', '_')
   ```

4. **Remove Invalid Characters**
   ```python
   # Remove or replace problematic characters
   df = df.applymap(lambda x: str(x).replace('\x00', '') if isinstance(x, str) else x)
   ```

5. **Try Different Writer**
   ```python
   # If bulk_import fails, try insert_into for debugging
   client.load_table_from_dataframe(
       df.head(10),  # Test with small sample
       'table',
       writer='insert_into'
   )
   ```
### Issue: Spark Writer Not Working

**Symptoms:**
- `ImportError: Spark writer not available`
- Spark job fails
- Permission denied errors

**Solutions:**
1. **Install Spark Dependencies**
   ```bash
   pip install pytd[spark]
   ```

2. **Enable Plazma Public API**
   - Contact `support@treasuredata.com`
   - Request Plazma Public API access for your account

3. **Specify JAR Path (if needed)**
   ```python
   from pytd.writer import SparkWriter

   writer = SparkWriter(
       td_spark_path='/path/to/td-spark-assembly.jar'
   )
   ```

4. **Check Permissions**
   - Ensure API key has write access to target database
   - Verify Plazma access is enabled
## Advanced Topics

### Custom Query Options

```python
# Query with custom parameters
result = client.query(
    'SELECT * FROM table',
    engine='presto',
    priority=1,      # higher priority (range -2 to 2, default 0)
    retry_limit=3
)
```

### Working with Job Status

```python
# Start query asynchronously
job = client.query('SELECT COUNT(*) FROM large_table', wait=False)

# Check job status
print(f"Job ID: {job.job_id}")
print(f"Status: {job.status()}")

# Wait for completion
job.wait()

# Get results
if job.success():
    result = job.result()
else:
    print(f"Job failed: {job.error()}")
```
### Custom Writers

```python
from pytd.writer import BulkImportWriter

# Configure writer with custom options
writer = BulkImportWriter(
    chunk_size=10000,    # Rows per chunk
    time_column='time'   # Specify time column
)

client.load_table_from_dataframe(
    df,
    'table',
    writer=writer,
    if_exists='append'
)
```
### Migrating from pandas-td

If you have existing code using the deprecated `pandas-td` library:

**Before (pandas-td):**
```python
import pandas_td as td

con = td.connect(apikey='your_api_key', endpoint='https://api.treasuredata.com/')
df = td.read_td('SELECT * FROM sample_datasets.nasdaq', con)
```

**After (pytd):**
```python
import pytd.pandas_td as td

con = td.connect(apikey='your_api_key', endpoint='https://api.treasuredata.com/')
df = td.read_td('SELECT * FROM sample_datasets.nasdaq', con)
```

Or use the modern pytd API:
```python
import pytd
import pandas as pd

client = pytd.Client(database='sample_datasets')
result = client.query('SELECT * FROM nasdaq')
df = pd.DataFrame(result['data'], columns=result['columns'])
```
## Testing and Development

### Test Connection

```python
import pytd

try:
    client = pytd.Client(database='sample_datasets')
    result = client.query('SELECT 1 as test')
    print("Connection successful!")
    print(result)
except Exception as e:
    print(f"Connection failed: {e}")
```

### Verify Data Upload

```python
import pandas as pd
import pytd

# Create test data
test_df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'value': [100, 200, 300],
    'time': [1704067200, 1704153600, 1704240000]  # Unix timestamps
})

client = pytd.Client(database='test_db')

# Upload
print("Uploading test data...")
client.load_table_from_dataframe(
    test_df,
    'test_table',
    writer='bulk_import',
    if_exists='overwrite'
)

# Verify
print("Verifying upload...")
result = client.query('SELECT * FROM test_table ORDER BY id')
verify_df = pd.DataFrame(result['data'], columns=result['columns'])

print("\nUploaded data:")
print(verify_df)

# Check counts match
assert len(test_df) == len(verify_df), "Row count mismatch!"
print("\nVerification successful!")
```
### Performance Testing

```python
import pytd
import pandas as pd
import time

client = pytd.Client(database='test_db')

# Generate test data
df = pd.DataFrame({
    'id': range(100000),
    'value': range(100000),
    'time': int(time.time())
})

# Test bulk_import
start = time.time()
client.load_table_from_dataframe(df, 'perf_test_bulk', writer='bulk_import', if_exists='overwrite')
bulk_time = time.time() - start
print(f"bulk_import: {bulk_time:.2f}s for {len(df)} rows")

# Test insert_into (small sample only!)
small_df = df.head(100)
start = time.time()
client.load_table_from_dataframe(small_df, 'perf_test_insert', writer='insert_into', if_exists='overwrite')
insert_time = time.time() - start
print(f"insert_into: {insert_time:.2f}s for {len(small_df)} rows")
```
## Jupyter Notebook Integration

pytd works seamlessly with Jupyter notebooks:

```python
# Notebook cell 1: Setup
import pytd
import pandas as pd
import matplotlib.pyplot as plt

client = pytd.Client(database='analytics')

# Notebook cell 2: Query data
query = """
SELECT
    TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
    COUNT(*) as events
FROM user_events
WHERE TD_INTERVAL(time, '-30d', 'JST')
GROUP BY 1
ORDER BY 1
"""

result = client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])
df['date'] = pd.to_datetime(df['date'])

# Notebook cell 3: Visualize
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['events'])
plt.title('Daily Events - Last 30 Days')
plt.xlabel('Date')
plt.ylabel('Event Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Notebook cell 4: Write results back
summary = df.describe()
# Process and save summary back to TD if needed
```
## Resources

- **Documentation**: https://pytd-doc.readthedocs.io/
- **GitHub Repository**: https://github.com/treasure-data/pytd
- **PyPI Package**: https://pypi.org/project/pytd/
- **TD Python Guide**: https://docs.treasuredata.com/
- **Example Notebooks**: See GitHub repository for Google Colab examples

## Related Skills

- **trino**: Understanding Trino SQL syntax for queries in pytd
- **hive**: Using Hive-specific functions and syntax
- **digdag**: Orchestrating Python scripts using pytd in workflows
- **td-javascript-sdk**: Browser-based data collection (frontend) vs pytd (backend/analytics)

## Comparison with Other Tools

| Tool | Purpose | When to Use |
|------|---------|-------------|
| **pytd** | Full-featured Python driver | Analytics, data pipelines, pandas integration |
| **td-client-python** | Basic REST API wrapper | Simple CRUD, when pytd is too heavy |
| **pandas-td** (deprecated) | Legacy pandas integration | Don't use - migrate to pytd |
| **TD Toolbelt** | CLI tool | Command-line operations, shell scripts |

**Recommendation:** Use pytd for all Python-based analytical work and ETL pipelines. Use td-client-python only for basic REST API operations.
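
For reference, the same kind of one-off query through td-client-python looks roughly like the sketch below. This is an assumption based on that library's documented job interface rather than something covered by this skill, so check the td-client-python README for exact signatures:

```python
import tdclient

# tdclient works at the job level rather than returning pandas-friendly results
with tdclient.Client(apikey='your_api_key') as td:
    job = td.query('sample_datasets', 'SELECT COUNT(1) FROM nasdaq', type='presto')
    job.wait()                 # block until the job finishes
    for row in job.result():   # iterate over result rows
        print(row)
```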
---

*Last updated: 2025-01 | pytd version: Latest (Python 3.9+)*