Initial commit
13
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,13 @@
{
  "name": "sdk-skills",
  "description": "Skills for using Treasure Data SDKs including JavaScript SDK for browser-based event tracking and pytd (Python SDK) for querying data, importing pandas DataFrames, and building analytical pipelines",
  "version": "0.0.0-2025.11.28",
  "author": {
    "name": "Treasure Data",
    "email": "support@treasuredata.com"
  },
  "skills": [
    "./skills/javascript",
    "./skills/python"
  ]
}
3
README.md
Normal file
@@ -0,0 +1,3 @@
# sdk-skills

Skills for using Treasure Data SDKs including JavaScript SDK for browser-based event tracking and pytd (Python SDK) for querying data, importing pandas DataFrames, and building analytical pipelines
48
plugin.lock.json
Normal file
@@ -0,0 +1,48 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:treasure-data/td-skills:sdk-skills",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "e0c11766a95ad8f5fbf9df27ab4f0a9ab68985b3",
    "treeHash": "c548802f4867e35e36431653df8a2b46750d6364492c7713d2be881faba1086b",
    "generatedAt": "2025-11-28T10:28:46.415607Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "sdk-skills",
    "description": "Skills for using Treasure Data SDKs including JavaScript SDK for browser-based event tracking and pytd (Python SDK) for querying data, importing pandas DataFrames, and building analytical pipelines"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "bd716c3683b3a461922279d012c0f015624c18fe877f142dd2ccd656f685ae5f"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "4f326d7357917a928996ba42c317f01a592ef0e42e8a8b3b96b5e5911e4a8c1e"
      },
      {
        "path": "skills/python/SKILL.md",
        "sha256": "f4dd6da826f89bc3d9323d3d6c4b273a8d850f579dc21fdcb837d9a9cbd0ffb7"
      },
      {
        "path": "skills/javascript/SKILL.md",
        "sha256": "150353b2e36ab029654a8eaf692d45b4d41dd062b6f9dd02360ab1f87dcb68f2"
      }
    ],
    "dirSha256": "c548802f4867e35e36431653df8a2b46750d6364492c7713d2be881faba1086b"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}
741
skills/javascript/SKILL.md
Normal file
@@ -0,0 +1,741 @@
---
name: td-javascript-sdk
description: Expert assistance for importing data to Treasure Data using the JavaScript SDK. Use this skill when users need help with browser-based event tracking, page analytics, client-side data collection, or implementing TD's JS SDK for web applications.
---

# Treasure Data JavaScript SDK

Expert assistance for implementing client-side data collection and event tracking with the Treasure Data JavaScript SDK.

## When to Use This Skill

Use this skill when:
- Implementing browser-based event tracking for web applications
- Setting up page view analytics and user behavior tracking
- Collecting client-side data (clicks, form submissions, user interactions)
- Integrating TD data collection into JavaScript/frontend applications
- Migrating from other analytics platforms to TD's event tracking
- Troubleshooting JS SDK configuration or data import issues

## Core Principles

### 1. Installation Methods

**Script Loader (Recommended for Most Cases):**
```html
<script type="text/javascript">
!function(t,e){if(void 0===e[t]){e[t]=function(){e[t].clients.push(this),this._init=[Array.prototype.slice.call(arguments)]},e[t].clients=[];for(var r=function(t){return function(){return this["_"+t]=this["_"+t]||[],this["_"+t].push(Array.prototype.slice.call(arguments)),this}},n=["addRecord","blockEvents","fetchServerCookie","fetchGlobalID","fetchUserSegments","resetUUID","ready","setSignedMode","setAnonymousMode","set","trackEvent","trackPageview","trackClicks","unblockEvents"],s=0;s<n.length;s++){var c=n[s];e[t].prototype[c]=r(c)}var o=document.createElement("script");o.type="text/javascript",o.async=!0,o.src=("https:"===document.location.protocol?"https:":"http:")+"//cdn.treasuredata.com/sdk/2.5/td.min.js";var i=document.getElementsByTagName("script")[0];i.parentNode.insertBefore(o,i)}}("Treasure",this);
</script>
```

**NPM Package (For Bundlers like Webpack/Browserify):**
```bash
npm install --save td-js-sdk
```

```javascript
import Treasure from 'td-js-sdk';
```

**Important:** The SDK is **browser-only** and does not work with Node.js. For server-side data import, use the REST API or other SDKs.

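For bundler-based projects, a common setup is a small analytics module that instantiates the SDK once and exports the instance. This is a minimal sketch, not from the official docs; the module name `analytics.js`, the database name, and the write key are placeholders.

```javascript
// analytics.js - shared SDK instance for a bundled browser app (illustrative sketch)
import Treasure from 'td-js-sdk';

const treasure = new Treasure({
  database: 'your_database_name',        // placeholder
  writeKey: 'your_write_only_api_key',   // placeholder: always a write-only key
  development: process.env.NODE_ENV !== 'production' // console logging outside production
});

export default treasure;

// elsewhere in the app:
// import treasure from './analytics';
// treasure.trackPageview('pageviews');
```
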
### 2. SDK Initialization

Initialize the SDK with your database name and write-only API key:

```javascript
var treasure = new Treasure({
  database: 'your_database_name',
  writeKey: 'your_write_only_api_key',
  startInSignedMode: false // Start in anonymous mode (default)
});
```

**Configuration Options:**
- `database` (required): TD database name
- `writeKey` (required): Write-only API key from TD console
- `clientId`: Custom UUID for client identification (auto-generated if not provided)
- `startInSignedMode`: `false` for anonymous mode (default), `true` to include PII
- `host`: Custom ingestion endpoint (advanced use cases)
- `development`: Set to `true` to enable console logging for debugging

**Obtaining API Keys:**
1. Log in to Treasure Data console
2. Navigate to your profile settings
3. Generate a write-only API key
4. Never use master or read-write keys in client-side code

### 3. Data Import Methods

#### addRecord() - Custom Event Data

Send custom data objects to specified tables:

```javascript
// Basic usage
treasure.addRecord('events', {
  event_type: 'purchase',
  product_id: 'SKU-12345',
  amount: 99.99,
  currency: 'USD',
  user_id: 'user_abc123'
});

// With callback
treasure.addRecord('user_actions', {
  action: 'form_submit',
  form_id: 'newsletter_signup',
  success: true
}, function(response) {
  console.log('Data sent successfully', response);
});
```

#### trackPageview() - Page View Tracking

Track page impressions with automatic context data:

```javascript
// Track to 'pageviews' table
treasure.trackPageview('pageviews');

// Track with custom properties
treasure.trackPageview('pageviews', {
  category: 'product_page',
  product_id: 'SKU-12345'
});
```

**Automatic Context Data Included:**
- Page URL, title, referrer, host, path
- Screen resolution, viewport dimensions, color depth
- Browser language, user agent, platform
- TD client ID, SDK version
- Timestamp

#### trackEvent() - Custom Event Tracking

Track custom events with context:

```javascript
// Basic event
treasure.trackEvent('button_click', {
  button_id: 'cta_signup',
  location: 'hero_section'
});

// Complex event with nested data
treasure.trackEvent('video_interaction', {
  video_id: 'intro_video_v2',
  action: 'play',
  timestamp_seconds: 45,
  playback_rate: 1.0,
  quality: '1080p'
});
```

## Common Patterns

### Pattern 1: E-commerce Tracking

```javascript
// Initialize SDK
var treasure = new Treasure({
  database: 'ecommerce_analytics',
  writeKey: 'your_write_only_key'
});

// Set global context (user session data)
treasure.set('$global', {
  user_id: getCurrentUserId(),
  session_id: getSessionId(),
  environment: 'production'
});

// Track product views
function trackProductView(product) {
  treasure.trackEvent('product_view', {
    product_id: product.id,
    product_name: product.name,
    category: product.category,
    price: product.price,
    currency: 'USD'
  });
}

// Track add to cart
function trackAddToCart(product, quantity) {
  treasure.trackEvent('add_to_cart', {
    product_id: product.id,
    quantity: quantity,
    price: product.price,
    total_value: product.price * quantity
  });
}

// Track purchase completion
function trackPurchase(order) {
  treasure.addRecord('purchases', {
    order_id: order.id,
    total_amount: order.total,
    currency: 'USD',
    items: order.items,
    payment_method: order.payment_method,
    shipping_address: order.shipping.country
  });
}
```

**Explanation:** This pattern sets up comprehensive e-commerce tracking with global context shared across all events, ensuring consistent user and session identification.

### Pattern 2: Form Tracking with Error Handling

```javascript
// Form submission tracking
var formStarted = false;
var formSubmitted = false;

document.getElementById('signup-form').addEventListener('submit', function(e) {
  e.preventDefault();
  formSubmitted = true;

  var formData = {
    form_id: 'user_signup',
    email: this.email.value,
    plan: this.plan.value,
    referral_source: document.referrer
  };

  // Send to TD before form submission
  treasure.addRecord('form_submissions', formData, function(error, response) {
    if (error) {
      console.error('TD tracking failed:', error);
      // Continue with form submission even if tracking fails
    }

    // Proceed with actual form submission
    submitForm(formData);
  });
});

// Form abandonment tracking
document.querySelectorAll('form input').forEach(function(input) {
  input.addEventListener('focus', function() {
    if (!formStarted) {
      formStarted = true;
      treasure.trackEvent('form_started', {
        form_id: 'user_signup'
      });
    }
  });
});

window.addEventListener('beforeunload', function() {
  if (formStarted && !formSubmitted) {
    treasure.trackEvent('form_abandoned', {
      form_id: 'user_signup',
      fields_completed: getCompletedFieldCount()
    });
  }
});
```

**Explanation:** Tracks form interactions including starts, submissions, and abandonments. Uses callbacks to ensure tracking doesn't block user experience.

### Pattern 3: User Session Tracking with Privacy Controls

```javascript
var treasure = new Treasure({
  database: 'user_analytics',
  writeKey: 'your_write_only_key',
  startInSignedMode: false // Start anonymous
});

// Check user consent
function initializeTracking() {
  var hasConsent = checkUserConsent();

  if (hasConsent) {
    // User consented, enable full tracking
    treasure.setSignedMode();
    treasure.unblockEvents();

    treasure.set('$global', {
      user_id: getUserId(),
      consent_given: true,
      consent_date: new Date().toISOString()
    });
  } else {
    // User declined, use anonymous mode
    treasure.setAnonymousMode();
    treasure.blockEvents(); // Or collect minimal data
  }
}

// Update when consent changes
function onConsentGranted() {
  treasure.setSignedMode();
  treasure.unblockEvents();
  treasure.resetUUID(); // Generate new client ID

  treasure.trackEvent('consent_granted', {
    timestamp: new Date().toISOString()
  });
}

function onConsentRevoked() {
  treasure.trackEvent('consent_revoked', {
    timestamp: new Date().toISOString()
  });

  treasure.setAnonymousMode();
  treasure.blockEvents();
}
```

**Explanation:** Implements GDPR/privacy-compliant tracking with consent management. Starts in anonymous mode and only enables full tracking after user consent.

### Pattern 4: Single Page Application (SPA) Tracking

```javascript
// Initialize once
var treasure = new Treasure({
  database: 'spa_analytics',
  writeKey: 'your_write_only_key'
});

// Track route changes (example with vanilla JS)
var currentPage = window.location.pathname;

function trackPageChange() {
  var newPage = window.location.pathname;

  if (newPage !== currentPage) {
    // Track page view
    treasure.trackPageview('pageviews', {
      previous_page: currentPage,
      navigation_type: 'spa_route_change'
    });

    currentPage = newPage;
  }
}

// Listen for history changes
window.addEventListener('popstate', trackPageChange);

// Override pushState and replaceState
var pushState = history.pushState;
history.pushState = function() {
  pushState.apply(history, arguments);
  trackPageChange();
};

var replaceState = history.replaceState;
history.replaceState = function() {
  replaceState.apply(history, arguments);
  trackPageChange();
};

// Track time on page
var pageStartTime = Date.now();

window.addEventListener('beforeunload', function() {
  treasure.trackEvent('page_engagement', {
    page: window.location.pathname,
    time_spent_seconds: Math.round((Date.now() - pageStartTime) / 1000)
  });
});
```

**Explanation:** Handles SPA routing by intercepting navigation events and tracking virtual page views. Includes time-on-page metrics for engagement analysis.

## Best Practices

1. **Use Write-Only API Keys** - Never expose master or read-write keys in client-side code. Generate write-only keys specifically for JS SDK use.

2. **Set Global Defaults** - Use `treasure.set('$global', {...})` for properties that apply to all events (user_id, environment, app_version); see the sketch after this list.

3. **Table-Level Defaults** - Set common properties per table: `treasure.set('table_name', 'property', 'value')`.

4. **Asynchronous Loading** - Use the async loader script to avoid blocking page rendering.

5. **Privacy by Default** - Start in anonymous mode and only enable signed mode after obtaining user consent.

6. **Validate Data Client-Side** - Check data types and required fields before sending to avoid ingestion errors.

7. **Use Callbacks for Critical Events** - For important events (purchases, signups), use callbacks to ensure data is sent before navigation.

8. **Include Context** - Add contextual information (page section, feature version, A/B test variant) to events for richer analysis.

9. **Batch Related Events** - The SDK handles batching internally, but group related `addRecord` calls together in code for clarity.

10. **Monitor Console in Development** - Use `development: true` config option during testing to see SDK activity.

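To make points 2 and 3 concrete, the sketch below combines global and table-level defaults. It only uses the `set()` calls described above; the property names and values are illustrative placeholders.

```javascript
// Global defaults: merged into every event sent to any table
treasure.set('$global', {
  user_id: 'user_abc123',     // placeholder values
  environment: 'production',
  app_version: '3.2.1'
});

// Table-level default: merged only into events written to 'pageviews'
treasure.set('pageviews', 'page_section', 'checkout');

// Subsequent calls inherit the defaults automatically
treasure.trackPageview('pageviews');
```
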
## Common Issues and Solutions

### Issue: Events Not Appearing in TD

**Symptoms:**
- Data sent from browser but not visible in TD console
- No errors in browser console

**Solutions:**
1. **Check API Key Permissions**
   - Verify you're using a write-only key
   - Ensure key has write access to the specified database
   - Check key hasn't been revoked

2. **Verify Database Name**
   ```javascript
   // Incorrect: hyphens and other special characters
   database: 'my-database' // May fail

   // Correct: use valid database names
   database: 'my_database' // Works
   ```

3. **Check Browser Network Tab**
   - Look for requests to `in.treasuredata.com`
   - Verify 200 OK responses
   - Check for CORS errors (rare, but possible with custom configurations)

4. **Data Delay**
   - Browser SDK uses streaming ingestion
   - Data may take 1-5 minutes to appear in TD console
   - For v4.0+ SDK, data should appear within ~1 minute

### Issue: CORS Errors

**Symptoms:**
- Browser console shows CORS policy errors
- Requests to TD endpoints blocked

**Solutions:**
1. Ensure using official TD CDN URL for SDK
2. Check custom `host` configuration if set
3. Verify SSL/HTTPS configuration matches your site

**Example Fix:**
```javascript
// Don't customize host unless necessary
var treasure = new Treasure({
  database: 'your_database',
  writeKey: 'your_key'
  // Remove custom 'host' setting
});
```

### Issue: Data Not Matching Expected Schema

**Symptoms:**
- Fields have wrong data types in TD
- Nested objects not properly stored

**Solutions:**
1. **Flatten Complex Objects**
   ```javascript
   // Problematic: deeply nested
   treasure.addRecord('events', {
     user: {
       profile: {
         name: 'John',
         age: 30
       }
     }
   });

   // Better: flattened structure
   treasure.addRecord('events', {
     user_name: 'John',
     user_age: 30
   });
   ```

2. **Consistent Data Types**
   ```javascript
   // Ensure consistent types across events
   treasure.addRecord('events', {
     user_id: String(userId), // Always string
     amount: parseFloat(amount), // Always number
     timestamp: new Date().toISOString() // Always ISO string
   });
   ```

3. **Handle Null/Undefined Values**
   ```javascript
   function sendEvent(data) {
     // Remove undefined/null values
     var cleanData = Object.keys(data).reduce(function(acc, key) {
       if (data[key] != null) {
         acc[key] = data[key];
       }
       return acc;
     }, {});

     treasure.addRecord('events', cleanData);
   }
   ```

### Issue: SDK Not Loading

**Symptoms:**
- `Treasure is not defined` errors
- SDK script fails to load

**Solutions:**
1. **Check Script Placement**
   ```html
   <!-- Place in <head> before other scripts that use it -->
   <head>
     <script type="text/javascript">
       /* Treasure loader script */
     </script>
   </head>
   ```

2. **Verify CDN Availability**
   - Check network connectivity
   - Verify CDN URL is correct: `https://cdn.treasuredata.com/sdk/2.5/td.min.js`
   - Check for ad blockers or privacy extensions blocking the script

3. **Use Ready Callback**
   ```javascript
   var treasure = new Treasure({
     database: 'your_database',
     writeKey: 'your_key'
   });

   treasure.ready(function() {
     // SDK fully loaded, safe to track
     treasure.trackPageview('pageviews');
   });
   ```

## Advanced Topics

### Custom Client ID Management

For cross-device tracking or specific user identification:

```javascript
var treasure = new Treasure({
  database: 'your_database',
  writeKey: 'your_key',
  clientId: getUserIdFromYourSystem() // Use your own UUID
});

// Reset UUID when user logs out
function onUserLogout() {
  treasure.resetUUID();
  treasure.setAnonymousMode();
}
```

### Server-Side Cookie Integration

Fetch server-side TD cookies for unified tracking:

```javascript
treasure.fetchServerCookie(function(error, cookie) {
  if (!error && cookie) {
    console.log('Server cookie:', cookie);
    // Use cookie data for unified tracking
  }
});
```

### Global ID and User Segments

Fetch TD Global ID and user segments for personalization:

```javascript
// Fetch Global ID
treasure.fetchGlobalID(function(error, globalId) {
  if (!error) {
    console.log('TD Global ID:', globalId);
  }
});

// Fetch user segments (requires audience configuration)
treasure.fetchUserSegments({
  audienceToken: 'your_audience_token',
  keys: {
    td_global_id: 'global_id_value'
  }
}, function(error, segments) {
  if (!error) {
    console.log('User segments:', segments);
    // Use for personalization
  }
});
```

### Automatic Click Tracking

Enable automatic tracking of all link clicks:

```javascript
treasure.trackClicks({
  element: document.body, // Track clicks within body
  tableName: 'clicks', // Target table
  attributes: { // Custom attributes to include
    page: window.location.pathname
  }
});
```

## Testing and Debugging

### Development Mode

Enable console logging during development:

```javascript
var treasure = new Treasure({
  database: 'your_database',
  writeKey: 'your_key',
  development: true // Enables console logging
});
```

### Manual Testing Checklist

1. **Verify SDK Initialization**
   ```javascript
   console.log('Treasure SDK loaded:', typeof Treasure !== 'undefined');
   console.log('Treasure instance:', treasure);
   ```

2. **Test Event Sending**
   ```javascript
   treasure.addRecord('test_events', {
     test_field: 'test_value',
     timestamp: new Date().toISOString()
   }, function(error, response) {
     console.log('Error:', error);
     console.log('Response:', response);
   });
   ```

3. **Check Network Traffic**
   - Open browser DevTools > Network tab
   - Filter by `treasuredata.com`
   - Verify POST requests return 200 OK
   - Inspect request payload

4. **Verify in TD Console**
   - Wait 1-5 minutes for data to appear
   - Query your table: `SELECT * FROM your_database.test_events ORDER BY time DESC LIMIT 10`

### Common Testing Patterns

```javascript
// Create test helper
function testTreasureSDK() {
  console.group('TD SDK Test');

  // Test 1: SDK loaded
  console.log('1. SDK loaded:', typeof Treasure !== 'undefined');

  // Test 2: Instance created
  console.log('2. Instance:', treasure);

  // Test 3: Send test event
  treasure.addRecord('sdk_tests', {
    test_name: 'connection_test',
    timestamp: new Date().toISOString(),
    browser: navigator.userAgent
  }, function(error, response) {
    console.log('3. Test event error:', error);
    console.log('3. Test event response:', response);
  });

  console.groupEnd();
}

// Run tests
testTreasureSDK();
```

## Migration from Other Analytics Platforms

### From Google Analytics

```javascript
// GA pageview
ga('send', 'pageview');

// TD equivalent
treasure.trackPageview('pageviews');

// GA event
ga('send', 'event', 'category', 'action', 'label', value);

// TD equivalent
treasure.trackEvent('ga_events', {
  event_category: 'category',
  event_action: 'action',
  event_label: 'label',
  event_value: value
});

// GA user ID
ga('set', 'userId', 'USER_12345');

// TD equivalent
treasure.set('$global', { user_id: 'USER_12345' });
```

### From Mixpanel

```javascript
// Mixpanel track
mixpanel.track('Event Name', { property: 'value' });

// TD equivalent
treasure.trackEvent('Event Name', { property: 'value' });

// Mixpanel identify
mixpanel.identify('USER_12345');

// TD equivalent
treasure.set('$global', { user_id: 'USER_12345' });

// Mixpanel people.set
mixpanel.people.set({ $email: 'user@example.com' });

// TD equivalent (separate table for user properties)
treasure.addRecord('user_properties', {
  user_id: 'USER_12345',
  email: 'user@example.com'
});
```

## SDK Version and Updates

**Current Recommended Version:** 2.5.x

**Version 4.0+ Important Note:**
If using SDK version 4.0 or higher, configuration changes are required for the new streaming ingestion endpoint. Consult the official migration documentation.

**Version Check:**
```javascript
console.log('TD SDK Version:', Treasure.version);
```

## Resources

- **Official Documentation:** https://api-docs.treasuredata.com/en/sdk/js-sdk/
- **GitHub Repository:** https://github.com/treasure-data/td-js-sdk
- **TD Console:** https://console.treasuredata.com/
- **API Keys:** Profile > API Keys in TD Console
- **Support:** https://support.treasuredata.com/

## Related Skills

- **trino**: Query and analyze data collected via JS SDK using Trino SQL
- **hive**: Query and analyze data using Hive SQL
- **digdag**: Create workflows to process JS SDK event data
- **dbt**: Transform and model JS SDK event data using dbt

---

*Last updated: 2025-01 | SDK Version: 2.5.x*
976
skills/python/SKILL.md
Normal file
@@ -0,0 +1,976 @@
---
name: pytd
description: Expert assistance for using pytd (Python SDK) to query and import data with Treasure Data. Use this skill when users need help with Python-based data analysis, querying Presto/Hive, importing pandas DataFrames, bulk data uploads, or integrating TD with Python analytical workflows.
---

# pytd - Treasure Data Python SDK

Expert assistance for querying and importing data to Treasure Data using pytd, the official Python driver for analytical workflows.

## When to Use This Skill

Use this skill when:
- Querying Treasure Data from Python scripts or Jupyter notebooks
- Importing pandas DataFrames to TD tables
- Running Presto or Hive queries from Python
- Building data pipelines with Python and TD
- Performing bulk data imports or exports
- Migrating from deprecated pandas-td library
- Integrating TD with Python data science workflows
- Handling large result sets with iterative retrieval

## Core Principles

### 1. Installation

**Standard Installation:**
```bash
pip install pytd
```

**Requirements:**
- Python 3.9 or later
- pandas 2.0 or later

### 2. Authentication & Configuration

**Environment Variables (Recommended):**
```bash
export TD_API_KEY="your_api_key_here"
export TD_API_SERVER="https://api.treasuredata.com/"
```

**Client Initialization:**
```python
import pytd

# Using environment variables
client = pytd.Client(database='sample_datasets')

# Explicit credentials (not recommended for production)
client = pytd.Client(
    apikey='your_api_key',
    endpoint='https://api.treasuredata.com/',
    database='your_database',
    default_engine='presto'  # or 'hive'
)
```

**Configuration Options:**
- `apikey`: TD API key (read from `TD_API_KEY` env var if not specified)
- `endpoint`: TD API server URL (read from `TD_API_SERVER` env var)
- `database`: Default database name for queries
- `default_engine`: Query engine - `'presto'` (default) or `'hive'`

**Regional Endpoints:**
- US: `https://api.treasuredata.com/`
- Tokyo: `https://api.treasuredata.co.jp/`
- EU: `https://api.eu01.treasuredata.com/`

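A minimal sketch of pointing the client at a non-US region (Tokyo shown here); the database name is a placeholder, and in practice the endpoint usually comes from the `TD_API_SERVER` environment variable instead.

```python
import os
import pytd

# Prefer the TD_API_SERVER env var; fall back to the Tokyo endpoint otherwise.
endpoint = os.getenv("TD_API_SERVER", "https://api.treasuredata.co.jp/")

client = pytd.Client(
    apikey=os.environ["TD_API_KEY"],  # never hardcode the key
    endpoint=endpoint,
    database="your_database",         # placeholder
)
```
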
### 3. Querying Data

#### Basic Query Execution

```python
import pytd

client = pytd.Client(database='sample_datasets')

# Execute Presto query
result = client.query('SELECT symbol, COUNT(1) as cnt FROM nasdaq GROUP BY symbol LIMIT 10')

# Result format: {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ...]}
print(result['columns'])  # ['symbol', 'cnt']
print(result['data'])     # [['AAIT', 590], ['AAL', 82], ...]
```

#### Query with Hive Engine

```python
# Create Hive client
client_hive = pytd.Client(
    database='sample_datasets',
    default_engine='hive'
)

# Execute Hive query
result = client_hive.query('SELECT hivemall_version()')
```

#### Convert Results to pandas DataFrame

```python
import pandas as pd

result = client.query('SELECT * FROM nasdaq LIMIT 100')

# Convert to DataFrame
df = pd.DataFrame(result['data'], columns=result['columns'])
print(df.head())
```

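Since the columns/data-to-DataFrame step recurs in most of the patterns below, it can be wrapped in a small helper. This is a convenience sketch rather than part of the pytd API; the function name `query_to_df` is made up here.

```python
import pandas as pd
import pytd

def query_to_df(client: pytd.Client, sql: str) -> pd.DataFrame:
    """Run a query through the client and return the result as a DataFrame."""
    result = client.query(sql)
    return pd.DataFrame(result['data'], columns=result['columns'])

# Usage
client = pytd.Client(database='sample_datasets')
df = query_to_df(client, 'SELECT symbol, COUNT(1) AS cnt FROM nasdaq GROUP BY symbol LIMIT 10')
```
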
### 4. Writing Data to TD

#### Load DataFrame to Table

```python
import pandas as pd
import pytd

# Create sample DataFrame
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'event_name': ['login', 'purchase', 'logout', 'login'],
    'amount': [None, 99.99, None, None],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'])
})

# Initialize client
client = pytd.Client(database='your_database')

# Upload DataFrame
client.load_table_from_dataframe(
    df,
    'events',                 # table name
    writer='bulk_import',     # writer type
    if_exists='overwrite'     # or 'append', 'error', 'ignore'
)
```

**Parameters:**
- `df`: pandas DataFrame to upload
- `table`: Target table name (can be `'database.table'` or just `'table'`)
- `writer`: Import method - `'bulk_import'` (default), `'insert_into'`, or `'spark'`
- `if_exists`: What to do if table exists - `'error'` (default), `'overwrite'`, `'append'`, or `'ignore'`

## Common Patterns

### Pattern 1: ETL Pipeline - Query, Transform, Load

```python
import numpy as np
import pandas as pd
import pytd

# Initialize client
client = pytd.Client(database='analytics')

# Step 1: Extract - Query data from TD
query = """
SELECT
    user_id,
    event_name,
    event_date,
    COUNT(*) as event_count
FROM raw_events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY user_id, event_name, event_date
"""

result = client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])

# Step 2: Transform - Process data with pandas
df['event_date'] = pd.to_datetime(df['event_date'])
df['is_weekend'] = df['event_date'].dt.dayofweek >= 5
df['event_count_log'] = np.log1p(df['event_count'])

# Add metadata
df['processed_at'] = pd.Timestamp.now()
df['pipeline_version'] = '1.0'

# Step 3: Load - Write back to TD
client.load_table_from_dataframe(
    df,
    'analytics.user_daily_events',
    writer='bulk_import',
    if_exists='append'
)

print(f"Loaded {len(df)} rows to user_daily_events")
```

**Explanation:** Complete ETL workflow that extracts yesterday's data, performs pandas transformations, and loads results back to TD. Uses bulk_import for efficient loading.

### Pattern 2: Incremental Data Loading

```python
import pandas as pd
import pytd
from datetime import datetime

client = pytd.Client(database='sales')

def load_incremental_data(source_file, table_name, date_column='import_date'):
    """Load new data incrementally, avoiding duplicates"""

    # Read new data from source
    new_data = pd.read_csv(source_file)
    new_data[date_column] = datetime.now()

    # Get max date from existing table
    try:
        result = client.query(f"""
            SELECT MAX({date_column}) as max_date
            FROM {table_name}
        """)

        max_date = result['data'][0][0] if result['data'][0][0] else None

        if max_date:
            # Filter only new records
            new_data = new_data[new_data[date_column] > max_date]
            print(f"Loading {len(new_data)} new records after {max_date}")
        else:
            print(f"Table empty, loading all {len(new_data)} records")

    except Exception as e:
        # Table doesn't exist yet
        print(f"Creating new table with {len(new_data)} records")

    if len(new_data) > 0:
        client.load_table_from_dataframe(
            new_data,
            table_name,
            writer='bulk_import',
            if_exists='append'
        )
        print("Load complete")
    else:
        print("No new data to load")

# Usage
load_incremental_data('daily_sales.csv', 'sales.transactions')
```

**Explanation:** Implements incremental loading by checking the latest timestamp in the target table and only loading newer records. Handles first-time loads gracefully.

### Pattern 3: Large Result Set Processing with DB-API

```python
import pytd
from pytd.dbapi import connect

client = pytd.Client(database='large_dataset')

# Create DB-API connection
conn = connect(client)
cursor = conn.cursor()

# Execute query that might timeout with standard query()
cursor.execute("""
    SELECT user_id, event_name, event_time, properties
    FROM events
    WHERE TD_INTERVAL(time, '-7d', 'JST')
""")

# Process results iteratively (memory efficient)
batch_size = 10000
processed_count = 0

while True:
    rows = cursor.fetchmany(batch_size)
    if not rows:
        break

    # Process batch
    for row in rows:
        user_id, event_name, event_time, properties = row
        # Process each row
        process_event(user_id, event_name, event_time, properties)

    processed_count += len(rows)
    print(f"Processed {processed_count} rows...")

cursor.close()
conn.close()

print(f"Total processed: {processed_count} rows")
```

**Explanation:** Uses DB-API for iterative retrieval of large result sets. Prevents memory issues and query timeouts by fetching data in batches. Essential for processing millions of rows.

### Pattern 4: Multi-Database Operations

```python
import pandas as pd
import pytd

# Connect to different databases
source_client = pytd.Client(database='raw_data')
target_client = pytd.Client(database='analytics')

# Query from source database
query = """
SELECT
    customer_id,
    product_id,
    purchase_date,
    amount
FROM purchases
WHERE TD_INTERVAL(time, '-1d', 'JST')
"""

result = source_client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])

# Enrich data by querying another source
product_query = "SELECT product_id, product_name, category FROM products"
products_result = source_client.query(product_query)
products_df = pd.DataFrame(products_result['data'], columns=products_result['columns'])

# Join data
enriched_df = df.merge(products_df, on='product_id', how='left')

# Calculate metrics
daily_summary = enriched_df.groupby(['category', 'purchase_date']).agg({
    'amount': ['sum', 'mean', 'count'],
    'customer_id': 'nunique'
}).reset_index()

daily_summary.columns = ['category', 'date', 'total_sales', 'avg_sale', 'transaction_count', 'unique_customers']

# Write to analytics database
target_client.load_table_from_dataframe(
    daily_summary,
    'daily_category_sales',
    writer='bulk_import',
    if_exists='append'
)

print(f"Loaded {len(daily_summary)} rows to analytics.daily_category_sales")
```

**Explanation:** Demonstrates working with multiple databases, joining data, performing aggregations, and writing to a different target database.

### Pattern 5: Handling Time-based Data with TD Functions

```python
import pandas as pd
import pytd

client = pytd.Client(database='events')

# Query with TD time functions
query = """
SELECT
    TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date_jst,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users,
    APPROX_PERCENTILE(session_duration, 0.5) as median_duration,
    APPROX_PERCENTILE(session_duration, 0.95) as p95_duration
FROM user_sessions
WHERE TD_INTERVAL(time, '-7d', 'JST')
GROUP BY 1
ORDER BY 1 DESC
"""

result = client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])

# Convert date strings to datetime
df['date_jst'] = pd.to_datetime(df['date_jst'])

# Add derived metrics
df['events_per_user'] = df['event_count'] / df['unique_users']

# Write summary back
client.load_table_from_dataframe(
    df,
    'weekly_session_summary',
    writer='bulk_import',
    if_exists='overwrite'
)
```

**Explanation:** Shows proper use of TD time functions (TD_INTERVAL, TD_TIME_FORMAT) in queries and how to handle the results in pandas.

## Writer Types Comparison

pytd supports three writer methods for loading data:

### 1. bulk_import (Default - Recommended)

**Best for:** Most use cases, especially large datasets

```python
client.load_table_from_dataframe(
    df,
    'table_name',
    writer='bulk_import',
    if_exists='append'
)
```

**Characteristics:**
- ✓ Scalable to large datasets
- ✓ Memory efficient (streams data)
- ✓ No special permissions required
- ✓ Best balance of performance and simplicity
- ✗ Slower than Spark for very large datasets
- Uses CSV format internally

**When to use:** Default choice for most data loads (100s of MB to GBs)

### 2. insert_into

**Best for:** Small datasets, real-time updates

```python
client.load_table_from_dataframe(
    df,
    'table_name',
    writer='insert_into',
    if_exists='append'
)
```

**Characteristics:**
- ✓ Simple, no dependencies
- ✓ Good for small datasets (<1000 rows)
- ✗ Not scalable (issues individual INSERT queries)
- ✗ Slow for large datasets
- ✗ Uses Presto query capacity
- Uses Presto INSERT INTO statements

**When to use:** Only for small datasets or when you need immediate writes without bulk import delay

### 3. spark (High Performance)

**Best for:** Very large datasets, high-performance pipelines

```python
from pytd.writer import SparkWriter

writer = SparkWriter(
    td_spark_path='/path/to/td-spark-assembly.jar'  # Optional
)

client.load_table_from_dataframe(
    df,
    'table_name',
    writer=writer,
    if_exists='append'
)
```

**Characteristics:**
- ✓ Highest performance
- ✓ Direct writes to Plazma storage
- ✓ Best for very large datasets (10s of GBs+)
- ✗ Requires `pytd[spark]` installation
- ✗ Requires Plazma Public API access (contact support)
- ✗ Additional dependencies

**When to use:** Large-scale data pipelines requiring maximum throughput

**Enabling Spark Writer:**
1. Install: `pip install pytd[spark]`
2. Contact `support@treasuredata.com` to enable Plazma Public API access
3. (Optional) Download td-spark JAR for custom versions

## Best Practices

1. **Use Environment Variables for Credentials**
   ```bash
   export TD_API_KEY="your_api_key"
   export TD_API_SERVER="https://api.treasuredata.com/"
   ```
   Never hardcode API keys in scripts.

2. **Choose the Right Writer**
   - `bulk_import`: Default choice for most scenarios
   - `insert_into`: Only for small datasets (<1000 rows)
   - `spark`: For very large datasets with proper setup

3. **Use TD Time Functions in Queries**
   ```python
   # Good: Uses partition pruning
   query = "SELECT * FROM table WHERE TD_INTERVAL(time, '-1d', 'JST')"

   # Avoid: Scans entire table
   query = "SELECT * FROM table WHERE date = '2024-01-01'"
   ```

4. **Handle Large Results with DB-API**
   Use `pytd.dbapi` for queries returning millions of rows to avoid memory issues.

5. **Specify Database in Table Name**
   ```python
   # Explicit database (recommended)
   client.load_table_from_dataframe(df, 'database.table')

   # Uses client's default database
   client.load_table_from_dataframe(df, 'table')
   ```

6. **Add Time Column for Partitioning**
   ```python
   df['time'] = pd.to_datetime(df['timestamp']).astype('int64') // 10**9
   client.load_table_from_dataframe(df, 'table')
   ```

7. **Use Presto for Analytics, Hive for Special Functions**
   - Presto: Faster for most analytical queries
   - Hive: Required for Hivemall, UDFs, some advanced features

8. **Batch Processing for Large ETL**
   Process data in chunks to avoid memory issues:
   ```python
   for chunk in pd.read_csv('large_file.csv', chunksize=100000):
       # Process chunk
       client.load_table_from_dataframe(chunk, 'table', if_exists='append')
   ```

9. **Error Handling**
   ```python
   try:
       result = client.query(query)
   except Exception as e:
       print(f"Query failed: {e}")
       # Handle error appropriately
   ```

10. **Close Connections in Long-Running Scripts**
    ```python
    from pytd.dbapi import connect

    conn = connect(client)
    try:
        # Use connection
        cursor = conn.cursor()
        cursor.execute(query)
        # Process results
    finally:
        conn.close()
    ```

## Common Issues and Solutions

### Issue: Import Errors or Module Not Found

**Symptoms:**
- `ModuleNotFoundError: No module named 'pytd'`
- `ImportError: cannot import name 'SparkWriter'`

**Solutions:**
1. **Verify Installation**
   ```bash
   pip list | grep pytd
   ```

2. **Install/Upgrade pytd**
   ```bash
   pip install --upgrade pytd
   ```

3. **For Spark Support**
   ```bash
   pip install pytd[spark]
   ```

4. **Check Python Version**
   ```bash
   python --version  # Should be 3.9+
   ```

### Issue: Authentication Errors

**Symptoms:**
- `Unauthorized: Invalid API key`
- `403 Forbidden`

**Solutions:**
1. **Verify Environment Variables**
   ```bash
   echo $TD_API_KEY
   echo $TD_API_SERVER
   ```

2. **Check API Key Format**
   ```python
   # Verify API key is set correctly
   import os
   print(os.getenv('TD_API_KEY'))
   ```

3. **Verify Regional Endpoint**
   ```python
   # US
   endpoint = 'https://api.treasuredata.com/'
   # Tokyo
   endpoint = 'https://api.treasuredata.co.jp/'
   # EU
   endpoint = 'https://api.eu01.treasuredata.com/'
   ```

4. **Check API Key Permissions**
   - Ensure key has appropriate read/write permissions
   - Regenerate key if necessary from TD console

### Issue: Query Timeout or Memory Errors

**Symptoms:**
- Query times out after several minutes
- `MemoryError` when fetching large results
- Connection drops during query execution

**Solutions:**
1. **Use DB-API for Large Results**
   ```python
   from pytd.dbapi import connect

   conn = connect(client)
   cursor = conn.cursor()
   cursor.execute(query)

   # Fetch in batches
   while True:
       rows = cursor.fetchmany(10000)
       if not rows:
           break
       for row in rows:
           process(row)
   ```

2. **Add Time Filters for Partition Pruning**
   ```python
   query = """
   SELECT * FROM large_table
   WHERE TD_INTERVAL(time, '-1d', 'JST')  -- Add this!
   """
   ```

3. **Limit Result Size**
   ```python
   query = "SELECT * FROM table WHERE ... LIMIT 100000"
   ```

4. **Use Aggregations Instead of Raw Data**
   ```python
   # Instead of fetching all rows
   query = "SELECT * FROM table"

   # Aggregate first
   query = """
   SELECT date, user_id, COUNT(*) as cnt
   FROM table
   GROUP BY 1, 2
   """
   ```

### Issue: DataFrame Upload Fails

**Symptoms:**
- `ValueError: DataFrame is empty`
- Type errors during upload
- Data corruption in uploaded table

**Solutions:**
1. **Check DataFrame is Not Empty**
   ```python
   if df.empty:
       print("DataFrame is empty, skipping upload")
   else:
       client.load_table_from_dataframe(df, 'table')
   ```

2. **Handle Data Types Properly**
   ```python
   # Convert timestamps to Unix epoch
   df['time'] = pd.to_datetime(df['timestamp']).astype('int64') // 10**9

   # Handle NaN values
   df['amount'] = df['amount'].fillna(0)

   # Convert to appropriate types
   df['user_id'] = df['user_id'].astype(str)
   df['count'] = df['count'].astype(int)
   ```

3. **Check Column Names**
   ```python
   # TD column names should be lowercase and use underscores
   df.columns = df.columns.str.lower().str.replace(' ', '_')
   ```

4. **Remove Invalid Characters**
   ```python
   # Remove or replace problematic characters
   df = df.applymap(lambda x: str(x).replace('\x00', '') if isinstance(x, str) else x)
   ```

5. **Try Different Writer**
   ```python
   # If bulk_import fails, try insert_into for debugging
   client.load_table_from_dataframe(
       df.head(10),  # Test with small sample
       'table',
       writer='insert_into'
   )
   ```

### Issue: Spark Writer Not Working

**Symptoms:**
- `ImportError: Spark writer not available`
- Spark job fails
- Permission denied errors

**Solutions:**
1. **Install Spark Dependencies**
   ```bash
   pip install pytd[spark]
   ```

2. **Enable Plazma Public API**
   - Contact `support@treasuredata.com`
   - Request Plazma Public API access for your account

3. **Specify JAR Path (if needed)**
   ```python
   from pytd.writer import SparkWriter

   writer = SparkWriter(
       td_spark_path='/path/to/td-spark-assembly.jar'
   )
   ```

4. **Check Permissions**
   - Ensure API key has write access to target database
   - Verify Plazma access is enabled

## Advanced Topics

### Custom Query Options

```python
# Query with custom parameters
result = client.query(
    'SELECT * FROM table',
    engine='presto',
    priority=1,     # Job priority ranges from -2 (very low) to 2 (very high); default 0
    retry_limit=3
)
```

### Working with Job Status

```python
# Start query asynchronously
job = client.query('SELECT COUNT(*) FROM large_table', wait=False)

# Check job status
print(f"Job ID: {job.job_id}")
print(f"Status: {job.status()}")

# Wait for completion
job.wait()

# Get results
if job.success():
    result = job.result()
else:
    print(f"Job failed: {job.error()}")
```
### Custom Writers

```python
from pytd.writer import BulkImportWriter

# Configure writer with custom options
writer = BulkImportWriter(
    chunk_size=10000,    # Rows per chunk
    time_column='time'   # Specify time column
)

client.load_table_from_dataframe(
    df,
    'table',
    writer=writer,
    if_exists='append'
)
```
### Migrating from pandas-td

If you have existing code using the deprecated `pandas-td` library:

**Before (pandas-td):**
```python
import pandas_td as td

con = td.connect(apikey='your_api_key', endpoint='https://api.treasuredata.com/')
df = td.read_td('SELECT * FROM sample_datasets.nasdaq', con)
```

**After (pytd):**
```python
import pytd.pandas_td as td

con = td.connect(apikey='your_api_key', endpoint='https://api.treasuredata.com/')
df = td.read_td('SELECT * FROM sample_datasets.nasdaq', con)
```

Or use the modern pytd API:
```python
import pytd
import pandas as pd

client = pytd.Client(database='sample_datasets')
result = client.query('SELECT * FROM nasdaq')
df = pd.DataFrame(result['data'], columns=result['columns'])
```
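The compatibility layer also covers writes. If your pandas-td code called `to_td`, the same call is available from `pytd.pandas_td`; the sketch below assumes a `my_db.my_table` destination, and the exact keyword arguments should be checked against the pytd documentation for your version.

```python
import pandas as pd
import pytd.pandas_td as td

con = td.connect(apikey='your_api_key', endpoint='https://api.treasuredata.com/')

df = pd.DataFrame({'user_id': ['a', 'b'], 'value': [1, 2]})

# Write a DataFrame back to TD through the pandas-td-compatible API
td.to_td(df, 'my_db.my_table', con, if_exists='replace', index=False)
```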
## Testing and Development

### Test Connection

```python
import pytd

try:
    client = pytd.Client(database='sample_datasets')
    result = client.query('SELECT 1 as test')
    print("Connection successful!")
    print(result)
except Exception as e:
    print(f"Connection failed: {e}")
```
### Verify Data Upload

```python
import pandas as pd
import pytd

# Create test data
test_df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'value': [100, 200, 300],
    'time': [1704067200, 1704153600, 1704240000]  # Unix timestamps
})

client = pytd.Client(database='test_db')

# Upload
print("Uploading test data...")
client.load_table_from_dataframe(
    test_df,
    'test_table',
    writer='bulk_import',
    if_exists='overwrite'
)

# Verify
print("Verifying upload...")
result = client.query('SELECT * FROM test_table ORDER BY id')
verify_df = pd.DataFrame(result['data'], columns=result['columns'])

print("\nUploaded data:")
print(verify_df)

# Check counts match
assert len(test_df) == len(verify_df), "Row count mismatch!"
print("\nVerification successful!")
```
### Performance Testing

```python
import pytd
import pandas as pd
import time

client = pytd.Client(database='test_db')

# Generate test data
df = pd.DataFrame({
    'id': range(100000),
    'value': range(100000),
    'time': int(time.time())
})

# Test bulk_import
start = time.time()
client.load_table_from_dataframe(df, 'perf_test_bulk', writer='bulk_import', if_exists='overwrite')
bulk_time = time.time() - start
print(f"bulk_import: {bulk_time:.2f}s for {len(df)} rows")

# Test insert_into (small sample only!)
small_df = df.head(100)
start = time.time()
client.load_table_from_dataframe(small_df, 'perf_test_insert', writer='insert_into', if_exists='overwrite')
insert_time = time.time() - start
print(f"insert_into: {insert_time:.2f}s for {len(small_df)} rows")
```
## Jupyter Notebook Integration

pytd works seamlessly with Jupyter notebooks:

```python
# Notebook cell 1: Setup
import pytd
import pandas as pd
import matplotlib.pyplot as plt

client = pytd.Client(database='analytics')

# Notebook cell 2: Query data
query = """
SELECT
    TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
    COUNT(*) as events
FROM user_events
WHERE TD_INTERVAL(time, '-30d', 'JST')
GROUP BY 1
ORDER BY 1
"""

result = client.query(query)
df = pd.DataFrame(result['data'], columns=result['columns'])
df['date'] = pd.to_datetime(df['date'])

# Notebook cell 3: Visualize
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['events'])
plt.title('Daily Events - Last 30 Days')
plt.xlabel('Date')
plt.ylabel('Event Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Notebook cell 4: Write results back
summary = df.describe()
# Process and save summary back to TD if needed
```
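The final cell above leaves the write-back step open. A minimal sketch of persisting the summary follows; the `daily_event_summary` table name and the shape of the summary frame are assumptions, not part of pytd.

```python
# Notebook cell 5 (illustrative): save summary statistics back to TD
import time

summary = df[['events']].describe().reset_index().rename(columns={'index': 'statistic'})
summary.columns = summary.columns.str.lower().str.replace(' ', '_')
summary['time'] = int(time.time())  # conventional unix `time` column for TD tables

client.load_table_from_dataframe(
    summary,
    'daily_event_summary',  # assumed destination table
    writer='bulk_import',
    if_exists='overwrite'
)
```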
## Resources

- **Documentation**: https://pytd-doc.readthedocs.io/
- **GitHub Repository**: https://github.com/treasure-data/pytd
- **PyPI Package**: https://pypi.org/project/pytd/
- **TD Python Guide**: https://docs.treasuredata.com/
- **Example Notebooks**: See GitHub repository for Google Colab examples

## Related Skills

- **trino**: Understanding Trino SQL syntax for queries in pytd
- **hive**: Using Hive-specific functions and syntax
- **digdag**: Orchestrating Python scripts using pytd in workflows
- **td-javascript-sdk**: Browser-based data collection (frontend) vs pytd (backend/analytics)
## Comparison with Other Tools

| Tool | Purpose | When to Use |
|------|---------|-------------|
| **pytd** | Full-featured Python driver | Analytics, data pipelines, pandas integration |
| **td-client-python** | Basic REST API wrapper | Simple CRUD, when pytd is too heavy |
| **pandas-td** (deprecated) | Legacy pandas integration | Don't use; migrate to pytd |
| **TD Toolbelt** | CLI tool | Command-line operations, shell scripts |

**Recommendation:** Use pytd for all Python-based analytical work and ETL pipelines. Use td-client-python only for basic REST API operations.
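For reference, a basic td-client-python call looks like the sketch below (adapted from the td-client-python README; check its documentation for the current API). Note that it returns raw rows rather than pandas DataFrames, which is why pytd is preferred for analytical work.

```python
import tdclient

# Reads the API key from the TD_API_KEY environment variable by default
with tdclient.Client() as td:
    job = td.query('sample_datasets', 'SELECT symbol, COUNT(1) FROM nasdaq GROUP BY symbol', type='presto')
    job.wait()                   # block until the job finishes
    for row in job.result():     # rows come back as plain tuples, not a DataFrame
        print(row)
```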
---

*Last updated: 2025-01 | pytd version: Latest (Python 3.9+)*