Initial commit

2025-11-30 08:43:40 +08:00
commit d6cdda3f30
13 changed files with 4664 additions and 0 deletions
--- a/skills/databento/references/cost-optimization.md
+++ b/skills/databento/references/cost-optimization.md
@@ -0,0 +1,501 @@
+# Databento Cost Optimization Guide
+
+Strategies and best practices for minimizing costs when working with Databento market data.
+
+## Databento Pricing Model
+
+### Cost Components
+
+1. **Databento Usage Fees** - Pay-per-use or subscription
+2. **Exchange License Fees** - Venue-dependent (varies by exchange)
+3. **Data Volume** - Amount of data retrieved
+
+### Pricing Tiers
+
+**Free Credits:**
+- $125 free credits for new users
+- Good for initial development and testing
+
+**Usage-Based:**
+- Pay only for data you use
+- Varies by venue and data type
+- No minimum commitment
+
+**Subscriptions:**
+- Basic Plan: $199/month
+- Corporate Actions/Security Master: $299/month
+- Flat-rate access to specific datasets
+
+## Cost Estimation (ALWAYS Do This First)
+
+### Use metadata_get_cost Before Every Request
+
+**Always** estimate cost before fetching data:
+
+```python
+mcp__databento__metadata_get_cost(
+    dataset="GLBX.MDP3",
+    start="2024-01-01",
+    end="2024-01-31",
+    symbols="ES.c.0",
+    schema="ohlcv-1h"
+)
+```
+
+**Returns:**
+- Estimated cost in USD
+- Data size estimate
+
+Use both figures to decide whether the request is reasonable.
+
+### When Cost Checks Matter Most
+
+1. **Multi-day tick data** - Can be expensive
+2. **Multiple symbols** - Costs multiply
+3. **High-granularity schemas** - trades, mbp-1, mbo
+4. **Long date ranges** - Weeks or months of data
+
+**Example Cost Check:**
+```python
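+# cost_check(...) is shorthand for the metadata_get_cost call shown above
+# (dataset and symbols omitted for brevity); dollar figures are illustrative.
+
+# Cheap: 1 month of daily bars
+cost_check(schema="ohlcv-1d", start="2024-01-01", end="2024-01-31")
+# Estimated: $0.10
+
+# Expensive: 1 month of tick trades
+cost_check(schema="trades", start="2024-01-01", end="2024-01-31")
+# Estimated: $50-$200 (depends on volume)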
+```
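+
+A minimal sketch of turning the cost check into a guard that runs before any fetch. The names `within_budget` and `fake_estimator` are hypothetical and not part of any Databento API; in practice the estimator would wrap the `metadata_get_cost` call, and the dollar figures below are the illustrative ones from the example above.
+
+```python
+from typing import Callable
+
+
+def within_budget(estimate_cost: Callable[..., float], budget_usd: float, **request) -> bool:
+    """Return True when the estimated cost of `request` fits the budget."""
+    estimated = estimate_cost(**request)
+    return estimated <= budget_usd
+
+
+def fake_estimator(schema: str, **_ignored) -> float:
+    # Stand-in for a real cost lookup; the figures are illustrative only
+    return {"ohlcv-1d": 0.10, "trades": 50.0}[schema]
+
+
+ok_daily = within_budget(
+    fake_estimator, 5.0, schema="ohlcv-1d", start="2024-01-01", end="2024-01-31"
+)
+ok_trades = within_budget(
+    fake_estimator, 5.0, schema="trades", start="2024-01-01", end="2024-01-31"
+)
+print(ok_daily, ok_trades)
+```
+
+Passing the estimator in as a callable keeps the guard testable without network access.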
+
+## Historical Data (T+1) - No Licensing Required
+
+**Key Insight:** Historical data that is **24+ hours old (T+1)** does not require exchange licensing fees.
+
+### Cost Breakdown
+
+**Live/Recent Data (< 24 hours):**
+- Databento fees + Exchange licensing fees
+
+**Historical Data (24+ hours old):**
+- Databento fees only (no exchange licensing)
+- Significantly cheaper
+
+### Optimization Strategy
+
+**For Development:**
+- Use T+1 data for strategy development
+- Switch to live data only for production
+
+**For Backtesting:**
+- Always use historical (T+1) data
+- Much more cost-effective
+- Same data quality
+
+**Example:**
+```python
+# Assuming today is 2024-11-06:
+# Expensive: Yesterday's data (< 24 hours)
+start="2024-11-05"  # Requires licensing
+
+# Cheap: 3 days ago (> 24 hours)
+start="2024-11-03"  # No licensing required
+```
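+
+The 24-hour rule described above can be checked before a request is sent. This is a sketch only: `is_t_plus_1` is a hypothetical helper, and it assumes the rule applies to the end of the requested range, measured in UTC.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+
+def is_t_plus_1(end: datetime, now: datetime) -> bool:
+    """True when the end of the requested range is at least 24 hours before `now`."""
+    return now - end >= timedelta(hours=24)
+
+
+# Fixed "now" so the example is reproducible
+now = datetime(2024, 11, 6, 12, 0, tzinfo=timezone.utc)
+
+recent_end = datetime(2024, 11, 6, 0, 0, tzinfo=timezone.utc)  # 12 hours old
+older_end = datetime(2024, 11, 4, 0, 0, tzinfo=timezone.utc)   # 60 hours old
+
+print(is_t_plus_1(recent_end, now), is_t_plus_1(older_end, now))
+```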
+
+## Schema Selection for Cost
+
+Different schemas have vastly different costs due to data volume.
+
+### Schema Cost Hierarchy (Cheapest to Most Expensive)
+
+1. **ohlcv-1d** (Cheapest)
+   - ~100 bytes per record
+   - ~250 records per symbol per year
+   - **Best for:** Long-term backtesting
+
+2. **ohlcv-1h**
+   - ~100 bytes per record
+   - ~6,000 records per symbol per year
+   - **Best for:** Multi-day backtesting
+
+3. **ohlcv-1m**
+   - ~100 bytes per record
+   - ~360,000 records per symbol per year
+   - **Best for:** Intraday strategies
+
+4. **trades**
+   - ~50 bytes per record
+   - ~100K-500K records per symbol per day (ES/NQ)
+   - **Best for:** Tick analysis (use selectively)
+
+5. **mbp-1**
+   - ~150 bytes per record
+   - ~1M-5M records per symbol per day
+   - **Best for:** Order flow analysis (use selectively)
+
+6. **mbp-10**
+   - ~500 bytes per record
+   - ~1M-5M records per symbol per day
+   - **Best for:** Deep order book analysis (expensive!)
+
+7. **mbo** (Most Expensive)
+   - ~80 bytes per record
+   - ~5M-20M records per symbol per day
+   - **Best for:** Order-level research (very expensive!)
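+
+To see how quickly volume grows, the record sizes and daily record counts listed above can be multiplied out. This is a rough sketch of raw, uncompressed size per symbol, not a price; the 250 trading days per year is an assumption chosen to match the ~250 daily bars per year quoted for ohlcv-1d.
+
+```python
+TRADING_DAYS_PER_YEAR = 250  # assumption, matches ~250 daily bars per year
+
+# schema -> (bytes per record, low records/day, high records/day), from the list above
+SCHEMAS = {
+    "ohlcv-1d": (100, 1, 1),
+    "trades": (50, 100_000, 500_000),
+    "mbp-1": (150, 1_000_000, 5_000_000),
+    "mbp-10": (500, 1_000_000, 5_000_000),
+    "mbo": (80, 5_000_000, 20_000_000),
+}
+
+
+def yearly_bytes(schema: str) -> tuple[int, int]:
+    """Return the (low, high) raw bytes per symbol per year for a schema."""
+    size, low, high = SCHEMAS[schema]
+    return (size * low * TRADING_DAYS_PER_YEAR, size * high * TRADING_DAYS_PER_YEAR)
+
+
+for name in SCHEMAS:
+    low, high = yearly_bytes(name)
+    print(f"{name}: {low / 1e9:.4f} to {high / 1e9:.4f} GB per symbol per year")
+```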
+
+### Cost Optimization Strategy
+
+**Start with lower granularity:**
+1. Develop strategy with ohlcv-1h or ohlcv-1d
+2. Validate with ohlcv-1m if needed
+3. Only use trades/mbp-1 if absolutely necessary
+4. Avoid mbp-10/mbo unless essential
+
+**Example:**
+```python
+# Cheap: Daily bars for 1 year
+schema="ohlcv-1d"
+start="2023-01-01"
+end="2023-12-31"
+# Cost: < $1
+
+# Expensive: Trades for 1 year
+schema="trades"
+start="2023-01-01"
+end="2023-12-31"
+# Cost: $500-$2000 (depending on venue)
+```
+
+## Symbol Selection
+
+Fewer symbols = lower cost. Be selective.
+
+### Strategies
+
+**1. Start with Single Symbol**
+```python
+# Development
+symbols="ES.c.0"  # Just ES
+
+# After validation, expand
+symbols="ES.c.0,NQ.c.0"  # Add NQ
+```
+
+**2. Use Continuous Contracts**
+```python
+# Good: Single continuous contract
+symbols="ES.c.0"  # Covers all front months
+
+# Wasteful: Multiple specific contracts
+symbols="ESH5,ESM5,ESU5,ESZ5"  # Adds back-month data you may not need, up to 4x cost
+```
+
+**3. Avoid Symbol Wildcards**
+```python
+# Expensive: All instruments
+symbols="*"  # Don't do this!
+
+# Targeted: Just what you need
+symbols="ES.c.0,NQ.c.0"  # Explicit
+```
+
+## Date Range Optimization
+
+Request only the data you need.
+
+### Strategies
+
+**1. Iterative Refinement**
+```python
+# First: Test with small range
+start="2024-01-01"
+end="2024-01-07"  # Just 1 week
+
+# Then: Expand after validation
+start="2024-01-01"
+end="2024-12-31"  # Full year
+```
+
+**2. Segment Long Ranges**
+```python
+# Instead of: 6 years at once
+start="2019-01-01"
+end="2024-12-31"
+
+# Do: Segment by year
+start="2024-01-01"
+end="2024-12-31"
+# Process, then request next year if needed
+```
+
+**3. Use Limit for Testing**
+```python
+# Test with small limit first
+limit=100  # Just 100 records
+
+# After validation, increase or remove
+limit=10000  # Larger sample
+```
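+
+Segmenting a long range by year (strategy 2 above) can be automated. The sketch below uses a hypothetical helper, `yearly_segments`, and treats both dates as inclusive labels; check how your client interprets `end` before passing the segments on.
+
+```python
+from datetime import date
+
+
+def yearly_segments(start: date, end: date) -> list[tuple[date, date]]:
+    """Split [start, end] into consecutive segments that never cross a year boundary."""
+    segments = []
+    current = start
+    while current <= end:
+        year_end = date(current.year, 12, 31)
+        segments.append((current, min(year_end, end)))
+        current = date(current.year + 1, 1, 1)
+    return segments
+
+
+segments = yearly_segments(date(2019, 1, 1), date(2024, 12, 31))
+for seg_start, seg_end in segments:
+    print(seg_start.isoformat(), seg_end.isoformat())
+```
+
+Each segment can then be cost-checked and fetched on its own, stopping as soon as enough data has been collected.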
+
+## Batch vs Timeseries Selection
+
+Choose the right tool for the job.
+
+### Timeseries (< 5GB)
+**When to use:**
+- Small to medium datasets
+- Quick exploration
+- <= 1 day of tick data
+- Any OHLCV data
+
+**Benefits:**
+- Immediate results
+- No job management
+- Direct response
+
+**Costs:**
+- Same per-record cost as batch
+
+### Batch Downloads (> 5GB)
+**When to use:**
+- Large datasets (> 5GB)
+- Multi-day tick data
+- Multiple symbols over long periods
+- Production data pipelines
+
+**Benefits:**
+- More efficient for large data
+- Can split output files
+- Asynchronous processing
+
+**Costs:**
+- Same per-record cost as timeseries
+- No additional fees for batch processing
+
+### Decision Matrix
+
+| Data Type | Date Range | Method |
+|-----------|-----------|--------|
+| ohlcv-1h | 1 year | Timeseries |
+| ohlcv-1d | Any | Timeseries |
+| trades | 1 day | Timeseries |
+| trades | 1 week+ | Batch |
+| mbp-1 | 1 day | Batch (safer) |
+| mbp-1 | 1 week+ | Batch |
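+
+The matrix above reduces to a size check against the 5GB cutoff. A minimal sketch, assuming a size estimate in bytes is already available (for example from the cost check); `choose_method` is a hypothetical helper name.
+
+```python
+BATCH_THRESHOLD_BYTES = 5 * 10**9  # the 5GB cutoff used in this guide
+
+
+def choose_method(estimated_bytes: int) -> str:
+    """Pick 'batch' for requests above the cutoff, 'timeseries' otherwise."""
+    return "batch" if estimated_bytes > BATCH_THRESHOLD_BYTES else "timeseries"
+
+
+print(choose_method(25_000))         # about one year of daily bars
+print(choose_method(6_250_000_000))  # a heavy year of trades
+```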
+
+## DBEQ Bundle - Zero Exchange Fees
+
+Databento offers a special bundle for US equities with **$0 exchange fees**.
+
+### DBEQ.BASIC Dataset
+
+**Coverage:**
+- US equity securities
+- Zero licensing fees
+- Databento usage fees only
+
+**When to use:**
+- Equity market breadth for ES/NQ analysis
+- Testing equity strategies
+- Learning market data APIs
+
+**Example:**
+```python
+# Regular equity dataset (has exchange fees)
+dataset="XNAS.ITCH"
+# Cost: Databento + Nasdaq fees
+
+# DBEQ bundle (no exchange fees)
+dataset="DBEQ.BASIC"
+# Cost: Databento fees only
+```
+
+## Caching and Reuse
+
+Don't fetch the same data multiple times.
+
+### Strategies
+
+**1. Cache Locally**
+```python
+# First request: Fetch and save
+data = fetch_data(...)
+save_to_disk(data, "ES_2024_ohlcv1h.csv")
+
+# Subsequent runs: Load from disk
+data = load_from_disk("ES_2024_ohlcv1h.csv")
+```
+
+**2. Incremental Updates**
+```python
+# Initial: Fetch full history
+start="2023-01-01"
+end="2024-01-01"
+
+# Later: Fetch only new data
+start="2024-01-01"  # Resume from last fetch
+end="2024-12-31"
+```
+
+**3. Share Data Across Analyses**
+```python
+# Fetch once
+historical_data = fetch_data(schema="ohlcv-1h", ...)
+
+# Use multiple times
+backtest_strategy_a(historical_data)
+backtest_strategy_b(historical_data)
+backtest_strategy_c(historical_data)
+```
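+
+The local caching pattern (strategy 1) can be sketched with the standard library alone. `load_or_fetch` and `fake_fetch` are hypothetical names, and the single row of sample data is made up for illustration; a real `fetch` callable would wrap the actual data request.
+
+```python
+import csv
+import tempfile
+from pathlib import Path
+from typing import Callable
+
+Rows = list[dict[str, str]]
+
+
+def load_or_fetch(cache_path: Path, fetch: Callable[[], Rows]) -> Rows:
+    """Return cached rows if the file exists; otherwise fetch once and save."""
+    if cache_path.exists():
+        with cache_path.open(newline="") as f:
+            return list(csv.DictReader(f))
+    rows = fetch()
+    if not rows:
+        return rows  # nothing to cache
+    with cache_path.open("w", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
+        writer.writeheader()
+        writer.writerows(rows)
+    return rows
+
+
+calls = []
+
+
+def fake_fetch() -> Rows:
+    calls.append(1)  # count how often the paid request would have run
+    return [{"ts": "2024-01-02T14:00:00Z", "close": "4780.25"}]
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    path = Path(tmp) / "ES_2024_ohlcv1h.csv"
+    first = load_or_fetch(path, fake_fetch)   # fetches and writes the file
+    second = load_or_fetch(path, fake_fetch)  # reads from disk
+
+print(len(calls), first == second)
+```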
+
+## Session-Based Analysis
+
+For ES/NQ, consider filtering by trading session to reduce data volume.
+
+### Sessions
+
+- **Asian Session:** 6pm-2am ET
+- **London Session:** 2am-8am ET
+- **New York Session:** 8am-4pm ET
+
+### Cost Benefit
+
+**Full 24-hour data:**
+- Maximum data volume
+- Higher cost
+
+**Session-filtered data:**
+- 1/3 to 1/2 the volume
+- Lower cost
+- May be sufficient for analysis
+
+**Example:**
+```python
+# Expensive: Full 24-hour data
+# Process all records
+
+# Cheaper: NY session only
+# Filter records to 8am-4pm ET
+# ~1/3 of the hours; actual volume share depends on session activity
+```
+
+Use `scripts/session_filter.py` to filter post-fetch, or request only specific hours.
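+
+As an illustration of post-fetch filtering, the sketch below keeps only timestamps inside the New York session (8am-4pm ET). It is not the contents of `scripts/session_filter.py`; `in_ny_session` is a hypothetical helper, and it assumes timestamps arrive as timezone-aware UTC datetimes.
+
+```python
+from datetime import datetime, time, timezone
+from zoneinfo import ZoneInfo
+
+NY = ZoneInfo("America/New_York")
+
+
+def in_ny_session(ts_utc: datetime) -> bool:
+    """True when the timestamp falls in 08:00 <= t < 16:00 New York local time."""
+    local = ts_utc.astimezone(NY).time()
+    return time(8, 0) <= local < time(16, 0)
+
+
+timestamps = [
+    datetime(2024, 1, 2, 7, 0, tzinfo=timezone.utc),    # 02:00 ET
+    datetime(2024, 1, 2, 14, 30, tzinfo=timezone.utc),  # 09:30 ET
+    datetime(2024, 1, 2, 21, 0, tzinfo=timezone.utc),   # 16:00 ET, excluded
+]
+kept = [ts for ts in timestamps if in_ny_session(ts)]
+print(kept)
+```
+
+Converting through a named time zone rather than a fixed offset keeps the filter correct across daylight saving changes.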
+
+## Monitoring Usage
+
+Track your usage to avoid surprises.
+
+### Check Dashboard
+- Databento provides a usage dashboard
+- Monitor monthly spend
+- Set alerts for limits
+
+### Set Monthly Limits
+```python
+# Set in account settings, not in code
+monthly_limit = 500  # USD
+```
+
+### Review Costs Regularly
+- Check cost estimates vs actual
+- Identify expensive queries
+- Adjust strategies
+
+## Cost Optimization Checklist
+
+Before every data request:
+
+- [ ] **Estimate cost first** - Use metadata_get_cost
+- [ ] **Use T+1 data** - Avoid < 24 hour data unless necessary
+- [ ] **Choose lowest granularity schema** - Start with ohlcv, not trades
+- [ ] **Minimize symbols** - Only request what you need
+- [ ] **Limit date range** - Test with small range first
+- [ ] **Use continuous contracts** - Avoid requesting multiple contract months
+- [ ] **Cache locally** - Don't re-fetch same data
+- [ ] **Consider DBEQ** - Use zero-fee dataset when applicable
+- [ ] **Filter by session** - Reduce volume if session-specific
+- [ ] **Use batch for large data** - More efficient for > 5GB
+
+## Cost Examples
+
+### Cheap Requests (< $1)
+
+```python
+# Daily bars for 1 year
+dataset="GLBX.MDP3"
+symbols="ES.c.0"
+schema="ohlcv-1d"
+start="2023-01-01"
+end="2023-12-31"
+# Estimated cost: $0.10
+```
+
+### Moderate Requests ($1-$10)
+
+```python
+# Hourly bars for 1 year
+dataset="GLBX.MDP3"
+symbols="ES.c.0,NQ.c.0"
+schema="ohlcv-1h"
+start="2023-01-01"
+end="2023-12-31"
+# Estimated cost: $2-5
+```
+
+### Expensive Requests ($10-$100)
+
+```python
+# Trades for 1 month
+dataset="GLBX.MDP3"
+symbols="ES.c.0"
+schema="trades"
+start="2024-01-01"
+end="2024-01-31"
+# Estimated cost: $20-50
+```
+
+### Very Expensive Requests ($100+)
+
+```python
+# MBP-10 for 1 month
+dataset="GLBX.MDP3"
+symbols="ES.c.0,NQ.c.0"
+schema="mbp-10"
+start="2024-01-01"
+end="2024-01-31"
+# Estimated cost: $200-500
+```
+
+## Free Credit Strategy
+
+Make the most of your $125 free credits:
+
+1. **Development Phase** - Use free credits for:
+   - Testing API integration
+   - Small-scale strategy development
+   - Learning the platform
+
+2. **Prioritize T+1 Data** - Stretch credits further:
+   - Avoid real-time data during development
+   - Use historical data (no licensing fees)
+
+3. **Start with OHLCV** - Cheapest data:
+   - Develop strategy with daily/hourly bars
+   - Validate before moving to tick data
+
+4. **Cache Everything** - Don't waste credits:
+   - Save all fetched data locally
+   - Reuse for multiple analyses
+
+5. **Monitor Remaining Balance**:
+   - Check credit usage regularly
+   - Adjust requests to stay within budget
+
+## Summary
+
+**Most Important Cost-Saving Strategies:**
+
+1. ✅ **Always check cost first** - Use metadata_get_cost
+2. ✅ **Use T+1 data** - 24+ hours old, no licensing fees
+3. ✅ **Start with OHLCV schemas** - Much cheaper than tick data
+4. ✅ **Cache and reuse data** - Don't fetch twice
+5. ✅ **Be selective with symbols** - Fewer symbols = lower cost
+6. ✅ **Test with small ranges** - Validate before large requests
+7. ✅ **Use continuous contracts** - One symbol instead of many
+8. ✅ **Monitor usage** - Track spending, set limits