Initial commit

skills/databento/SKILL.md (new file, 393 lines)

---
name: databento
description: Use when working with ES/NQ futures market data, before calling any Databento API - follow mandatory four-step workflow (cost check, availability check, fetch, validate); prevents costly API errors and ensures data quality
version: 1.0.1
triggers:
  - "ES futures"
  - "NQ futures"
  - "market data"
  - "databento"
  - "historical prices"
  - "order flow"
  - "mcp__databento"
---
|
||||
|
||||
# Databento - ES/NQ Futures Market Data Analysis
|
||||
|
||||
## Overview
|
||||
|
||||
Use the databento skill for ES/NQ futures analysis with the Databento market data platform. The skill provides immediate access to critical reference information (schemas, symbology, datasets) and reusable code patterns to eliminate repeated documentation lookups and API usage errors.
|
||||
|
||||
**Primary focus:** ES (E-mini S&P 500) and NQ (E-mini Nasdaq-100) futures analysis
|
||||
**Secondary focus:** Equity market breadth indicators when supporting futures analysis
|
||||
**Priority 1:** Knowledge and workflows to prevent wasted cycles
|
||||
**Priority 2:** Reusable scripts for common data operations
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Trigger this skill when:
|
||||
- User mentions ES, NQ, or futures analysis
|
||||
- User asks to fetch market data or historical prices
|
||||
- User wants to backtest a trading strategy
|
||||
- User asks about databento schemas, datasets, or symbology
|
||||
- User needs order flow or market microstructure analysis
|
||||
- About to use any `mcp__databento__*` MCP tool
|
||||
|
||||
## When NOT to Use This Skill
|
||||
|
||||
Don't use this skill for:
|
||||
- Real-time streaming data (use WebSocket connections directly, not REST API)
|
||||
- Options or spread analysis (limited support in current skill)
|
||||
- Non-CME futures exchanges (skill focuses on GLBX.MDP3 dataset)
|
||||
- Equities-only analysis (use equity-specific tools unless correlating with futures)
|
||||
- Data you already have cached (don't re-fetch repeatedly)
|
||||
|
||||
## The Four Steps (MANDATORY - NO EXCEPTIONS)
|
||||
|
||||
**You MUST complete each step before proceeding to the next. Skipping steps leads to wasted API calls, unexpected costs, or missing data.**
|
||||
|
||||
### Step 1: Check Cost BEFORE Fetching (REQUIRED)
|
||||
|
||||
**BEFORE any data fetch, estimate cost** using `mcp__databento__metadata_get_cost`.
|
||||
|
||||
Parameters needed:
|
||||
- dataset (e.g., "GLBX.MDP3")
|
||||
- start date (YYYY-MM-DD)
|
||||
- end date (optional)
|
||||
- symbols (e.g., "ES.c.0")
|
||||
- schema (e.g., "ohlcv-1h")
|
||||
|
||||
**Why:** Prevents unexpected charges and helps optimize data requests.
|
||||
|
||||
**Gate:** You cannot proceed to Step 3 (fetch) without completing this cost check.
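
A minimal sketch of how this gate can be enforced in an analysis script. `estimate_cost` is a hypothetical callable standing in for the `mcp__databento__metadata_get_cost` tool, and the budget threshold and returned `"cost"` key are assumptions.

```python
# Hypothetical wrapper around mcp__databento__metadata_get_cost; assumed to
# return a dict containing an estimated "cost" in USD.
def check_cost_gate(estimate_cost, request, budget_usd=5.00):
    """Return the cost estimate, or raise if it exceeds the budget."""
    estimate = estimate_cost(
        dataset=request["dataset"],
        start=request["start"],
        end=request.get("end"),
        symbols=request["symbols"],
        schema=request["schema"],
    )
    if estimate["cost"] > budget_usd:
        raise RuntimeError(
            f"Estimated cost ${estimate['cost']:.2f} exceeds budget ${budget_usd:.2f}; "
            "narrow the date range, symbols, or schema before fetching."
        )
    return estimate
```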
|
||||
|
||||
### Step 2: Validate Dataset Availability (REQUIRED)
|
||||
|
||||
Check that data exists for your requested date range using `mcp__databento__metadata_get_dataset_range`.
|
||||
|
||||
Parameters needed:
|
||||
- dataset (e.g., "GLBX.MDP3")
|
||||
|
||||
**Why:** Returns the available date range so you don't request data that doesn't exist.
|
||||
|
||||
**Gate:** If your requested date range is outside the available range, STOP and adjust your request.
|
||||
|
||||
### Step 3: Fetch Data Appropriately (REQUIRED)
|
||||
|
||||
Choose the right tool based on data size:
|
||||
|
||||
**For small/quick requests (< 5GB, typically < 1 day tick data):**
|
||||
- Use `mcp__databento__timeseries_get_range`
|
||||
- Default limit: 100 records (use limit parameter to adjust)
|
||||
- Returns data directly in response
|
||||
|
||||
**For large requests (> 5GB, multi-day tick data):**
|
||||
- Use `mcp__databento__batch_submit_job`
|
||||
- Poll status with `mcp__databento__batch_list_jobs`
|
||||
- Download with `mcp__databento__batch_download`
|
||||
|
||||
**Gate:** If fetch returns an error, DO NOT retry without checking Steps 1 and 2 first.
|
||||
|
||||
### Step 4: Validate Data Post-Fetch (REQUIRED)
|
||||
|
||||
After receiving data, always validate:
|
||||
- Check for timestamp gaps
|
||||
- Verify expected record counts
|
||||
- Validate price ranges (no negative prices, no extreme outliers)
|
||||
- Check for duplicate timestamps
|
||||
|
||||
Use `scripts/validate_data.py` for automated validation.
|
||||
|
||||
**Gate:** Do not proceed with analysis until validation passes.
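
A minimal sketch of the checks described above, written with pandas. It assumes an OHLCV DataFrame with a `ts_event` timestamp column and decimal `open/high/low/close` columns; the column names, gap heuristic, and tolerance are assumptions, not the exact behavior of `scripts/validate_data.py`.

```python
import pandas as pd

def validate_ohlcv(df: pd.DataFrame, expected_rows: int, bar: str = "1h") -> list[str]:
    """Return a list of data-quality problems (an empty list means validation passed)."""
    problems = []
    ts = pd.to_datetime(df["ts_event"], utc=True)
    if ts.duplicated().any():
        problems.append("duplicate timestamps found")
    gaps = ts.sort_values().diff().dropna()
    if (gaps > pd.Timedelta(bar) * 3).any():            # crude gap heuristic
        problems.append("timestamp gaps larger than 3 bars")
    if len(df) < expected_rows * 0.9:                    # 10% tolerance (assumption)
        problems.append(f"only {len(df)} of ~{expected_rows} expected records")
    if (df[["open", "high", "low", "close"]] <= 0).any().any():
        problems.append("non-positive prices found")
    return problems
```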
|
||||
|
||||
## Red Flags - STOP
|
||||
|
||||
If you catch yourself:
|
||||
- ❌ Fetching data without checking cost first
|
||||
- ❌ Assuming data exists for your date range without checking
|
||||
- ❌ Using `timeseries_get_range` for multi-day tick data (> 5GB)
|
||||
- ❌ Skipping post-fetch validation
|
||||
- ❌ Making multiple identical API calls (cache your data!)
|
||||
- ❌ Using wrong `stype_in` for continuous contracts
|
||||
- ❌ Requesting data in wrong date format (not YYYY-MM-DD)
|
||||
|
||||
**STOP. Return to The Four Steps. Follow them in order.**
|
||||
|
||||
## Verification Checklist
|
||||
|
||||
Before marking data work complete:
|
||||
|
||||
- [ ] Cost estimated and acceptable
|
||||
- [ ] Dataset availability confirmed for date range
|
||||
- [ ] Appropriate fetch method chosen (timeseries vs batch)
|
||||
- [ ] Data fetched successfully
|
||||
- [ ] Post-fetch validation passed (no gaps, valid prices, expected count)
|
||||
- [ ] Data cached locally (not fetching repeatedly)
|
||||
|
||||
Can't check all boxes? A step was skipped. Review The Four Steps above.
|
||||
|
||||
## Quick Reference: Essential Information
|
||||
|
||||
### Primary Dataset
|
||||
**GLBX.MDP3** - CME Globex MDP 3.0 (for ES/NQ futures)
|
||||
|
||||
### Common Schemas
|
||||
|
||||
| Schema | Description | When to Use | Typical Limit |
|
||||
|--------|-------------|-------------|---------------|
|
||||
| `ohlcv-1h` | 1-hour OHLCV bars | Multi-day backtesting | 100 bars |
|
||||
| `ohlcv-1d` | Daily OHLCV bars | Long-term analysis | 100 bars |
|
||||
| `trades` | Individual trades | Intraday analysis, order flow | Use batch for > 1 day |
|
||||
| `mbp-1` | Top of book (L1) | Bid/ask spread, microstructure | Use batch for > 1 day |
|
||||
| `mbp-10` | 10 levels of depth (L2) | Order book analysis | Use batch for > 1 day |
|
||||
|
||||
### ES/NQ Symbol Patterns
|
||||
|
||||
| Symbol | Description | Example Use Case |
|
||||
|--------|-------------|------------------|
|
||||
| `ES.c.0` | ES front month continuous (calendar roll) | Standard backtesting |
|
||||
| `NQ.c.0` | NQ front month continuous (calendar roll) | Standard backtesting |
|
||||
| `ES.n.0` | ES front month (open interest roll) | Avoiding roll timing issues |
|
||||
| `ESH5` | Specific contract (Mar 2025) | Analyzing specific expiration |
|
||||
| `ES.c.1` | ES second month continuous | Spread analysis |
|
||||
|
||||
**Roll Strategies:**
|
||||
- `.c.X` = Calendar-based roll (switches on fixed dates)
|
||||
- `.n.X` = Open interest-based roll (switches when OI moves)
|
||||
- `.v.X` = Volume-based roll (switches when volume moves)
|
||||
|
||||
### Common Symbology Types (stypes)
|
||||
|
||||
| Stype | Description | When to Use |
|
||||
|-------|-------------|-------------|
|
||||
| `raw_symbol` | Native exchange symbol | When you have exact contract codes |
|
||||
| `instrument_id` | Databento's numeric ID | After symbol resolution |
|
||||
| `continuous` | Continuous contract notation | For backtesting across rolls |
|
||||
| `parent` | Parent contract symbol | For options or complex instruments |
|
||||
|
||||
## MCP Tool Selection Guide
|
||||
|
||||
### For Current/Live Data
|
||||
|
||||
**Get current ES/NQ quote:**
|
||||
```
|
||||
mcp__databento__get_futures_quote
|
||||
- symbol: "ES" or "NQ"
|
||||
```
|
||||
|
||||
**Get current trading session:**
|
||||
```
|
||||
mcp__databento__get_session_info
|
||||
- timestamp: (optional, defaults to now)
|
||||
```
|
||||
|
||||
**Get recent historical bars:**
|
||||
```
|
||||
mcp__databento__get_historical_bars
|
||||
- symbol: "ES" or "NQ"
|
||||
- timeframe: "1h", "H4", or "1d"
|
||||
- count: number of bars (max 100)
|
||||
```
|
||||
|
||||
### For Historical Data Analysis
|
||||
|
||||
**Timeseries (< 5GB, direct response):**
|
||||
```
|
||||
mcp__databento__timeseries_get_range
|
||||
- dataset: "GLBX.MDP3"
|
||||
- symbols: "ES.c.0,NQ.c.0" (comma-separated, max 2000)
|
||||
- schema: "ohlcv-1h", "trades", "mbp-1", etc.
|
||||
- start: "2024-01-01" (YYYY-MM-DD or ISO 8601)
|
||||
- end: "2024-01-31" (optional)
|
||||
- limit: number of records (optional)
|
||||
```
|
||||
|
||||
**Batch Download (> 5GB, async processing):**
|
||||
```
|
||||
mcp__databento__batch_submit_job
|
||||
- dataset: "GLBX.MDP3"
|
||||
- symbols: ["ES.c.0", "NQ.c.0"] (array, max 2000)
|
||||
- schema: "trades", "mbp-1", etc.
|
||||
- start: "2024-01-01"
|
||||
- end: "2024-12-31"
|
||||
- encoding: "dbn" (native), "csv", or "json"
|
||||
- compression: "zstd" (default), "gzip", or "none"
|
||||
```
|
||||
|
||||
Then monitor with `mcp__databento__batch_list_jobs` and download with `mcp__databento__batch_download`.
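
A sketch of the submit → poll → download loop. The three callables are hypothetical stand-ins for the `mcp__databento__batch_*` tools, and details such as the `job["id"]` field, the `output_dir` argument, and the polling interval are assumptions.

```python
import time

def run_batch_job(submit_job, list_jobs, download, request, poll_seconds=30):
    """Submit a batch job, poll until it is done, then download the files."""
    job = submit_job(**request)                          # mcp__databento__batch_submit_job
    job_id = job["id"]                                   # assumed response field
    while True:
        done_ids = {j["id"] for j in list_jobs(states=["done"])}  # batch_list_jobs
        if job_id in done_ids:
            break
        time.sleep(poll_seconds)                         # avoid hammering the API
    return download(job_id=job_id, output_dir="./data")  # batch_download
```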
|
||||
|
||||
### For Symbol Resolution
|
||||
|
||||
**Resolve symbols between types:**
|
||||
```
|
||||
mcp__databento__symbology_resolve
|
||||
- dataset: "GLBX.MDP3"
|
||||
- symbols: ["ES.c.0", "NQ.c.0"]
|
||||
- stype_in: "continuous" (input type)
|
||||
- stype_out: "instrument_id" (output type)
|
||||
- start_date: "2024-01-01"
|
||||
- end_date: "2024-12-31" (optional)
|
||||
```
|
||||
|
||||
### For Metadata Discovery
|
||||
|
||||
**List available schemas:**
|
||||
```
|
||||
mcp__databento__metadata_list_schemas
|
||||
- dataset: "GLBX.MDP3"
|
||||
```
|
||||
|
||||
**Get dataset date range:**
|
||||
```
|
||||
mcp__databento__metadata_get_dataset_range
|
||||
- dataset: "GLBX.MDP3"
|
||||
```
|
||||
|
||||
**Estimate cost:**
|
||||
```
|
||||
mcp__databento__metadata_get_cost
|
||||
- dataset: "GLBX.MDP3"
|
||||
- start: "2024-01-01"
|
||||
- end: "2024-01-31" (optional)
|
||||
- symbols: "ES.c.0"
|
||||
- schema: "ohlcv-1h"
|
||||
```
|
||||
|
||||
## Analysis Workflow Patterns
|
||||
|
||||
### Historical Backtesting (OHLCV)
|
||||
1. Check cost for date range
|
||||
2. Fetch OHLCV data (1h, 4h, or 1d timeframe)
|
||||
3. Validate data completeness
|
||||
4. Perform analysis
|
||||
5. Consider using `scripts/fetch_ohlcv.py` for standard pattern
|
||||
|
||||
**Typical request:**
|
||||
- Schema: `ohlcv-1h` or `ohlcv-1d`
|
||||
- Symbols: `ES.c.0` or `NQ.c.0`
|
||||
- Limit: 100 bars per request (adjust as needed)
|
||||
|
||||
### Intraday Order Flow Analysis
|
||||
1. Check cost (important for tick data!)
|
||||
2. Use batch job for multi-day tick data
|
||||
3. Fetch trades or mbp-1 schema
|
||||
4. Filter by trading session if needed (use `scripts/session_filter.py`)
|
||||
5. Validate tick data completeness
|
||||
|
||||
**Typical request:**
|
||||
- Schema: `trades` or `mbp-1`
|
||||
- Use batch download for > 1 day of data
|
||||
- Consider session filtering for session-specific analysis
|
||||
|
||||
### Cross-Market Analysis (ES/NQ + Equities)
|
||||
1. Fetch ES/NQ data from GLBX.MDP3
|
||||
2. Fetch equity breadth from XNAS.ITCH (Nasdaq dataset)
|
||||
3. Align timestamps for correlation
|
||||
4. Perform cross-market analysis
|
||||
|
||||
**Datasets needed:**
|
||||
- GLBX.MDP3 (ES/NQ futures)
|
||||
- XNAS.ITCH (Nasdaq equities)
|
||||
|
||||
## Reference Files
|
||||
|
||||
Load these reference files as needed for detailed information:
|
||||
|
||||
### references/schemas.md
|
||||
Comprehensive field-level documentation for all schemas (trades, mbp-1, ohlcv).
|
||||
|
||||
**Load when:** Need to understand specific fields, data types, or schema structure.
|
||||
|
||||
### references/symbology.md
|
||||
Detailed symbology guide with continuous contracts, roll strategies, and expiration handling.
|
||||
|
||||
**Load when:** Working with continuous contracts, need to understand roll timing, or resolving symbol types.
|
||||
|
||||
### references/api-parameters.md
|
||||
Complete parameter reference for all MCP tools with enum values and format requirements.
|
||||
|
||||
**Load when:** Uncertain about parameter formats, enum values, or tool-specific requirements.
|
||||
|
||||
### references/cost-optimization.md
|
||||
Strategies for minimizing costs including T+1 data usage and batch optimization.
|
||||
|
||||
**Load when:** Working with large datasets or need to optimize data costs.
|
||||
|
||||
## Reusable Scripts
|
||||
|
||||
### scripts/fetch_ohlcv.py
|
||||
Standard pattern for fetching OHLCV data with built-in cost checks, error handling, and validation.
|
||||
|
||||
**Use when:** Fetching OHLCV bars for backtesting or analysis.
|
||||
|
||||
**Features:**
|
||||
- Automatic cost estimation before fetch
|
||||
- Error handling with retries
|
||||
- Post-fetch data validation
|
||||
- Export to CSV/pandas options
|
||||
|
||||
### scripts/validate_data.py
|
||||
Data quality validation to catch issues early.
|
||||
|
||||
**Use when:** After fetching any market data.
|
||||
|
||||
**Features:**
|
||||
- Timestamp gap detection
|
||||
- Record count verification
|
||||
- Price range validation
|
||||
- Summary quality report
|
||||
|
||||
### scripts/session_filter.py
|
||||
Filter data by trading session (Asian/London/NY).
|
||||
|
||||
**Use when:** Performing session-specific analysis.
|
||||
|
||||
**Features:**
|
||||
- Session detection using get_session_info
|
||||
- Historical data filtering by session
|
||||
- Session transition handling
|
||||
- Session-specific statistics
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always check cost first** - Prevents surprises and helps optimize requests
|
||||
2. **Use continuous contracts for backtesting** - Avoids roll gaps in analysis
|
||||
3. **Validate data quality** - Catch issues before running analysis
|
||||
4. **Use batch jobs for large data** - More efficient for > 5GB requests
|
||||
5. **Cache reusable data** - Don't re-fetch the same data repeatedly
|
||||
6. **Consider T+1 data** - Historical data (24+ hours old) has lower costs
|
||||
7. **Use appropriate schema** - Match schema granularity to analysis needs
|
||||
8. **Filter by session when relevant** - Session-based patterns are important for ES/NQ
|
||||
|
||||
---
|
||||
|
||||
## After Using This Skill
|
||||
|
||||
**REQUIRED NEXT STEPS:**
|
||||
|
||||
1. **Validate data quality** - Use verification checklist (Step 4) to confirm data integrity
|
||||
2. **Cache results** - Save fetched data locally to avoid redundant API calls and costs
|
||||
3. **Document assumptions** - Record roll strategy, schema choice, date range in analysis notes
|
||||
|
||||
**OPTIONAL NEXT STEPS:**
|
||||
|
||||
- **Cost tracking** - Log actual cost vs estimate for future budget planning
|
||||
- **Performance notes** - Document fetch time and data volume for optimization
|
||||
- **Quality metrics** - Track data completeness, gaps, or anomalies for future reference
|
||||
|
||||
---
|
||||
|
||||
## Changelog
|
||||
|
||||
**v1.0.1** (2025-11-14)
|
||||
- Added structured frontmatter with triggers list
|
||||
- Added "When NOT to Use" section
|
||||
- Strengthened "The Four Steps" with MANDATORY language and gates
|
||||
- Added "Red Flags - STOP" section
|
||||
- Added "Verification Checklist"
|
||||
- Improved description to follow superpowers pattern
|
||||
|
||||
**v1.0.0** (2025-11-06)
|
||||
- Initial databento skill creation
|
||||
- Comprehensive reference tables and MCP tool guide
|
||||
- Bundled resources (references and scripts)

skills/databento/references/api-parameters.md (new file, 541 lines)

# Databento API Parameters Reference
|
||||
|
||||
Complete parameter reference for all Databento MCP tools with accepted values, formats, and requirements.
|
||||
|
||||
## Date and Time Formats
|
||||
|
||||
### Date Format
|
||||
**Accepted formats:**
|
||||
- `YYYY-MM-DD` (e.g., "2024-01-15")
|
||||
- ISO 8601 with time (e.g., "2024-01-15T14:30:00Z")
|
||||
|
||||
**Important:**
|
||||
- Dates are in UTC timezone
|
||||
- Inclusive for `start`, exclusive for `end`
|
||||
- Time portion is optional
|
||||
|
||||
### Timestamp Format
|
||||
**Accepted formats:**
|
||||
- ISO 8601 string: "2024-01-15T14:30:00Z"
|
||||
- Unix timestamp (seconds): 1705329000
|
||||
- Unix timestamp (nanoseconds): 1705329000000000000
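
A small standard-library sketch showing how the three accepted forms relate, useful when converting a Python `datetime` into whichever form a parameter expects.

```python
from datetime import datetime, timezone

dt = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)

iso_8601 = dt.strftime("%Y-%m-%dT%H:%M:%SZ")         # "2024-01-15T14:30:00Z"
unix_seconds = int(dt.timestamp())                    # 1705329000
unix_nanoseconds = unix_seconds * 1_000_000_000       # 1705329000000000000
```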
|
||||
|
||||
## Schema Parameter
|
||||
|
||||
Valid schema values for historical data requests.
|
||||
|
||||
### OHLCV Schemas
|
||||
```
|
||||
"ohlcv-1s" # 1-second bars
|
||||
"ohlcv-1m" # 1-minute bars
|
||||
"ohlcv-1h" # 1-hour bars
|
||||
"ohlcv-1d" # Daily bars
|
||||
"ohlcv-eod" # End-of-day bars
|
||||
```
|
||||
|
||||
### Trade and Quote Schemas
|
||||
```
|
||||
"trades" # Individual trades
|
||||
"mbp-1" # Market by price - level 1 (top of book)
|
||||
"mbp-10" # Market by price - 10 levels of depth
|
||||
"mbo" # Market by order - level 3 (order-level)
|
||||
"tbbo" # Top of book best bid/offer
|
||||
```
|
||||
|
||||
### Metadata Schemas
|
||||
```
|
||||
"definition" # Instrument definitions and metadata
|
||||
"statistics" # Market statistics
|
||||
"status" # Trading status changes
|
||||
"imbalance" # Order imbalance data
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# timeseries_get_range
|
||||
schema="ohlcv-1h"
|
||||
|
||||
# batch_submit_job
|
||||
schema="trades"
|
||||
```
|
||||
|
||||
## Symbology Type (stype) Parameter
|
||||
|
||||
Used for symbol input and output format specification.
|
||||
|
||||
### stype_in (Input Symbol Type)
|
||||
|
||||
```
|
||||
"raw_symbol" # Native exchange symbols (ESH5, AAPL)
|
||||
"instrument_id" # Databento numeric IDs
|
||||
"continuous" # Continuous contracts (ES.c.0)
|
||||
"parent" # Parent symbols (ES, NQ)
|
||||
"nasdaq" # Nasdaq symbology
|
||||
"cms" # CMS symbology
|
||||
"bats" # BATS symbology
|
||||
"smart" # Smart routing symbols
|
||||
```
|
||||
|
||||
### stype_out (Output Symbol Type)
|
||||
|
||||
Same values as `stype_in`.
|
||||
|
||||
**Common Patterns:**
|
||||
```python
|
||||
# Continuous to instrument_id (most common)
|
||||
stype_in="continuous"
|
||||
stype_out="instrument_id"
|
||||
|
||||
# Raw symbol to instrument_id
|
||||
stype_in="raw_symbol"
|
||||
stype_out="instrument_id"
|
||||
|
||||
# Continuous to raw symbol (see current contract)
|
||||
stype_in="continuous"
|
||||
stype_out="raw_symbol"
|
||||
```
|
||||
|
||||
**Important:** Always match stype_in to your actual symbol format:
|
||||
- `"ES.c.0"` → stype_in="continuous"
|
||||
- `"ESH5"` → stype_in="raw_symbol"
|
||||
- `123456` → stype_in="instrument_id"
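
A small helper illustrating that matching; the classification rules below are heuristics inferred from the symbol formats shown above, not an official mapping.

```python
def guess_stype_in(symbol) -> str:
    """Heuristic: infer stype_in from how a symbol is written (e.g. 'ES.c.0' -> continuous)."""
    s = str(symbol)
    if s.isdigit():
        return "instrument_id"      # e.g. 123456
    parts = s.split(".")
    if len(parts) == 3 and parts[1] in {"c", "n", "v"} and parts[2].isdigit():
        return "continuous"         # e.g. ES.c.0, NQ.v.1
    return "raw_symbol"             # e.g. ESH5, AAPL
```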
|
||||
|
||||
## Dataset Parameter
|
||||
|
||||
Dataset codes identify the data source and venue.
|
||||
|
||||
### Common Datasets
|
||||
|
||||
**Futures (CME):**
|
||||
```
|
||||
"GLBX.MDP3" # CME Globex - ES, NQ, and other CME futures
|
||||
```
|
||||
|
||||
**Equities:**
|
||||
```
|
||||
"XNAS.ITCH" # Nasdaq - all Nasdaq-listed stocks
|
||||
"XNYS.PILLAR" # NYSE - NYSE-listed stocks
|
||||
"XCHI.PILLAR" # Chicago Stock Exchange
|
||||
"BATS.PITCH" # BATS exchange
|
||||
"IEXG.TOPS" # IEX exchange
|
||||
```
|
||||
|
||||
**Options:**
|
||||
```
|
||||
"OPRA.PILLAR" # US equity options
|
||||
```
|
||||
|
||||
**Databento Equities Bundle:**
```
"DBEQ.BASIC"  # Databento US equities bundle (subset of venues, zero exchange fees)
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# ES/NQ futures
|
||||
dataset="GLBX.MDP3"
|
||||
|
||||
# Nasdaq equities
|
||||
dataset="XNAS.ITCH"
|
||||
```
|
||||
|
||||
## Symbols Parameter
|
||||
|
||||
### Format Variations
|
||||
|
||||
**String (comma-separated):**
|
||||
```python
|
||||
symbols="ES.c.0,NQ.c.0,GC.c.0"
|
||||
```
|
||||
|
||||
**Array:**
|
||||
```python
|
||||
symbols=["ES.c.0", "NQ.c.0", "GC.c.0"]
|
||||
```
|
||||
|
||||
**Single symbol:**
|
||||
```python
|
||||
symbols="ES.c.0"
|
||||
# or
|
||||
symbols=["ES.c.0"]
|
||||
```
|
||||
|
||||
### Limits
|
||||
- Maximum: 2000 symbols per request
|
||||
- Must match stype_in format
|
||||
|
||||
### Symbol Wildcards
|
||||
|
||||
Some endpoints support wildcards:
|
||||
```
|
||||
"ES*" # All ES contracts
|
||||
"*" # All instruments (use with caution)
|
||||
```
|
||||
|
||||
## Encoding Parameter (Batch Jobs)
|
||||
|
||||
Output format for batch download jobs.
|
||||
|
||||
```
|
||||
"dbn" # Databento Binary (native format, most efficient)
|
||||
"csv" # Comma-separated values
|
||||
"json" # JSON format
|
||||
```
|
||||
|
||||
**Recommendations:**
|
||||
- `"dbn"` - Best for large datasets, fastest processing
|
||||
- `"csv"` - Good for spreadsheet analysis
|
||||
- `"json"` - Good for custom parsing, human-readable
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# batch_submit_job
|
||||
encoding="dbn"
|
||||
```
|
||||
|
||||
## Compression Parameter (Batch Jobs)
|
||||
|
||||
Compression algorithm for batch downloads.
|
||||
|
||||
```
|
||||
"zstd" # Zstandard (default, best compression)
|
||||
"gzip" # Gzip (widely supported)
|
||||
"none" # No compression
|
||||
```
|
||||
|
||||
**Recommendations:**
|
||||
- `"zstd"` - Best compression ratio, fastest
|
||||
- `"gzip"` - Good compatibility
|
||||
- `"none"` - Only for small datasets or testing
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# batch_submit_job
|
||||
compression="zstd"
|
||||
```
|
||||
|
||||
## Limit Parameter
|
||||
|
||||
Maximum number of records to return.
|
||||
|
||||
**Default:** 100 (varies by tool)
|
||||
**Maximum:** No hard limit, but consider:
|
||||
- Timeseries: practical limit ~10M records
|
||||
- Batch jobs: unlimited but affects processing time
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# timeseries_get_range
|
||||
limit=1000 # Return up to 1000 records
|
||||
```
|
||||
|
||||
**Important:** For large datasets, use batch jobs instead of increasing limit.
|
||||
|
||||
## Timeframe Parameter (get_historical_bars)
|
||||
|
||||
Specific to the `get_historical_bars` convenience tool.
|
||||
|
||||
```
|
||||
"1h" # 1-hour bars
|
||||
"H4" # 4-hour bars (alternative notation)
|
||||
"1d" # Daily bars
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# get_historical_bars (ES/NQ only)
|
||||
timeframe="1h"
|
||||
count=100
|
||||
```
|
||||
|
||||
## Symbol Parameter (get_futures_quote)
|
||||
|
||||
Specific to the `get_futures_quote` tool.
|
||||
|
||||
```
|
||||
"ES" # E-mini S&P 500
|
||||
"NQ" # E-mini Nasdaq-100
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# get_futures_quote
|
||||
symbol="ES"
|
||||
```
|
||||
|
||||
**Note:** Uses root symbol only, not full contract code.
|
||||
|
||||
## Split Parameters (Batch Jobs)
|
||||
|
||||
Control how batch job output files are split.
|
||||
|
||||
### split_duration
|
||||
```
|
||||
"day" # One file per day
|
||||
"week" # One file per week
|
||||
"month" # One file per month
|
||||
"none" # Single file (default)
|
||||
```
|
||||
|
||||
### split_size
|
||||
```
|
||||
split_size=1000000000 # Split at 1GB
|
||||
split_size=5000000000 # Split at 5GB
|
||||
```
|
||||
|
||||
### split_symbols
|
||||
```
|
||||
split_symbols=True # One file per symbol
|
||||
split_symbols=False # All symbols in same file (default)
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# batch_submit_job
|
||||
split_duration="day" # Daily files
|
||||
split_symbols=True # Separate file per symbol
|
||||
```
|
||||
|
||||
## Filter Parameters
|
||||
|
||||
### State Filter (list_jobs)
|
||||
```
|
||||
states=["received", "queued", "processing", "done", "expired"]
|
||||
```
|
||||
|
||||
### Time Filter (list_jobs)
|
||||
```
|
||||
since="2024-01-01T00:00:00Z" # Jobs since this timestamp
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# batch_list_jobs
|
||||
states=["done", "processing"]
|
||||
since="2024-01-01"
|
||||
```
|
||||
|
||||
## Mode Parameter (get_cost)
|
||||
|
||||
Query mode for cost estimation.
|
||||
|
||||
```
|
||||
"historical" # Historical data (default)
|
||||
"historical-streaming" # Streaming historical
|
||||
"live" # Live data
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# metadata_get_cost
|
||||
mode="historical"
|
||||
```
|
||||
|
||||
## Complete Parameter Examples
|
||||
|
||||
### timeseries_get_range
|
||||
```python
|
||||
{
|
||||
"dataset": "GLBX.MDP3",
|
||||
"symbols": "ES.c.0,NQ.c.0",
|
||||
"schema": "ohlcv-1h",
|
||||
"start": "2024-01-01",
|
||||
"end": "2024-01-31",
|
||||
"stype_in": "continuous",
|
||||
"stype_out": "instrument_id",
|
||||
"limit": 1000
|
||||
}
|
||||
```
|
||||
|
||||
### batch_submit_job
|
||||
```python
|
||||
{
|
||||
"dataset": "GLBX.MDP3",
|
||||
"symbols": ["ES.c.0", "NQ.c.0"],
|
||||
"schema": "trades",
|
||||
"start": "2024-01-01",
|
||||
"end": "2024-12-31",
|
||||
"stype_in": "continuous",
|
||||
"stype_out": "instrument_id",
|
||||
"encoding": "dbn",
|
||||
"compression": "zstd",
|
||||
"split_duration": "day",
|
||||
"split_symbols": False
|
||||
}
|
||||
```
|
||||
|
||||
### symbology_resolve
|
||||
```python
|
||||
{
|
||||
"dataset": "GLBX.MDP3",
|
||||
"symbols": ["ES.c.0", "NQ.c.0"],
|
||||
"stype_in": "continuous",
|
||||
"stype_out": "instrument_id",
|
||||
"start_date": "2024-01-01",
|
||||
"end_date": "2024-12-31"
|
||||
}
|
||||
```
|
||||
|
||||
### metadata_get_cost
|
||||
```python
|
||||
{
|
||||
"dataset": "GLBX.MDP3",
|
||||
"start": "2024-01-01",
|
||||
"end": "2024-01-31",
|
||||
"symbols": "ES.c.0",
|
||||
"schema": "ohlcv-1h",
|
||||
"stype_in": "continuous",
|
||||
"mode": "historical"
|
||||
}
|
||||
```
|
||||
|
||||
### get_futures_quote
|
||||
```python
|
||||
{
|
||||
"symbol": "ES" # or "NQ"
|
||||
}
|
||||
```
|
||||
|
||||
### get_session_info
|
||||
```python
|
||||
{
|
||||
"timestamp": "2024-01-15T14:30:00Z" # Optional
|
||||
}
|
||||
```
|
||||
|
||||
### get_historical_bars
|
||||
```python
|
||||
{
|
||||
"symbol": "ES", # or "NQ"
|
||||
"timeframe": "1h",
|
||||
"count": 100
|
||||
}
|
||||
```
|
||||
|
||||
## Common Parameter Mistakes
|
||||
|
||||
### 1. Wrong stype_in for Symbol Format
|
||||
**Wrong:**
|
||||
```python
|
||||
symbols="ES.c.0"
|
||||
stype_in="raw_symbol" # WRONG!
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```python
|
||||
symbols="ES.c.0"
|
||||
stype_in="continuous"
|
||||
```
|
||||
|
||||
### 2. Date Format Errors
|
||||
**Wrong:**
|
||||
```python
|
||||
start="01/15/2024" # US date format - WRONG
|
||||
start="15-01-2024" # Non-ISO format - WRONG
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```python
|
||||
start="2024-01-15" # ISO format - CORRECT
|
||||
```
|
||||
|
||||
### 3. Missing Required Parameters
|
||||
**Wrong:**
|
||||
```python
|
||||
# metadata_get_cost
|
||||
dataset="GLBX.MDP3"
|
||||
start="2024-01-01"
|
||||
# Missing symbols and schema!
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```python
|
||||
dataset="GLBX.MDP3"
|
||||
start="2024-01-01"
|
||||
symbols="ES.c.0"
|
||||
schema="ohlcv-1h"
|
||||
```
|
||||
|
||||
### 4. Schema Typos
|
||||
**Wrong:**
|
||||
```python
|
||||
schema="OHLCV-1H" # Wrong case
|
||||
schema="ohlcv-1hour" # Wrong format
|
||||
schema="ohlcv_1h" # Wrong separator
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```python
|
||||
schema="ohlcv-1h" # Lowercase, hyphenated
|
||||
```
|
||||
|
||||
### 5. Symbol Array vs String Confusion
|
||||
**Wrong:**
|
||||
```python
|
||||
# batch_submit_job expects array
|
||||
symbols="ES.c.0,NQ.c.0" # WRONG for batch jobs
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```python
|
||||
# batch_submit_job
|
||||
symbols=["ES.c.0", "NQ.c.0"] # CORRECT
|
||||
```
|
||||
|
||||
### 6. Encoding/Compression Not Strings
|
||||
**Wrong:**
|
||||
```python
|
||||
encoding=dbn # Not a string
|
||||
compression=zstd # Not a string
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```python
|
||||
encoding="dbn"
|
||||
compression="zstd"
|
||||
```
|
||||
|
||||
## Parameter Validation Checklist
|
||||
|
||||
Before making API calls, verify:
|
||||
|
||||
- [ ] Date format is YYYY-MM-DD or ISO 8601
|
||||
- [ ] Dataset matches your data source (GLBX.MDP3 for ES/NQ)
|
||||
- [ ] Schema is valid and lowercase
|
||||
- [ ] stype_in matches symbol format
|
||||
- [ ] Symbols parameter matches tool expectation (string vs array)
|
||||
- [ ] All required parameters are present
|
||||
- [ ] Enum values are exact strings (case-sensitive)
|
||||
- [ ] start_date <= end_date
|
||||
- [ ] limit is reasonable for dataset size
|
||||
|
||||
## Quick Reference: Required Parameters
|
||||
|
||||
### timeseries_get_range
|
||||
**Required:** dataset, symbols, schema, start
|
||||
|
||||
**Optional:** end, stype_in, stype_out, limit
|
||||
|
||||
### batch_submit_job
|
||||
**Required:** dataset, symbols, schema, start
|
||||
|
||||
**Optional:** end, stype_in, stype_out, encoding, compression, split_duration, split_size, split_symbols, limit
|
||||
|
||||
### symbology_resolve
|
||||
**Required:** dataset, symbols, stype_in, stype_out, start_date
|
||||
|
||||
**Optional:** end_date
|
||||
|
||||
### metadata_get_cost
|
||||
**Required:** dataset, start
|
||||
|
||||
**Optional:** end, symbols, schema, stype_in, mode
|
||||
|
||||
### get_futures_quote
|
||||
**Required:** symbol
|
||||
|
||||
### get_session_info
|
||||
**Optional:** timestamp
|
||||
|
||||
### get_historical_bars
|
||||
**Required:** symbol, timeframe, count

skills/databento/references/cost-optimization.md (new file, 501 lines)

# Databento Cost Optimization Guide
|
||||
|
||||
Strategies and best practices for minimizing costs when working with Databento market data.
|
||||
|
||||
## Databento Pricing Model
|
||||
|
||||
### Cost Components
|
||||
|
||||
1. **Databento Usage Fees** - Pay-per-use or subscription
|
||||
2. **Exchange License Fees** - Venue-dependent (varies by exchange)
|
||||
3. **Data Volume** - Amount of data retrieved
|
||||
|
||||
### Pricing Tiers
|
||||
|
||||
**Free Credits:**
|
||||
- $125 free credits for new users
|
||||
- Good for initial development and testing
|
||||
|
||||
**Usage-Based:**
|
||||
- Pay only for data you use
|
||||
- Varies by venue and data type
|
||||
- No minimum commitment
|
||||
|
||||
**Subscriptions:**
|
||||
- Basic Plan: $199/month
|
||||
- Corporate Actions/Security Master: $299/month
|
||||
- Flat-rate access to specific datasets
|
||||
|
||||
## Cost Estimation (ALWAYS Do This First)
|
||||
|
||||
### Use metadata_get_cost Before Every Request
|
||||
|
||||
**Always** estimate cost before fetching data:
|
||||
|
||||
```python
|
||||
mcp__databento__metadata_get_cost(
|
||||
dataset="GLBX.MDP3",
|
||||
start="2024-01-01",
|
||||
end="2024-01-31",
|
||||
symbols="ES.c.0",
|
||||
schema="ohlcv-1h"
|
||||
)
|
||||
```
|
||||
|
||||
**Returns:**
|
||||
- Estimated cost in USD
|
||||
- Data size estimate
|
||||
- Helps decide if request is reasonable
|
||||
|
||||
### When Cost Checks Matter Most
|
||||
|
||||
1. **Multi-day tick data** - Can be expensive
|
||||
2. **Multiple symbols** - Costs multiply
|
||||
3. **High-granularity schemas** - trades, mbp-1, mbo
|
||||
4. **Long date ranges** - Weeks or months of data
|
||||
|
||||
**Example Cost Check:**
|
||||
```python
|
||||
# Cheap: 1 month of daily bars
|
||||
cost_check(schema="ohlcv-1d", start="2024-01-01", end="2024-01-31")
|
||||
# Estimated: $0.10
|
||||
|
||||
# Expensive: 1 month of tick trades
|
||||
cost_check(schema="trades", start="2024-01-01", end="2024-01-31")
|
||||
# Estimated: $50-$200 (depends on volume)
|
||||
```
|
||||
|
||||
## Historical Data (T+1) - No Licensing Required
|
||||
|
||||
**Key Insight:** Historical data that is **24+ hours old (T+1)** does not require exchange licensing fees.
|
||||
|
||||
### Cost Breakdown
|
||||
|
||||
**Live/Recent Data (< 24 hours):**
|
||||
- Databento fees + Exchange licensing fees
|
||||
|
||||
**Historical Data (24+ hours old):**
|
||||
- Databento fees only (no exchange licensing)
|
||||
- Significantly cheaper
|
||||
|
||||
### Optimization Strategy
|
||||
|
||||
**For Development:**
|
||||
- Use T+1 data for strategy development
|
||||
- Switch to live data only for production
|
||||
|
||||
**For Backtesting:**
|
||||
- Always use historical (T+1) data
|
||||
- Much more cost-effective
|
||||
- Same data quality
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
# Expensive: Yesterday's data (< 24 hours)
|
||||
start="2024-11-05" # Requires licensing
|
||||
|
||||
# Cheap: 3 days ago (> 24 hours)
|
||||
start="2024-11-03" # No licensing required
|
||||
```
|
||||
|
||||
## Schema Selection for Cost
|
||||
|
||||
Different schemas have vastly different costs due to data volume.
|
||||
|
||||
### Schema Cost Hierarchy (Cheapest to Most Expensive)
|
||||
|
||||
1. **ohlcv-1d** (Cheapest)
|
||||
- ~100 bytes per record
|
||||
- ~250 records per symbol per year
|
||||
- **Best for:** Long-term backtesting
|
||||
|
||||
2. **ohlcv-1h**
|
||||
- ~100 bytes per record
|
||||
- ~6,000 records per symbol per year
|
||||
- **Best for:** Multi-day backtesting
|
||||
|
||||
3. **ohlcv-1m**
|
||||
- ~100 bytes per record
|
||||
- ~360,000 records per symbol per year
|
||||
- **Best for:** Intraday strategies
|
||||
|
||||
4. **trades**
|
||||
- ~50 bytes per record
|
||||
- ~100K-500K records per symbol per day (ES/NQ)
|
||||
- **Best for:** Tick analysis (use selectively)
|
||||
|
||||
5. **mbp-1**
|
||||
- ~150 bytes per record
|
||||
- ~1M-5M records per symbol per day
|
||||
- **Best for:** Order flow analysis (use selectively)
|
||||
|
||||
6. **mbp-10**
|
||||
- ~500 bytes per record
|
||||
- ~1M-5M records per symbol per day
|
||||
- **Best for:** Deep order book analysis (expensive!)
|
||||
|
||||
7. **mbo** (Most Expensive)
|
||||
- ~80 bytes per record
|
||||
- ~5M-20M records per symbol per day
|
||||
- **Best for:** Order-level research (very expensive!)
|
||||
|
||||
### Cost Optimization Strategy
|
||||
|
||||
**Start with lower granularity:**
|
||||
1. Develop strategy with ohlcv-1h or ohlcv-1d
|
||||
2. Validate with ohlcv-1m if needed
|
||||
3. Only use trades/mbp-1 if absolutely necessary
|
||||
4. Avoid mbp-10/mbo unless essential
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
# Cheap: Daily bars for 1 year
|
||||
schema="ohlcv-1d"
|
||||
start="2023-01-01"
|
||||
end="2023-12-31"
|
||||
# Cost: < $1
|
||||
|
||||
# Expensive: Trades for 1 year
|
||||
schema="trades"
|
||||
start="2023-01-01"
|
||||
end="2023-12-31"
|
||||
# Cost: $500-$2000 (depending on venue)
|
||||
```
|
||||
|
||||
## Symbol Selection
|
||||
|
||||
Fewer symbols = lower cost. Be selective.
|
||||
|
||||
### Strategies
|
||||
|
||||
**1. Start with Single Symbol**
|
||||
```python
|
||||
# Development
|
||||
symbols="ES.c.0" # Just ES
|
||||
|
||||
# After validation, expand
|
||||
symbols="ES.c.0,NQ.c.0" # Add NQ
|
||||
```
|
||||
|
||||
**2. Use Continuous Contracts**
|
||||
```python
|
||||
# Good: Single continuous contract
|
||||
symbols="ES.c.0" # Covers all front months
|
||||
|
||||
# Wasteful: Multiple specific contracts
|
||||
symbols="ESH5,ESM5,ESU5,ESZ5" # Same data, 4x cost
|
||||
```
|
||||
|
||||
**3. Avoid Symbol Wildcards**
|
||||
```python
|
||||
# Expensive: All instruments
|
||||
symbols="*" # Don't do this!
|
||||
|
||||
# Targeted: Just what you need
|
||||
symbols="ES.c.0,NQ.c.0" # Explicit
|
||||
```
|
||||
|
||||
## Date Range Optimization
|
||||
|
||||
Request only the data you need.
|
||||
|
||||
### Strategies
|
||||
|
||||
**1. Iterative Refinement**
|
||||
```python
|
||||
# First: Test with small range
|
||||
start="2024-01-01"
|
||||
end="2024-01-07" # Just 1 week
|
||||
|
||||
# Then: Expand after validation
|
||||
start="2024-01-01"
|
||||
end="2024-12-31" # Full year
|
||||
```
|
||||
|
||||
**2. Segment Long Ranges**
|
||||
```python
|
||||
# Instead of: 5 years at once
|
||||
start="2019-01-01"
|
||||
end="2024-12-31"
|
||||
|
||||
# Do: Segment by year
|
||||
start="2024-01-01"
|
||||
end="2024-12-31"
|
||||
# Process, then request next year if needed
|
||||
```
|
||||
|
||||
**3. Use Limit for Testing**
|
||||
```python
|
||||
# Test with small limit first
|
||||
limit=100 # Just 100 records
|
||||
|
||||
# After validation, increase or remove
|
||||
limit=10000 # Larger sample
|
||||
```
|
||||
|
||||
## Batch vs Timeseries Selection
|
||||
|
||||
Choose the right tool for the job.
|
||||
|
||||
### Timeseries (< 5GB)
|
||||
**When to use:**
|
||||
- Small to medium datasets
|
||||
- Quick exploration
|
||||
- <= 1 day of tick data
|
||||
- Any OHLCV data
|
||||
|
||||
**Benefits:**
|
||||
- Immediate results
|
||||
- No job management
|
||||
- Direct response
|
||||
|
||||
**Costs:**
|
||||
- Same per-record cost as batch
|
||||
|
||||
### Batch Downloads (> 5GB)
|
||||
**When to use:**
|
||||
- Large datasets (> 5GB)
|
||||
- Multi-day tick data
|
||||
- Multiple symbols over long periods
|
||||
- Production data pipelines
|
||||
|
||||
**Benefits:**
|
||||
- More efficient for large data
|
||||
- Can split output files
|
||||
- Asynchronous processing
|
||||
|
||||
**Costs:**
|
||||
- Same per-record cost as timeseries
|
||||
- No additional fees for batch processing
|
||||
|
||||
### Decision Matrix
|
||||
|
||||
| Data Type | Date Range | Method |
|
||||
|-----------|-----------|--------|
|
||||
| ohlcv-1h | 1 year | Timeseries |
|
||||
| ohlcv-1d | Any | Timeseries |
|
||||
| trades | 1 day | Timeseries |
|
||||
| trades | 1 week+ | Batch |
|
||||
| mbp-1 | 1 day | Batch (safer) |
|
||||
| mbp-1 | 1 week+ | Batch |
|
||||
|
||||
## DBEQ Bundle - Zero Exchange Fees
|
||||
|
||||
Databento offers a special bundle for US equities with **$0 exchange fees**.
|
||||
|
||||
### DBEQ.BASIC Dataset
|
||||
|
||||
**Coverage:**
|
||||
- US equity securities
|
||||
- Zero licensing fees
|
||||
- Databento usage fees only
|
||||
|
||||
**When to use:**
|
||||
- Equity market breadth for ES/NQ analysis
|
||||
- Testing equity strategies
|
||||
- Learning market data APIs
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
# Regular equity dataset (has exchange fees)
|
||||
dataset="XNAS.ITCH"
|
||||
# Cost: Databento + Nasdaq fees
|
||||
|
||||
# DBEQ bundle (no exchange fees)
|
||||
dataset="DBEQ.BASIC"
|
||||
# Cost: Databento fees only
|
||||
```
|
||||
|
||||
## Caching and Reuse
|
||||
|
||||
Don't fetch the same data multiple times.
|
||||
|
||||
### Strategies
|
||||
|
||||
**1. Cache Locally**
|
||||
```python
|
||||
# First request: Fetch and save
|
||||
data = fetch_data(...)
|
||||
save_to_disk(data, "ES_2024_ohlcv1h.csv")
|
||||
|
||||
# Subsequent runs: Load from disk
|
||||
data = load_from_disk("ES_2024_ohlcv1h.csv")
|
||||
```
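
A slightly more concrete version of the same idea using pandas. The cache path layout, the `ts_event` column name, and the `fetch_ohlcv` callable are assumptions; any fetch function that returns a DataFrame works.

```python
from pathlib import Path
import pandas as pd

def fetch_with_cache(fetch_ohlcv, symbol, start, end, cache_dir="cache"):
    """Load from a local CSV if present; otherwise fetch once and save it."""
    path = Path(cache_dir) / f"{symbol}_{start}_{end}_ohlcv1h.csv"
    if path.exists():
        return pd.read_csv(path, parse_dates=["ts_event"])
    path.parent.mkdir(parents=True, exist_ok=True)
    df = fetch_ohlcv(symbol=symbol, start=start, end=end)   # hits the API once
    df.to_csv(path, index=False)
    return df
```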
|
||||
|
||||
**2. Incremental Updates**
|
||||
```python
|
||||
# Initial: Fetch full history
|
||||
start="2023-01-01"
|
||||
end="2024-01-01"
|
||||
|
||||
# Later: Fetch only new data
|
||||
start="2024-01-01" # Resume from last fetch
|
||||
end="2024-12-31"
|
||||
```
|
||||
|
||||
**3. Share Data Across Analyses**
|
||||
```python
|
||||
# Fetch once
|
||||
historical_data = fetch_data(schema="ohlcv-1h", ...)
|
||||
|
||||
# Use multiple times
|
||||
backtest_strategy_a(historical_data)
|
||||
backtest_strategy_b(historical_data)
|
||||
backtest_strategy_c(historical_data)
|
||||
```
|
||||
|
||||
## Session-Based Analysis
|
||||
|
||||
For ES/NQ, consider filtering by trading session to reduce data volume.
|
||||
|
||||
### Sessions
|
||||
|
||||
- **Asian Session:** 6pm-2am ET
|
||||
- **London Session:** 2am-8am ET
|
||||
- **New York Session:** 8am-4pm ET
|
||||
|
||||
### Cost Benefit
|
||||
|
||||
**Full 24-hour data:**
|
||||
- Maximum data volume
|
||||
- Higher cost
|
||||
|
||||
**Session-filtered data:**
|
||||
- 1/3 to 1/2 the volume
|
||||
- Lower cost
|
||||
- May be sufficient for analysis
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
# Expensive: Full 24-hour data
|
||||
# Process all records
|
||||
|
||||
# Cheaper: NY session only
|
||||
# Filter records to 8am-4pm ET
|
||||
# ~1/3 the data volume
|
||||
```
|
||||
|
||||
Use `scripts/session_filter.py` to filter post-fetch, or request only specific hours.
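
A sketch of post-fetch session filtering with pandas, assuming a DataFrame with nanosecond `ts_event` timestamps. The session boundaries follow the ET times listed above; this is not the exact logic of `scripts/session_filter.py`.

```python
import pandas as pd

def filter_ny_session(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only records in the New York session (8am-4pm ET)."""
    et = (
        pd.to_datetime(df["ts_event"], utc=True)   # nanosecond ints -> UTC timestamps
        .dt.tz_convert("America/New_York")
    )
    mask = (et.dt.hour >= 8) & (et.dt.hour < 16)
    return df[mask]
```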
|
||||
|
||||
## Monitoring Usage
|
||||
|
||||
Track your usage to avoid surprises.
|
||||
|
||||
### Check Dashboard
|
||||
- Databento provides usage dashboard
|
||||
- Monitor monthly spend
|
||||
- Set alerts for limits
|
||||
|
||||
### Set Monthly Limits
Set a monthly usage limit (for example, $500) in the Databento account settings dashboard. This is an account-level control, not an API parameter.
|
||||
|
||||
### Review Costs Regularly
|
||||
- Check cost estimates vs actual
|
||||
- Identify expensive queries
|
||||
- Adjust strategies
|
||||
|
||||
## Cost Optimization Checklist
|
||||
|
||||
Before every data request:
|
||||
|
||||
- [ ] **Estimate cost first** - Use metadata_get_cost
|
||||
- [ ] **Use T+1 data** - Avoid < 24 hour data unless necessary
|
||||
- [ ] **Choose lowest granularity schema** - Start with ohlcv, not trades
|
||||
- [ ] **Minimize symbols** - Only request what you need
|
||||
- [ ] **Limit date range** - Test with small range first
|
||||
- [ ] **Use continuous contracts** - Avoid requesting multiple months
|
||||
- [ ] **Cache locally** - Don't re-fetch same data
|
||||
- [ ] **Consider DBEQ** - Use zero-fee dataset when applicable
|
||||
- [ ] **Filter by session** - Reduce volume if session-specific
|
||||
- [ ] **Use batch for large data** - More efficient for > 5GB
|
||||
|
||||
## Cost Examples
|
||||
|
||||
### Cheap Requests (< $1)
|
||||
|
||||
```python
|
||||
# Daily bars for 1 year
|
||||
dataset="GLBX.MDP3"
|
||||
symbols="ES.c.0"
|
||||
schema="ohlcv-1d"
|
||||
start="2023-01-01"
|
||||
end="2023-12-31"
|
||||
# Estimated cost: $0.10
|
||||
```
|
||||
|
||||
### Moderate Requests ($1-$10)
|
||||
|
||||
```python
|
||||
# Hourly bars for 1 year
|
||||
dataset="GLBX.MDP3"
|
||||
symbols="ES.c.0,NQ.c.0"
|
||||
schema="ohlcv-1h"
|
||||
start="2023-01-01"
|
||||
end="2023-12-31"
|
||||
# Estimated cost: $2-5
|
||||
```
|
||||
|
||||
### Expensive Requests ($10-$100)
|
||||
|
||||
```python
|
||||
# Trades for 1 month
|
||||
dataset="GLBX.MDP3"
|
||||
symbols="ES.c.0"
|
||||
schema="trades"
|
||||
start="2024-01-01"
|
||||
end="2024-01-31"
|
||||
# Estimated cost: $20-50
|
||||
```
|
||||
|
||||
### Very Expensive Requests ($100+)
|
||||
|
||||
```python
|
||||
# MBP-10 for 1 month
|
||||
dataset="GLBX.MDP3"
|
||||
symbols="ES.c.0,NQ.c.0"
|
||||
schema="mbp-10"
|
||||
start="2024-01-01"
|
||||
end="2024-01-31"
|
||||
# Estimated cost: $200-500
|
||||
```
|
||||
|
||||
## Free Credit Strategy
|
||||
|
||||
Make the most of your $125 free credits:
|
||||
|
||||
1. **Development Phase** - Use free credits for:
|
||||
- Testing API integration
|
||||
- Small-scale strategy development
|
||||
- Learning the platform
|
||||
|
||||
2. **Prioritize T+1 Data** - Stretch credits further:
|
||||
- Avoid real-time data during development
|
||||
- Use historical data (no licensing fees)
|
||||
|
||||
3. **Start with OHLCV** - Cheapest data:
|
||||
- Develop strategy with daily/hourly bars
|
||||
- Validate before moving to tick data
|
||||
|
||||
4. **Cache Everything** - Don't waste credits:
|
||||
- Save all fetched data locally
|
||||
- Reuse for multiple analyses
|
||||
|
||||
5. **Monitor Remaining Balance**:
|
||||
- Check credit usage regularly
|
||||
- Adjust requests to stay within budget
|
||||
|
||||
## Summary
|
||||
|
||||
**Most Important Cost-Saving Strategies:**
|
||||
|
||||
1. ✅ **Always check cost first** - Use metadata_get_cost
|
||||
2. ✅ **Use T+1 data** - 24+ hours old, no licensing fees
|
||||
3. ✅ **Start with OHLCV schemas** - Much cheaper than tick data
|
||||
4. ✅ **Cache and reuse data** - Don't fetch twice
|
||||
5. ✅ **Be selective with symbols** - Fewer symbols = lower cost
|
||||
6. ✅ **Test with small ranges** - Validate before large requests
|
||||
7. ✅ **Use continuous contracts** - One symbol instead of many
|
||||
8. ✅ **Monitor usage** - Track spending, set limits

skills/databento/references/schemas.md (new file, 372 lines)

# Databento Schema Reference
|
||||
|
||||
Comprehensive documentation of Databento schemas with field-level details, data types, and usage guidance.
|
||||
|
||||
## Schema Overview
|
||||
|
||||
Databento provides 12+ schema types representing different granularity levels of market data. All schemas share common timestamp fields for consistency.
|
||||
|
||||
## Common Fields (All Schemas)
|
||||
|
||||
Every schema includes these timestamp fields:
|
||||
|
||||
| Field | Type | Description | Unit |
|
||||
|-------|------|-------------|------|
|
||||
| `ts_event` | uint64 | Event timestamp from venue | Nanoseconds (Unix epoch) |
|
||||
| `ts_recv` | uint64 | Databento gateway receipt time | Nanoseconds (Unix epoch) |
|
||||
|
||||
**Important:** Databento provides up to 4 timestamps per event for sub-microsecond accuracy.
|
||||
|
||||
## OHLCV Schemas
|
||||
|
||||
Candlestick/bar data at various time intervals.
|
||||
|
||||
### ohlcv-1s (1 Second Bars)
|
||||
### ohlcv-1m (1 Minute Bars)
|
||||
### ohlcv-1h (1 Hour Bars)
|
||||
### ohlcv-1d (Daily Bars)
|
||||
### ohlcv-eod (End of Day)
|
||||
|
||||
**Common OHLCV Fields:**
|
||||
|
||||
| Field | Type | Description | Unit |
|
||||
|-------|------|-------------|------|
|
||||
| `open` | int64 | Opening price | Fixed-point (divide by 1e9 for decimal) |
|
||||
| `high` | int64 | Highest price | Fixed-point (divide by 1e9 for decimal) |
|
||||
| `low` | int64 | Lowest price | Fixed-point (divide by 1e9 for decimal) |
|
||||
| `close` | int64 | Closing price | Fixed-point (divide by 1e9 for decimal) |
|
||||
| `volume` | uint64 | Total volume | Contracts/shares |
|
||||
|
||||
**When to Use:**
|
||||
- **1h/1d**: Historical backtesting, multi-day analysis
|
||||
- **1m**: Intraday strategy development
|
||||
- **1s**: High-frequency analysis (use batch for large ranges)
|
||||
- **eod**: Long-term investment analysis
|
||||
|
||||
**Pricing Format:**
|
||||
Prices are in fixed-point notation. To convert to decimal:
|
||||
```
|
||||
decimal_price = int64_price / 1_000_000_000
|
||||
```
|
||||
|
||||
For ES futures at 4500.00, the value would be stored as `4500000000000`.
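
A short sketch converting a raw OHLCV record's fixed-point fields to decimals. Representing the decoded row as a plain dict is an assumption made for illustration.

```python
PRICE_SCALE = 1_000_000_000  # fixed-point prices carry 9 decimal places

def decode_ohlcv(record: dict) -> dict:
    """Convert fixed-point int64 prices in one OHLCV record to decimal floats."""
    out = dict(record)
    for field in ("open", "high", "low", "close"):
        out[field] = record[field] / PRICE_SCALE
    return out

# Example: an ES bar stored at 4500000000000 decodes to 4500.0
print(decode_ohlcv({"open": 4500000000000, "high": 4500250000000,
                    "low": 4499750000000, "close": 4500000000000, "volume": 1200}))
```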
|
||||
|
||||
## Trades Schema
|
||||
|
||||
Individual trade executions with price, size, and side information.
|
||||
|
||||
| Field | Type | Description | Values |
|
||||
|-------|------|-------------|--------|
|
||||
| `price` | int64 | Trade execution price | Fixed-point (÷ 1e9) |
|
||||
| `size` | uint32 | Trade size | Contracts/shares |
|
||||
| `action` | char | Trade action | 'T' = trade, 'C' = cancel |
|
||||
| `side` | char | Aggressor side | 'B' = buy, 'S' = sell, 'N' = none |
|
||||
| `flags` | uint8 | Trade flags | Bitmask |
|
||||
| `depth` | uint8 | Depth level | Usually 0 |
|
||||
| `ts_in_delta` | int32 | Time delta | Nanoseconds |
|
||||
| `sequence` | uint32 | Sequence number | Venue-specific |
|
||||
|
||||
**When to Use:**
|
||||
- Intraday order flow analysis
|
||||
- Tick-by-tick backtesting
|
||||
- Market microstructure research
|
||||
- Volume profile analysis
|
||||
|
||||
**Aggressor Side:**
|
||||
- `B` = Buy-side aggressor (market buy hit the ask)
|
||||
- `S` = Sell-side aggressor (market sell hit the bid)
|
||||
- `N` = Cannot be determined or not applicable
|
||||
|
||||
**Important:** For multi-day tick data, use batch downloads. Trades can generate millions of records per day.
|
||||
|
||||
## MBP-1 Schema (Market By Price - Top of Book)
|
||||
|
||||
Level 1 order book data showing best bid and ask.
|
||||
|
||||
| Field | Type | Description | Values |
|
||||
|-------|------|-------------|--------|
|
||||
| `price` | int64 | Reference price (usually last trade) | Fixed-point (÷ 1e9) |
|
||||
| `size` | uint32 | Reference size | Contracts/shares |
|
||||
| `action` | char | Book action | 'A' = add, 'C' = cancel, 'M' = modify, 'T' = trade |
|
||||
| `side` | char | Order side | 'B' = bid, 'A' = ask, 'N' = none |
|
||||
| `flags` | uint8 | Flags | Bitmask |
|
||||
| `depth` | uint8 | Depth level | Always 0 for MBP-1 |
|
||||
| `ts_in_delta` | int32 | Time delta | Nanoseconds |
|
||||
| `sequence` | uint32 | Sequence number | Venue-specific |
|
||||
| `bid_px_00` | int64 | Best bid price | Fixed-point (÷ 1e9) |
|
||||
| `ask_px_00` | int64 | Best ask price | Fixed-point (÷ 1e9) |
|
||||
| `bid_sz_00` | uint32 | Best bid size | Contracts/shares |
|
||||
| `ask_sz_00` | uint32 | Best ask size | Contracts/shares |
|
||||
| `bid_ct_00` | uint32 | Bid order count | Number of orders |
|
||||
| `ask_ct_00` | uint32 | Ask order count | Number of orders |
|
||||
|
||||
**When to Use:**
|
||||
- Bid/ask spread analysis
|
||||
- Liquidity analysis
|
||||
- Market microstructure studies
|
||||
- Quote-based strategies
|
||||
|
||||
**Key Metrics:**
|
||||
```
|
||||
spread = ask_px_00 - bid_px_00
|
||||
mid_price = (bid_px_00 + ask_px_00) / 2
|
||||
bid_ask_imbalance = (bid_sz_00 - ask_sz_00) / (bid_sz_00 + ask_sz_00)
|
||||
```
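
The same metrics expressed as a small Python helper, assuming the `_00` fields have already been converted from fixed-point to decimal prices.

```python
def top_of_book_metrics(bid_px, ask_px, bid_sz, ask_sz):
    """Spread, mid price, and size imbalance from the best bid/ask (decimal prices)."""
    spread = ask_px - bid_px
    mid_price = (bid_px + ask_px) / 2
    imbalance = (bid_sz - ask_sz) / (bid_sz + ask_sz) if (bid_sz + ask_sz) else 0.0
    return {"spread": spread, "mid_price": mid_price, "bid_ask_imbalance": imbalance}

# Example: ES quoted 4500.00 x 4500.25, 40 contracts bid vs 25 offered
print(top_of_book_metrics(4500.00, 4500.25, 40, 25))
```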
|
||||
|
||||
## MBP-10 Schema (Market By Price - 10 Levels)
|
||||
|
||||
Level 2 order book data showing 10 levels of depth.
|
||||
|
||||
**Fields:** Same as MBP-1, plus nine additional depth levels (suffixes `_01` through `_09`, giving 10 levels in total with the `_00` fields):
- `bid_px_01` through `bid_px_09` and `ask_px_01` through `ask_px_09` (prices)
- `bid_sz_01` through `bid_sz_09` and `ask_sz_01` through `ask_sz_09` (sizes)
- `bid_ct_01` through `bid_ct_09` and `ask_ct_01` through `ask_ct_09` (order counts)
|
||||
|
||||
**When to Use:**
|
||||
- Order book depth analysis
|
||||
- Liquidity beyond top of book
|
||||
- Order flow imbalance at multiple levels
|
||||
- Market impact modeling
|
||||
|
||||
**Important:** MBP-10 generates significantly more data than MBP-1. Use batch downloads for multi-day requests.
|
||||
|
||||
## MBO Schema (Market By Order)
|
||||
|
||||
Level 3 order-level data with individual order IDs - most granular.
|
||||
|
||||
| Field | Type | Description | Values |
|
||||
|-------|------|-------------|--------|
|
||||
| `order_id` | uint64 | Unique order ID | Venue-specific |
|
||||
| `price` | int64 | Order price | Fixed-point (÷ 1e9) |
|
||||
| `size` | uint32 | Order size | Contracts/shares |
|
||||
| `flags` | uint8 | Flags | Bitmask |
|
||||
| `channel_id` | uint8 | Channel ID | Venue-specific |
|
||||
| `action` | char | Order action | 'A' = add, 'C' = cancel, 'M' = modify, 'F' = fill, 'T' = trade |
|
||||
| `side` | char | Order side | 'B' = bid, 'A' = ask, 'N' = none |
|
||||
| `ts_in_delta` | int32 | Time delta | Nanoseconds |
|
||||
| `sequence` | uint32 | Sequence number | Venue-specific |
|
||||
|
||||
**When to Use:**
|
||||
- Highest granularity order flow analysis
|
||||
- Order-level reconstructions
|
||||
- Advanced market microstructure research
|
||||
- Queue position analysis
|
||||
|
||||
**Important:** MBO data is extremely granular and generates massive datasets. Always use batch downloads and carefully check costs.
|
||||
|
||||
## Definition Schema
|
||||
|
||||
Instrument metadata and definitions.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `ts_recv` | uint64 | Receipt timestamp |
|
||||
| `min_price_increment` | int64 | Minimum tick size |
|
||||
| `display_factor` | int64 | Display factor for prices |
|
||||
| `expiration` | uint64 | Contract expiration timestamp |
|
||||
| `activation` | uint64 | Contract activation timestamp |
|
||||
| `high_limit_price` | int64 | Upper price limit |
|
||||
| `low_limit_price` | int64 | Lower price limit |
|
||||
| `max_price_variation` | int64 | Maximum price move |
|
||||
| `trading_reference_price` | int64 | Reference price |
|
||||
| `unit_of_measure_qty` | int64 | Contract size |
|
||||
| `min_price_increment_amount` | int64 | Tick value |
|
||||
| `price_ratio` | int64 | Price ratio |
|
||||
| `inst_attrib_value` | int32 | Instrument attributes |
|
||||
| `underlying_id` | uint32 | Underlying instrument ID |
|
||||
| `raw_instrument_id` | uint32 | Raw instrument ID |
|
||||
| `market_depth_implied` | int32 | Implied depth |
|
||||
| `market_depth` | int32 | Market depth |
|
||||
| `market_segment_id` | uint32 | Market segment |
|
||||
| `max_trade_vol` | uint32 | Maximum trade volume |
|
||||
| `min_lot_size` | int32 | Minimum lot size |
|
||||
| `min_lot_size_block` | int32 | Block trade minimum |
|
||||
| `min_lot_size_round_lot` | int32 | Round lot minimum |
|
||||
| `min_trade_vol` | uint32 | Minimum trade volume |
|
||||
| `contract_multiplier` | int32 | Contract multiplier |
|
||||
| `decay_quantity` | int32 | Decay quantity |
|
||||
| `original_contract_size` | int32 | Original size |
|
||||
| `trading_reference_date` | uint16 | Reference date |
|
||||
| `appl_id` | int16 | Application ID |
|
||||
| `maturity_year` | uint16 | Year |
|
||||
| `decay_start_date` | uint16 | Decay start |
|
||||
| `channel_id` | uint16 | Channel |
|
||||
| `currency` | string | Currency code |
|
||||
| `settl_currency` | string | Settlement currency |
|
||||
| `secsubtype` | string | Security subtype |
|
||||
| `raw_symbol` | string | Raw symbol |
|
||||
| `group` | string | Instrument group |
|
||||
| `exchange` | string | Exchange code |
|
||||
| `asset` | string | Asset class |
|
||||
| `cfi` | string | CFI code |
|
||||
| `security_type` | string | Security type |
|
||||
| `unit_of_measure` | string | Unit of measure |
|
||||
| `underlying` | string | Underlying symbol |
|
||||
| `strike_price_currency` | string | Strike currency |
|
||||
| `instrument_class` | char | Class |
|
||||
| `strike_price` | int64 | Strike price (options) |
|
||||
| `match_algorithm` | char | Matching algorithm |
|
||||
| `md_security_trading_status` | uint8 | Trading status |
|
||||
| `main_fraction` | uint8 | Main fraction |
|
||||
| `price_display_format` | uint8 | Display format |
|
||||
| `settl_price_type` | uint8 | Settlement type |
|
||||
| `sub_fraction` | uint8 | Sub fraction |
|
||||
| `underlying_product` | uint8 | Underlying product |
|
||||
| `security_update_action` | char | Update action |
|
||||
| `maturity_month` | uint8 | Month |
|
||||
| `maturity_day` | uint8 | Day |
|
||||
| `maturity_week` | uint8 | Week |
|
||||
| `user_defined_instrument` | char | User-defined |
|
||||
| `contract_multiplier_unit` | int8 | Multiplier unit |
|
||||
| `flow_schedule_type` | int8 | Flow schedule |
|
||||
| `tick_rule` | uint8 | Tick rule |
|
||||
|
||||
**When to Use:**
|
||||
- Understanding instrument specifications
|
||||
- Calculating tick values
|
||||
- Contract expiration management
|
||||
- Symbol resolution and mapping
|
||||
|
||||
**Key Fields for ES/NQ:**
|
||||
- `min_price_increment`: Tick size (0.25 for ES, 0.25 for NQ)
|
||||
- `expiration`: Contract expiration timestamp
|
||||
- `raw_symbol`: Exchange symbol
|
||||
- `contract_multiplier`: Usually 50 for ES, 20 for NQ
|
||||
|
||||
## Statistics Schema
|
||||
|
||||
Market statistics and calculated metrics.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `ts_recv` | uint64 | Receipt timestamp |
|
||||
| `ts_ref` | uint64 | Reference timestamp |
|
||||
| `price` | int64 | Reference price |
|
||||
| `quantity` | int64 | Reference quantity |
|
||||
| `sequence` | uint32 | Sequence number |
|
||||
| `ts_in_delta` | int32 | Time delta |
|
||||
| `stat_type` | uint16 | Statistic type |
|
||||
| `channel_id` | uint16 | Channel ID |
|
||||
| `update_action` | uint8 | Update action |
|
||||
| `stat_flags` | uint8 | Statistic flags |
|
||||
|
||||
**Common Statistic Types:**
|
||||
- Opening price
|
||||
- Settlement price
|
||||
- High/low prices
|
||||
- Trading volume
|
||||
- Open interest
|
||||
|
||||
**When to Use:**
|
||||
- Official settlement prices
|
||||
- Open interest analysis
|
||||
- Exchange-calculated statistics
|
||||
|
||||
## Status Schema
|
||||
|
||||
Instrument trading status and state changes.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `ts_recv` | uint64 | Receipt timestamp |
|
||||
| `ts_event` | uint64 | Event timestamp |
|
||||
| `action` | uint16 | Status action |
|
||||
| `reason` | uint16 | Status reason |
|
||||
| `trading_event` | uint16 | Trading event |
|
||||
| `is_trading` | int8 | Trading flag (1 = trading, 0 = not trading) |
|
||||
| `is_quoting` | int8 | Quoting flag |
|
||||
| `is_short_sell_restricted` | int8 | Short sell flag |
|
||||
|
||||
**When to Use:**
|
||||
- Detecting trading halts
|
||||
- Understanding market status changes
|
||||
- Filtering data by trading status
|
||||
|
||||
## Imbalance Schema
|
||||
|
||||
Order imbalance data for auctions and closes.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `ts_recv` | uint64 | Receipt timestamp |
|
||||
| `ts_event` | uint64 | Event timestamp |
|
||||
| `ref_price` | int64 | Reference price |
|
||||
| `auction_time` | uint64 | Auction timestamp |
|
||||
| `cont_book_clr_price` | int64 | Continuous book clearing price |
|
||||
| `auct_interest_clr_price` | int64 | Auction interest clearing price |
|
||||
| `paired_qty` | uint64 | Paired quantity |
|
||||
| `total_imbalance_qty` | uint64 | Total imbalance |
|
||||
| `side` | char | Imbalance side ('B' or 'A') |
|
||||
| `significant_imbalance` | char | Significance flag |
|
||||
|
||||
**When to Use:**
|
||||
- Opening/closing auction analysis
|
||||
- Imbalance trading strategies
|
||||
- End-of-day positioning
|
||||
|
||||
## Schema Selection Decision Matrix
|
||||
|
||||
| Analysis Type | Recommended Schema | Alternative |
|
||||
|---------------|-------------------|-------------|
|
||||
| Daily backtesting | ohlcv-1d | ohlcv-1h |
|
||||
| Intraday backtesting | ohlcv-1h, ohlcv-1m | trades |
|
||||
| Spread analysis | mbp-1 | trades |
|
||||
| Order flow | trades | mbp-1 |
|
||||
| Market depth | mbp-10 | mbo |
|
||||
| Tick-by-tick | trades | mbo |
|
||||
| Liquidity analysis | mbp-1, mbp-10 | mbo |
|
||||
| Contract specifications | definition | - |
|
||||
| Settlement prices | statistics | definition |
|
||||
| Trading halts | status | - |
|
||||
| Auction analysis | imbalance | trades |
|
||||
|
||||
## Data Type Reference
|
||||
|
||||
### Fixed-Point Prices
|
||||
All price fields are stored as int64 in fixed-point notation with 9 decimal places of precision.
|
||||
|
||||
**Conversion:**
|
||||
```python
|
||||
decimal_price = int64_price / 1_000_000_000
|
||||
```
|
||||
|
||||
**Example:**
|
||||
- ES at 4500.25 → stored as 4500250000000
|
||||
- NQ at 15000.50 → stored as 15000500000000
|
||||
|
||||
### Timestamps
|
||||
All timestamps are uint64 nanoseconds since Unix epoch (1970-01-01 00:00:00 UTC).
|
||||
|
||||
**Conversion to datetime:**
|
||||
```python
|
||||
import datetime
|
||||
dt = datetime.datetime.fromtimestamp(ts_event / 1_000_000_000, tz=datetime.timezone.utc)
|
||||
```
|
||||
|
||||
### Character Fields
|
||||
Single-character fields (char) represent enums:
|
||||
- Action: 'A' (add), 'C' (cancel), 'M' (modify), 'T' (trade), 'F' (fill)
|
||||
- Side: 'B' (bid), 'A' (ask), 'N' (none/unknown)
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Schema Size (Approximate bytes per record)
|
||||
|
||||
| Schema | Size | Records/GB |
|
||||
|--------|------|------------|
|
||||
| ohlcv-1d | ~100 | ~10M |
|
||||
| ohlcv-1h | ~100 | ~10M |
|
||||
| trades | ~50 | ~20M |
|
||||
| mbp-1 | ~150 | ~6.7M |
|
||||
| mbp-10 | ~500 | ~2M |
|
||||
| mbo | ~80 | ~12.5M |
|
||||
|
||||
**Planning requests:**
|
||||
- 1 day of ES trades ≈ 100K-500K records ≈ 5-25 MB
|
||||
- 1 day of ES mbp-1 ≈ 1M-5M records ≈ 150-750 MB
|
||||
- 1 year of ES ohlcv-1h ≈ 6K records ≈ 600 KB
|
||||
|
||||
Use these estimates to decide between timeseries (< 5GB) and batch downloads (> 5GB).
|
||||
451
skills/databento/references/symbology.md
Normal file
451
skills/databento/references/symbology.md
Normal file
@@ -0,0 +1,451 @@
|
||||
# Databento Symbology Reference
|
||||
|
||||
Comprehensive guide to Databento's symbology system including continuous contracts, symbol types, and resolution strategies.
|
||||
|
||||
## Symbol Types (stypes)
|
||||
|
||||
Databento supports multiple symbology naming conventions. Use `mcp__databento__symbology_resolve` to convert between types.
|
||||
|
||||
### raw_symbol
|
||||
Native exchange symbols as provided by the venue.
|
||||
|
||||
**Examples:**
|
||||
- `ESH5` - ES March 2025 contract
|
||||
- `NQM5` - NQ June 2025 contract
|
||||
- `AAPL` - Apple Inc. stock
|
||||
- `SPY` - SPDR S&P 500 ETF
|
||||
|
||||
**When to use:**
|
||||
- Working with specific contract months
|
||||
- Exact symbol from exchange documentation
|
||||
- Historical analysis of specific expirations
|
||||
|
||||
**Limitations:**
|
||||
- Requires knowing exact contract codes
|
||||
- Different venues use different conventions
|
||||
- Doesn't handle roll automatically
|
||||
|
||||
### instrument_id
|
||||
Databento's internal numeric identifier for each instrument.
|
||||
|
||||
**Examples:**
|
||||
- `123456789` - Unique ID for ESH5
|
||||
- `987654321` - Unique ID for NQM5
|
||||
|
||||
**When to use:**
|
||||
- After symbol resolution
|
||||
- Internally within Databento system
|
||||
- When guaranteed uniqueness is required
|
||||
|
||||
**Benefits:**
|
||||
- Globally unique across all venues
|
||||
- Never changes for a given instrument
|
||||
- Most efficient for API requests
|
||||
|
||||
**Limitations:**
|
||||
- Not human-readable
|
||||
- Requires resolution step to obtain
|
||||
|
||||
### continuous
|
||||
Continuous contract notation with automatic rolling for futures.
|
||||
|
||||
**Format:** `{ROOT}.{STRATEGY}.{OFFSET}`
|
||||
|
||||
**Examples:**
|
||||
- `ES.c.0` - ES front month, calendar roll
|
||||
- `NQ.n.0` - NQ front month, open interest roll
|
||||
- `ES.v.1` - ES second month, volume roll
|
||||
- `GC.c.0` - Gold front month, calendar roll
|
||||
|
||||
**When to use:**
|
||||
- Backtesting across multiple expirations
|
||||
- Avoiding roll gaps in analysis
|
||||
- Long-term continuous price series
|
||||
|
||||
**Benefits:**
|
||||
- Automatic roll handling
|
||||
- Consistent symbology across time
|
||||
- Ideal for backtesting
|
||||
|
||||
### parent
|
||||
Parent contract symbols for options or complex instruments.
|
||||
|
||||
**Examples:**
|
||||
- `ES` - Parent for all ES contracts
|
||||
- `NQ` - Parent for all NQ contracts
|
||||
|
||||
**When to use:**
|
||||
- Options underlying symbols
|
||||
- Querying all contracts in a family
|
||||
- Getting contract family metadata
|
||||
|
||||
## Continuous Contract Deep Dive
|
||||
|
||||
Continuous contracts are the most powerful feature for futures analysis. They automatically handle contract rolls using different strategies.
|
||||
|
||||
### Roll Strategies
|
||||
|
||||
#### Calendar Roll (.c.X)
|
||||
Rolls on fixed calendar dates regardless of market activity.
|
||||
|
||||
**Notation:** `ES.c.0`, `NQ.c.1`
|
||||
|
||||
**Roll Timing:**
|
||||
- ES: Rolls 8 days before contract expiration
|
||||
- NQ: Rolls 8 days before contract expiration
|
||||
|
||||
**When to use:**
|
||||
- Standard backtesting
|
||||
- Most predictable roll schedule
|
||||
- When roll timing is less critical
|
||||
|
||||
**Pros:**
|
||||
- Predictable roll dates
|
||||
- Consistent across instruments
|
||||
- Simple to understand
|
||||
|
||||
**Cons:**
|
||||
- May roll during low liquidity
|
||||
- Doesn't consider market dynamics
|
||||
|
||||
#### Open Interest Roll (.n.X)
|
||||
Rolls when open interest moves to the next contract.
|
||||
|
||||
**Notation:** `ES.n.0`, `NQ.n.1`
|
||||
|
||||
**Roll Timing:**
|
||||
- Switches when next contract's OI > current contract's OI
|
||||
|
||||
**When to use:**
|
||||
- Avoiding early rolls
|
||||
- Following market participants
|
||||
- When market dynamics matter
|
||||
|
||||
**Pros:**
|
||||
- Follows market behavior
|
||||
- Natural transition point
|
||||
- Avoids artificial timing
|
||||
|
||||
**Cons:**
|
||||
- Less predictable timing
|
||||
- Can be delayed during low volume
|
||||
- Different instruments roll at different times
|
||||
|
||||
#### Volume Roll (.v.X)
|
||||
Rolls when trading volume moves to the next contract.
|
||||
|
||||
**Notation:** `ES.v.0`, `NQ.v.1`
|
||||
|
||||
**Roll Timing:**
|
||||
- Switches when next contract's volume > current contract's volume
|
||||
|
||||
**When to use:**
|
||||
- Following most liquid contract
|
||||
- High-frequency analysis
|
||||
- When execution quality matters
|
||||
|
||||
**Pros:**
|
||||
- Always in most liquid contract
|
||||
- Best for execution
|
||||
- Real-time liquidity tracking
|
||||
|
||||
**Cons:**
|
||||
- Most variable timing
|
||||
- Can switch back and forth
|
||||
- Requires careful validation
|
||||
|
||||
### Offset Parameter (.X)
|
||||
|
||||
The offset determines which contract month in the series.
|
||||
|
||||
| Offset | Description | Example Usage |
|
||||
|--------|-------------|---------------|
|
||||
| `.0` | Front month | Primary trading contract |
|
||||
| `.1` | Second month | Spread analysis vs front |
|
||||
| `.2` | Third month | Deferred spread analysis |
|
||||
| `.3+` | Further months | Calendar spread strategies |
|
||||
|
||||
**Common Patterns:**
|
||||
- `ES.c.0` - Standard ES continuous (front month)
|
||||
- `ES.c.0,ES.c.1` - ES calendar spread (front vs back)
|
||||
- `ES.c.0,NQ.c.0` - ES/NQ pair analysis
|
||||
|
||||
## ES/NQ Specific Symbology
|
||||
|
||||
### ES (E-mini S&P 500)
|
||||
|
||||
**Contract Months:** H (Mar), M (Jun), U (Sep), Z (Dec)
|
||||
|
||||
**Raw Symbol Format:** `ES{MONTH}{YEAR}`
|
||||
- `ESH5` = March 2025
|
||||
- `ESM5` = June 2025
|
||||
- `ESU5` = September 2025
|
||||
- `ESZ5` = December 2025
|
||||
|
||||
**Continuous Contracts:**
|
||||
- `ES.c.0` - Front month (most common)
|
||||
- `ES.n.0` - OI-based front month
|
||||
- `ES.v.0` - Volume-based front month
|
||||
|
||||
**Tick Size:** 0.25 points ($12.50 per tick)
|
||||
**Contract Multiplier:** $50 per point
|
||||
**Trading Hours:** Nearly 24 hours (Sunday 6pm - Friday 5pm ET)
|
||||
|
||||
### NQ (E-mini Nasdaq-100)
|
||||
|
||||
**Contract Months:** H (Mar), M (Jun), U (Sep), Z (Dec)
|
||||
|
||||
**Raw Symbol Format:** `NQ{MONTH}{YEAR}`
|
||||
- `NQH5` = March 2025
|
||||
- `NQM5` = June 2025
|
||||
- `NQU5` = September 2025
|
||||
- `NQZ5` = December 2025
|
||||
|
||||
**Continuous Contracts:**
|
||||
- `NQ.c.0` - Front month (most common)
|
||||
- `NQ.n.0` - OI-based front month
|
||||
- `NQ.v.0` - Volume-based front month
|
||||
|
||||
**Tick Size:** 0.25 points ($5.00 per tick)
|
||||
**Contract Multiplier:** $20 per point
|
||||
**Trading Hours:** Nearly 24 hours (Sunday 6pm - Friday 5pm ET)
|
||||
|
||||
### Month Codes Reference
|
||||
|
||||
| Code | Month | Typical Expiration |
|
||||
|------|-------|-------------------|
|
||||
| F | January | 3rd Friday |
|
||||
| G | February | 3rd Friday |
|
||||
| H | March | 3rd Friday |
|
||||
| J | April | 3rd Friday |
|
||||
| K | May | 3rd Friday |
|
||||
| M | June | 3rd Friday |
|
||||
| N | July | 3rd Friday |
|
||||
| Q | August | 3rd Friday |
|
||||
| U | September | 3rd Friday |
|
||||
| V | October | 3rd Friday |
|
||||
| X | November | 3rd Friday |
|
||||
| Z | December | 3rd Friday |
|
||||
|
||||
**Note:** ES/NQ only trade quarterly contracts (H, M, U, Z).
|
||||
|
||||
## Symbol Resolution
|
||||
|
||||
Use `mcp__databento__symbology_resolve` to convert between symbol types.
|
||||
|
||||
### Common Resolution Patterns
|
||||
|
||||
**Continuous to Instrument ID:**
|
||||
```
|
||||
Input: ES.c.0
|
||||
stype_in: continuous
|
||||
stype_out: instrument_id
|
||||
Result: Maps to current front month's instrument_id
|
||||
```
|
||||
|
||||
**Raw Symbol to Instrument ID:**
|
||||
```
|
||||
Input: ESH5
|
||||
stype_in: raw_symbol
|
||||
stype_out: instrument_id
|
||||
Result: Specific instrument_id for ESH5
|
||||
```
|
||||
|
||||
**Continuous to Raw Symbol:**
|
||||
```
|
||||
Input: ES.c.0
|
||||
stype_in: continuous
|
||||
stype_out: raw_symbol
|
||||
Result: Current front month symbol (e.g., ESH5)
|
||||
```
|
||||
|
||||
### Time-Based Resolution
|
||||
|
||||
Symbol resolution is **date-dependent**. The same continuous contract resolves to different instruments across time.
|
||||
|
||||
**Example:**
|
||||
- `ES.c.0` on 2024-01-15 → ESH4 (March 2024)
|
||||
- `ES.c.0` on 2024-04-15 → ESM4 (June 2024)
|
||||
- `ES.c.0` on 2024-07-15 → ESU4 (September 2024)
|
||||
|
||||
**Important:** Always specify `start_date` and `end_date` when resolving symbols for historical analysis.
|
||||
|
||||
### Resolution Parameters
|
||||
|
||||
```
|
||||
mcp__databento__symbology_resolve
|
||||
- dataset: "GLBX.MDP3"
|
||||
- symbols: ["ES.c.0", "NQ.c.0"]
|
||||
- stype_in: "continuous"
|
||||
- stype_out: "instrument_id"
|
||||
- start_date: "2024-01-01"
|
||||
- end_date: "2024-12-31"
|
||||
```
|
||||
|
||||
Returns mapping of continuous symbols to instrument IDs for each day in the range.
|
||||
|
||||
## Expiration Handling
|
||||
|
||||
### Roll Dates
|
||||
|
||||
ES/NQ contracts expire on the **3rd Friday of the contract month** at 9:30 AM ET.
|
||||
|
||||
**Calendar Roll (.c.0) Schedule:**
|
||||
- Rolls **8 days before expiration**
|
||||
- Always rolls on the same relative day
|
||||
- Predictable for backtesting
|
||||
|
||||
**Example for ESH5 (March 2025):**
|
||||
- Expiration: Friday, March 21, 2025
|
||||
- Calendar roll: March 13, 2025 (8 days before)
|
||||
|
||||
### Roll Detection
|
||||
|
||||
To detect when a continuous contract rolled, compare instrument_id or raw_symbol across consecutive timestamps.
|
||||
|
||||
**Example:**
|
||||
```
|
||||
2024-03-12: ES.c.0 → ESH4
|
||||
2024-03-13: ES.c.0 → ESM4 (rolled!)
|
||||
```
|
||||
|
||||
### Handling Roll Gaps
|
||||
|
||||
Price discontinuities often occur at roll:
|
||||
|
||||
**Gap Detection:**
|
||||
```
|
||||
if abs(close_before_roll - open_after_roll) > threshold:
|
||||
# Roll gap detected
|
||||
```
|
||||
|
||||
**Adjustment Strategies:**
|
||||
1. **Ratio Adjustment:** Multiply historical prices by ratio
|
||||
2. **Difference Adjustment:** Add/subtract difference
|
||||
3. **No Adjustment:** Keep raw prices (most common for futures)
|
||||
|
||||
For ES/NQ futures, **no adjustment** is standard since contracts are similar.
|
||||
|
||||
## Symbol Validation
|
||||
|
||||
### Valid Symbol Patterns
|
||||
|
||||
**Continuous:**
|
||||
- Must match: `{ROOT}.{c|n|v}.{0-9+}`
|
||||
- Examples: `ES.c.0`, `NQ.n.1`, `GC.v.0`
|
||||
|
||||
**Raw Symbols (Futures):**
|
||||
- Must match: `{ROOT}{MONTH_CODE}{YEAR}`
|
||||
- Examples: `ESH5`, `NQZ4`, `GCM6`
|
||||
|
||||
**Equity Symbols:**
|
||||
- 1-5 uppercase letters
|
||||
- Examples: `AAPL`, `MSFT`, `SPY`, `GOOGL`
|
||||
|
||||
### Symbol Existence Validation
|
||||
|
||||
Before using a symbol, validate it exists in the dataset:
|
||||
|
||||
1. Use `mcp__databento__symbology_resolve` to resolve
|
||||
2. Use `mcp__databento__reference_search_securities` for metadata
|
||||
3. Check definition schema for instrument details
|
||||
|
||||
## Common Symbol Pitfalls
|
||||
|
||||
### 1. Wrong stype_in for Continuous Contracts
|
||||
**Wrong:**
|
||||
```
|
||||
symbols: "ES.c.0"
|
||||
stype_in: "raw_symbol" # WRONG!
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```
|
||||
symbols: "ES.c.0"
|
||||
stype_in: "continuous" # CORRECT
|
||||
```
|
||||
|
||||
### 2. Forgetting Date Range for Resolution
|
||||
**Wrong:**
|
||||
```
|
||||
symbology_resolve(symbols=["ES.c.0"], start_date="2024-01-01")
|
||||
# Missing end_date - only resolves for one day
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```
|
||||
symbology_resolve(symbols=["ES.c.0"], start_date="2024-01-01", end_date="2024-12-31")
|
||||
# Resolves for entire year
|
||||
```
|
||||
|
||||
### 3. Using Expired Contracts
|
||||
**Wrong:**
|
||||
```
|
||||
# ESH4 expired in March 2024
|
||||
symbols: "ESH4"
|
||||
start_date: "2024-06-01" # After expiration!
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```
|
||||
# Use continuous contract
|
||||
symbols: "ES.c.0"
|
||||
start_date: "2024-06-01" # Automatically maps to ESM4
|
||||
```
|
||||
|
||||
### 4. Mixing Symbol Types
|
||||
**Wrong:**
|
||||
```
|
||||
symbols: "ES.c.0,ESH5,123456" # Mixed types!
|
||||
```
|
||||
|
||||
**Correct:**
|
||||
```
|
||||
# Resolve separately or use same type
|
||||
symbols: "ES.c.0,NQ.c.0" # All continuous
|
||||
```
|
||||
|
||||
## Symbol Best Practices
|
||||
|
||||
1. **Use continuous contracts for backtesting** - Avoids manual roll management
|
||||
2. **Prefer calendar rolls (.c.X) unless specific reason** - Most predictable
|
||||
3. **Always validate symbols exist** - Use symbology_resolve before fetching data
|
||||
4. **Specify date ranges for resolution** - Symbol meanings change over time
|
||||
5. **Use instrument_id after resolution** - Most efficient for API calls
|
||||
6. **Document roll strategy** - Know which roll type (.c/.n/.v) you're using
|
||||
7. **Test around roll dates** - Verify behavior during contract transitions
|
||||
8. **Cache symbol mappings** - Don't re-resolve repeatedly
|
||||
|
||||
## Quick Reference: Common Symbols
|
||||
|
||||
### ES/NQ Continuous (Most Common)
|
||||
```
|
||||
ES.c.0 # ES front month, calendar roll
|
||||
NQ.c.0 # NQ front month, calendar roll
|
||||
ES.c.1 # ES second month
|
||||
NQ.c.1 # NQ second month
|
||||
```
|
||||
|
||||
### ES/NQ Specific Contracts (2025)
|
||||
```
|
||||
ESH5 # ES March 2025
|
||||
ESM5 # ES June 2025
|
||||
ESU5 # ES September 2025
|
||||
ESZ5 # ES December 2025
|
||||
|
||||
NQH5 # NQ March 2025
|
||||
NQM5 # NQ June 2025
|
||||
NQU5 # NQ September 2025
|
||||
NQZ5 # NQ December 2025
|
||||
```
|
||||
|
||||
### Equity Market Breadth (Supporting ES/NQ Analysis)
|
||||
```
|
||||
SPY # SPDR S&P 500 ETF
|
||||
QQQ # Invesco QQQ (Nasdaq-100 ETF)
|
||||
VIX # CBOE Volatility Index
|
||||
TICK # NYSE TICK
|
||||
VOLD # NYSE Volume Delta
|
||||
```
|
||||
|
||||
For equity symbols, use dataset `XNAS.ITCH` (Nasdaq) or other appropriate equity dataset.
|
||||
345
skills/databento/scripts/fetch_ohlcv.py
Normal file
345
skills/databento/scripts/fetch_ohlcv.py
Normal file
@@ -0,0 +1,345 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Databento OHLCV Data Fetcher
|
||||
|
||||
Standard pattern for fetching OHLCV data with built-in best practices:
|
||||
- Automatic cost estimation before fetch
|
||||
- Error handling with retries
|
||||
- Post-fetch data validation
|
||||
- Export options (CSV/pandas)
|
||||
|
||||
Usage:
|
||||
python fetch_ohlcv.py --symbol ES.c.0 --schema ohlcv-1h --start 2024-01-01 --end 2024-01-31
|
||||
python fetch_ohlcv.py --symbol NQ.c.0 --schema ohlcv-1d --start 2024-01-01 --limit 100
|
||||
python fetch_ohlcv.py --symbol ES.c.0,NQ.c.0 --schema ohlcv-1h --start 2024-01-01 --output data.csv
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Optional, Dict, Any, List
|
||||
import time
|
||||
|
||||
|
||||
class DatabentoPHTLCVFetcher:
|
||||
"""Fetches OHLCV data from Databento with best practices built-in."""
|
||||
|
||||
def __init__(self, dataset: str = "GLBX.MDP3", stype_in: str = "continuous"):
|
||||
"""
|
||||
Initialize fetcher.
|
||||
|
||||
Args:
|
||||
dataset: Dataset code (default: GLBX.MDP3 for ES/NQ)
|
||||
stype_in: Input symbol type (default: continuous)
|
||||
"""
|
||||
self.dataset = dataset
|
||||
self.stype_in = stype_in
|
||||
self.max_retries = 3
|
||||
self.retry_delay = 2 # seconds
|
||||
|
||||
def estimate_cost(
|
||||
self,
|
||||
symbols: str,
|
||||
schema: str,
|
||||
start: str,
|
||||
end: Optional[str] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Estimate cost before fetching data.
|
||||
|
||||
Args:
|
||||
symbols: Comma-separated symbol list
|
||||
schema: Data schema (e.g., ohlcv-1h)
|
||||
start: Start date (YYYY-MM-DD)
|
||||
end: End date (optional)
|
||||
|
||||
Returns:
|
||||
Cost estimation result
|
||||
"""
|
||||
print(f"[COST CHECK] Estimating cost for {symbols} ({schema})...")
|
||||
|
||||
# NOTE: In actual usage, this would call the MCP tool:
|
||||
# mcp__databento__metadata_get_cost(
|
||||
# dataset=self.dataset,
|
||||
# start=start,
|
||||
# end=end,
|
||||
# symbols=symbols,
|
||||
# schema=schema,
|
||||
# stype_in=self.stype_in
|
||||
# )
|
||||
|
||||
# For this template, we simulate the response
|
||||
print("[NOTE] This template script demonstrates the pattern.")
|
||||
print("[NOTE] In actual usage, integrate with MCP tools directly.")
|
||||
|
||||
return {
|
||||
"estimated_cost_usd": 0.0,
|
||||
"estimated_size_mb": 0.0,
|
||||
"note": "Call mcp__databento__metadata_get_cost here"
|
||||
}
|
||||
|
||||
def validate_dataset_range(self) -> Dict[str, str]:
|
||||
"""
|
||||
Validate dataset availability.
|
||||
|
||||
Returns:
|
||||
Dataset date range
|
||||
"""
|
||||
print(f"[VALIDATION] Checking dataset availability for {self.dataset}...")
|
||||
|
||||
# NOTE: In actual usage, this would call:
|
||||
# mcp__databento__metadata_get_dataset_range(dataset=self.dataset)
|
||||
|
||||
return {
|
||||
"start_date": "2000-01-01",
|
||||
"end_date": datetime.now().strftime("%Y-%m-%d"),
|
||||
"note": "Call mcp__databento__metadata_get_dataset_range here"
|
||||
}
|
||||
|
||||
def fetch_data(
|
||||
self,
|
||||
symbols: str,
|
||||
schema: str,
|
||||
start: str,
|
||||
end: Optional[str] = None,
|
||||
limit: Optional[int] = None,
|
||||
check_cost: bool = True
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Fetch OHLCV data with retries and error handling.
|
||||
|
||||
Args:
|
||||
symbols: Comma-separated symbol list
|
||||
schema: Data schema (e.g., ohlcv-1h, ohlcv-1d)
|
||||
start: Start date (YYYY-MM-DD)
|
||||
end: End date (optional)
|
||||
limit: Maximum number of records (optional)
|
||||
check_cost: Whether to check cost before fetching (default: True)
|
||||
|
||||
Returns:
|
||||
Fetched data
|
||||
"""
|
||||
# Step 1: Cost check (if enabled)
|
||||
if check_cost:
|
||||
cost_info = self.estimate_cost(symbols, schema, start, end)
|
||||
print(f"[COST] Estimated cost: ${cost_info.get('estimated_cost_usd', 0):.2f}")
|
||||
print(f"[COST] Estimated size: {cost_info.get('estimated_size_mb', 0):.2f} MB")
|
||||
|
||||
# Prompt for confirmation if cost is high
|
||||
estimated_cost = cost_info.get('estimated_cost_usd', 0)
|
||||
if estimated_cost > 10:
|
||||
response = input(f"\nEstimated cost is ${estimated_cost:.2f}. Continue? (y/n): ")
|
||||
if response.lower() != 'y':
|
||||
print("[CANCELLED] Data fetch cancelled by user.")
|
||||
sys.exit(0)
|
||||
|
||||
# Step 2: Validate dataset
|
||||
dataset_range = self.validate_dataset_range()
|
||||
print(f"[DATASET] Available range: {dataset_range.get('start_date')} to {dataset_range.get('end_date')}")
|
||||
|
||||
# Step 3: Fetch data with retries
|
||||
for attempt in range(self.max_retries):
|
||||
try:
|
||||
print(f"\n[FETCH] Attempt {attempt + 1}/{self.max_retries}")
|
||||
print(f"[FETCH] Fetching {symbols} ({schema}) from {start} to {end or 'now'}...")
|
||||
|
||||
# NOTE: In actual usage, this would call:
|
||||
# data = mcp__databento__timeseries_get_range(
|
||||
# dataset=self.dataset,
|
||||
# symbols=symbols,
|
||||
# schema=schema,
|
||||
# start=start,
|
||||
# end=end,
|
||||
# stype_in=self.stype_in,
|
||||
# stype_out="instrument_id",
|
||||
# limit=limit
|
||||
# )
|
||||
|
||||
# Simulate successful fetch
|
||||
print("[SUCCESS] Data fetched successfully!")
|
||||
return {
|
||||
"data": [],
|
||||
"record_count": 0,
|
||||
"note": "Call mcp__databento__timeseries_get_range here"
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
print(f"[ERROR] Attempt {attempt + 1} failed: {str(e)}")
|
||||
|
||||
if attempt < self.max_retries - 1:
|
||||
print(f"[RETRY] Waiting {self.retry_delay} seconds before retry...")
|
||||
time.sleep(self.retry_delay)
|
||||
else:
|
||||
print("[FAILED] All retry attempts exhausted.")
|
||||
raise
|
||||
|
||||
def validate_data(self, data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Validate fetched data quality.
|
||||
|
||||
Args:
|
||||
data: Fetched data
|
||||
|
||||
Returns:
|
||||
Validation report
|
||||
"""
|
||||
print("\n[VALIDATION] Running data quality checks...")
|
||||
|
||||
# NOTE: Actual validation would:
|
||||
# - Check for timestamp gaps
|
||||
# - Verify record counts
|
||||
# - Validate price ranges
|
||||
# - Check for duplicates
|
||||
|
||||
# Use scripts/validate_data.py for comprehensive validation
|
||||
|
||||
return {
|
||||
"valid": True,
|
||||
"record_count": data.get("record_count", 0),
|
||||
"issues": [],
|
||||
"note": "Use scripts/validate_data.py for detailed validation"
|
||||
}
|
||||
|
||||
def export_csv(self, data: Dict[str, Any], output_path: str):
|
||||
"""
|
||||
Export data to CSV.
|
||||
|
||||
Args:
|
||||
data: Data to export
|
||||
output_path: Output file path
|
||||
"""
|
||||
print(f"\n[EXPORT] Saving data to {output_path}...")
|
||||
|
||||
# NOTE: Actual export would convert data to CSV format
|
||||
# and write to file
|
||||
|
||||
print(f"[SUCCESS] Data saved to {output_path}")
|
||||
|
||||
def export_json(self, data: Dict[str, Any], output_path: str):
|
||||
"""
|
||||
Export data to JSON.
|
||||
|
||||
Args:
|
||||
data: Data to export
|
||||
output_path: Output file path
|
||||
"""
|
||||
print(f"\n[EXPORT] Saving data to {output_path}...")
|
||||
|
||||
with open(output_path, 'w') as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
print(f"[SUCCESS] Data saved to {output_path}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point for CLI usage."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Fetch OHLCV data from Databento with best practices"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--symbol",
|
||||
"-s",
|
||||
required=True,
|
||||
help="Symbol or comma-separated symbols (e.g., ES.c.0 or ES.c.0,NQ.c.0)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--schema",
|
||||
choices=["ohlcv-1s", "ohlcv-1m", "ohlcv-1h", "ohlcv-1d", "ohlcv-eod"],
|
||||
default="ohlcv-1h",
|
||||
help="OHLCV schema (default: ohlcv-1h)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--start",
|
||||
required=True,
|
||||
help="Start date (YYYY-MM-DD)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--end",
|
||||
help="End date (YYYY-MM-DD, optional)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--limit",
|
||||
type=int,
|
||||
help="Maximum number of records (optional)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--dataset",
|
||||
default="GLBX.MDP3",
|
||||
help="Dataset code (default: GLBX.MDP3)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--stype-in",
|
||||
default="continuous",
|
||||
choices=["continuous", "raw_symbol", "instrument_id"],
|
||||
help="Input symbol type (default: continuous)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
"-o",
|
||||
help="Output file path (CSV or JSON based on extension)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--no-cost-check",
|
||||
action="store_true",
|
||||
help="Skip cost estimation (not recommended)"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Create fetcher
|
||||
fetcher = DatabentOHLCVFetcher(
|
||||
dataset=args.dataset,
|
||||
stype_in=args.stype_in
|
||||
)
|
||||
|
||||
try:
|
||||
# Fetch data
|
||||
data = fetcher.fetch_data(
|
||||
symbols=args.symbol,
|
||||
schema=args.schema,
|
||||
start=args.start,
|
||||
end=args.end,
|
||||
limit=args.limit,
|
||||
check_cost=not args.no_cost_check
|
||||
)
|
||||
|
||||
# Validate data
|
||||
validation = fetcher.validate_data(data)
|
||||
print(f"\n[VALIDATION] Data is valid: {validation['valid']}")
|
||||
print(f"[VALIDATION] Record count: {validation['record_count']}")
|
||||
|
||||
if validation['issues']:
|
||||
print(f"[WARNING] Issues found: {validation['issues']}")
|
||||
|
||||
# Export if output specified
|
||||
if args.output:
|
||||
if args.output.endswith('.csv'):
|
||||
fetcher.export_csv(data, args.output)
|
||||
elif args.output.endswith('.json'):
|
||||
fetcher.export_json(data, args.output)
|
||||
else:
|
||||
print("[WARNING] Unknown output format. Saving as JSON.")
|
||||
fetcher.export_json(data, args.output + '.json')
|
||||
|
||||
print("\n[DONE] Fetch complete!")
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n[CANCELLED] Fetch cancelled by user.")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
print(f"\n[ERROR] Fetch failed: {str(e)}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
388
skills/databento/scripts/session_filter.py
Normal file
388
skills/databento/scripts/session_filter.py
Normal file
@@ -0,0 +1,388 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Databento Trading Session Filter
|
||||
|
||||
Filter market data by trading session (Asian/London/NY):
|
||||
- Session detection using get_session_info
|
||||
- Historical data filtering by session
|
||||
- Session transition handling
|
||||
- Session-specific statistics
|
||||
|
||||
Usage:
|
||||
python session_filter.py --input data.json --session NY --output ny_session.json
|
||||
python session_filter.py --input data.json --session London --stats
|
||||
python session_filter.py --input data.json --sessions Asian,London --output combined.json
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from datetime import datetime, timezone, timedelta
|
||||
from typing import Dict, List, Any, Optional, Tuple
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class TradingSession(Enum):
|
||||
"""Trading session definitions (in ET)."""
|
||||
ASIAN = ("Asian", 18, 2) # 6pm - 2am ET
|
||||
LONDON = ("London", 2, 8) # 2am - 8am ET
|
||||
NY = ("NY", 8, 16) # 8am - 4pm ET
|
||||
|
||||
|
||||
class SessionFilter:
|
||||
"""Filters Databento market data by trading session."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize session filter."""
|
||||
self.sessions = {
|
||||
"Asian": TradingSession.ASIAN,
|
||||
"London": TradingSession.LONDON,
|
||||
"NY": TradingSession.NY
|
||||
}
|
||||
|
||||
def get_current_session(self, timestamp: Optional[str] = None) -> str:
|
||||
"""
|
||||
Get trading session for a timestamp.
|
||||
|
||||
Args:
|
||||
timestamp: ISO timestamp (optional, defaults to now)
|
||||
|
||||
Returns:
|
||||
Session name (Asian, London, or NY)
|
||||
"""
|
||||
# NOTE: In actual usage, this would call:
|
||||
# session_info = mcp__databento__get_session_info(timestamp=timestamp)
|
||||
# return session_info["session"]
|
||||
|
||||
# For this template, simulate session detection
|
||||
if timestamp:
|
||||
dt = datetime.fromisoformat(timestamp.replace('Z', '+00:00'))
|
||||
else:
|
||||
dt = datetime.now(timezone.utc)
|
||||
|
||||
# Convert to ET
|
||||
et_hour = (dt.hour - 5) % 24 # Simplified ET conversion
|
||||
|
||||
# Determine session
|
||||
if 18 <= et_hour or et_hour < 2:
|
||||
return "Asian"
|
||||
elif 2 <= et_hour < 8:
|
||||
return "London"
|
||||
else:
|
||||
return "NY"
|
||||
|
||||
def is_in_session(
|
||||
self,
|
||||
timestamp_ns: int,
|
||||
session: TradingSession
|
||||
) -> bool:
|
||||
"""
|
||||
Check if timestamp falls within trading session.
|
||||
|
||||
Args:
|
||||
timestamp_ns: Timestamp in nanoseconds
|
||||
session: Trading session to check
|
||||
|
||||
Returns:
|
||||
True if timestamp is in session
|
||||
"""
|
||||
# Convert nanoseconds to datetime
|
||||
ts_seconds = timestamp_ns / 1_000_000_000
|
||||
dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
|
||||
|
||||
# Convert to ET (simplified, doesn't handle DST)
|
||||
et_offset = timedelta(hours=-5)
|
||||
dt_et = dt + et_offset
|
||||
|
||||
hour = dt_et.hour
|
||||
|
||||
# Check if hour falls within session
|
||||
_, start_hour, end_hour = session.value
|
||||
|
||||
if start_hour < end_hour:
|
||||
# Session doesn't cross midnight
|
||||
return start_hour <= hour < end_hour
|
||||
else:
|
||||
# Session crosses midnight (Asian session)
|
||||
return hour >= start_hour or hour < end_hour
|
||||
|
||||
def filter_by_session(
|
||||
self,
|
||||
data: List[Dict[str, Any]],
|
||||
sessions: List[str]
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Filter data to include only specified sessions.
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
sessions: List of session names to include
|
||||
|
||||
Returns:
|
||||
Filtered data
|
||||
"""
|
||||
print(f"[FILTER] Filtering {len(data)} records for sessions: {', '.join(sessions)}")
|
||||
|
||||
session_enums = [self.sessions[s] for s in sessions]
|
||||
filtered = []
|
||||
|
||||
for record in data:
|
||||
# Extract timestamp
|
||||
ts_ns = record.get("ts_event") or record.get("ts_recv") or record.get("timestamp")
|
||||
|
||||
if not ts_ns:
|
||||
continue
|
||||
|
||||
# Check if in any of the specified sessions
|
||||
for session in session_enums:
|
||||
if self.is_in_session(int(ts_ns), session):
|
||||
filtered.append(record)
|
||||
break
|
||||
|
||||
print(f"[FILTER] Kept {len(filtered)} records ({len(filtered)/len(data)*100:.1f}%)")
|
||||
return filtered
|
||||
|
||||
def calculate_session_stats(
|
||||
self,
|
||||
data: List[Dict[str, Any]]
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Calculate statistics by trading session.
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
|
||||
Returns:
|
||||
Session statistics
|
||||
"""
|
||||
print(f"[STATS] Calculating session statistics for {len(data)} records...")
|
||||
|
||||
stats = {
|
||||
"Asian": {"count": 0, "volume": 0, "trades": 0},
|
||||
"London": {"count": 0, "volume": 0, "trades": 0},
|
||||
"NY": {"count": 0, "volume": 0, "trades": 0}
|
||||
}
|
||||
|
||||
for record in data:
|
||||
ts_ns = record.get("ts_event") or record.get("ts_recv") or record.get("timestamp")
|
||||
|
||||
if not ts_ns:
|
||||
continue
|
||||
|
||||
# Determine session
|
||||
for session_name, session_enum in self.sessions.items():
|
||||
if self.is_in_session(int(ts_ns), session_enum):
|
||||
stats[session_name]["count"] += 1
|
||||
|
||||
# Add volume if available
|
||||
if "volume" in record:
|
||||
stats[session_name]["volume"] += record["volume"]
|
||||
|
||||
# Count trades
|
||||
if "size" in record: # Trade record
|
||||
stats[session_name]["trades"] += 1
|
||||
|
||||
break
|
||||
|
||||
# Calculate percentages
|
||||
total_count = sum(s["count"] for s in stats.values())
|
||||
for session_stats in stats.values():
|
||||
if total_count > 0:
|
||||
session_stats["percentage"] = (session_stats["count"] / total_count) * 100
|
||||
else:
|
||||
session_stats["percentage"] = 0
|
||||
|
||||
return stats
|
||||
|
||||
def filter_session_transitions(
|
||||
self,
|
||||
data: List[Dict[str, Any]],
|
||||
minutes_before: int = 30,
|
||||
minutes_after: int = 30
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Filter data to include only session transitions (handoffs).
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
minutes_before: Minutes before transition to include
|
||||
minutes_after: Minutes after transition to include
|
||||
|
||||
Returns:
|
||||
Filtered data around session transitions
|
||||
"""
|
||||
print(f"[FILTER] Extracting session transitions ({minutes_before}m before, {minutes_after}m after)...")
|
||||
|
||||
# Session transition times (in ET)
|
||||
transitions = [
|
||||
2, # Asian → London (2am ET)
|
||||
8, # London → NY (8am ET)
|
||||
16, # NY → Post-market
|
||||
18, # Post-market → Asian (6pm ET)
|
||||
]
|
||||
|
||||
filtered = []
|
||||
transition_window = timedelta(minutes=minutes_before + minutes_after)
|
||||
|
||||
for record in data:
|
||||
ts_ns = record.get("ts_event") or record.get("ts_recv") or record.get("timestamp")
|
||||
|
||||
if not ts_ns:
|
||||
continue
|
||||
|
||||
# Convert to ET hour
|
||||
ts_seconds = int(ts_ns) / 1_000_000_000
|
||||
dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
|
||||
et_offset = timedelta(hours=-5)
|
||||
dt_et = dt + et_offset
|
||||
|
||||
# Check if near any transition
|
||||
for transition_hour in transitions:
|
||||
transition_dt = dt_et.replace(hour=transition_hour, minute=0, second=0, microsecond=0)
|
||||
|
||||
# Calculate time difference
|
||||
time_diff = abs((dt_et - transition_dt).total_seconds())
|
||||
|
||||
# Include if within window
|
||||
if time_diff <= transition_window.total_seconds():
|
||||
filtered.append(record)
|
||||
break
|
||||
|
||||
print(f"[FILTER] Found {len(filtered)} records near session transitions")
|
||||
return filtered
|
||||
|
||||
def print_session_stats(self, stats: Dict[str, Any]):
|
||||
"""Print session statistics to console."""
|
||||
print("\n" + "=" * 60)
|
||||
print("SESSION STATISTICS")
|
||||
print("=" * 60)
|
||||
|
||||
for session_name in ["Asian", "London", "NY"]:
|
||||
session_stats = stats[session_name]
|
||||
print(f"\n{session_name} Session:")
|
||||
print(f" Records: {session_stats['count']:,} ({session_stats['percentage']:.1f}%)")
|
||||
if session_stats['volume'] > 0:
|
||||
print(f" Volume: {session_stats['volume']:,}")
|
||||
if session_stats['trades'] > 0:
|
||||
print(f" Trades: {session_stats['trades']:,}")
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point for CLI usage."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Filter Databento data by trading session"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--input",
|
||||
"-i",
|
||||
required=True,
|
||||
help="Input data file (JSON)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--session",
|
||||
"--sessions",
|
||||
help="Session(s) to filter (Asian, London, NY). Comma-separated for multiple."
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--transitions",
|
||||
action="store_true",
|
||||
help="Filter for session transition periods only"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--minutes-before",
|
||||
type=int,
|
||||
default=30,
|
||||
help="Minutes before transition (default: 30)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--minutes-after",
|
||||
type=int,
|
||||
default=30,
|
||||
help="Minutes after transition (default: 30)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--stats",
|
||||
action="store_true",
|
||||
help="Calculate and display session statistics"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
"-o",
|
||||
help="Output file for filtered data (JSON)"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load data
|
||||
print(f"[LOAD] Loading data from {args.input}...")
|
||||
with open(args.input, 'r') as f:
|
||||
data = json.load(f)
|
||||
|
||||
# Handle different data formats
|
||||
if isinstance(data, dict) and "data" in data:
|
||||
data = data["data"]
|
||||
|
||||
# Create filter
|
||||
session_filter = SessionFilter()
|
||||
|
||||
# Calculate stats if requested
|
||||
if args.stats:
|
||||
stats = session_filter.calculate_session_stats(data)
|
||||
session_filter.print_session_stats(stats)
|
||||
|
||||
# Filter data
|
||||
filtered_data = data
|
||||
|
||||
if args.transitions:
|
||||
# Filter for session transitions
|
||||
filtered_data = session_filter.filter_session_transitions(
|
||||
filtered_data,
|
||||
minutes_before=args.minutes_before,
|
||||
minutes_after=args.minutes_after
|
||||
)
|
||||
elif args.session:
|
||||
# Filter by specific session(s)
|
||||
sessions = [s.strip() for s in args.session.split(',')]
|
||||
|
||||
# Validate sessions
|
||||
for session in sessions:
|
||||
if session not in ["Asian", "London", "NY"]:
|
||||
print(f"[ERROR] Invalid session: {session}")
|
||||
print("[ERROR] Valid sessions: Asian, London, NY")
|
||||
sys.exit(1)
|
||||
|
||||
filtered_data = session_filter.filter_by_session(filtered_data, sessions)
|
||||
|
||||
# Save filtered data if output specified
|
||||
if args.output:
|
||||
print(f"\n[SAVE] Saving {len(filtered_data)} filtered records to {args.output}...")
|
||||
|
||||
output_data = {
|
||||
"data": filtered_data,
|
||||
"metadata": {
|
||||
"original_count": len(data),
|
||||
"filtered_count": len(filtered_data),
|
||||
"filter_type": "transitions" if args.transitions else "sessions",
|
||||
"sessions": args.session.split(',') if args.session else None
|
||||
}
|
||||
}
|
||||
|
||||
with open(args.output, 'w') as f:
|
||||
json.dump(output_data, f, indent=2)
|
||||
|
||||
print(f"[SUCCESS] Filtered data saved!")
|
||||
|
||||
print("\n[DONE] Session filtering complete!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
496
skills/databento/scripts/validate_data.py
Normal file
496
skills/databento/scripts/validate_data.py
Normal file
@@ -0,0 +1,496 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Databento Data Quality Validator
|
||||
|
||||
Validates market data quality to catch issues early:
|
||||
- Timestamp gap detection
|
||||
- Record count verification
|
||||
- Price range validation (no negative prices, outliers)
|
||||
- Duplicate timestamp detection
|
||||
- Summary quality report
|
||||
|
||||
Usage:
|
||||
python validate_data.py --input data.json
|
||||
python validate_data.py --input data.csv --schema ohlcv-1h
|
||||
python validate_data.py --input data.json --max-gap-minutes 60 --report report.json
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Any, Optional, Tuple
|
||||
from collections import defaultdict
|
||||
|
||||
|
||||
class DataValidator:
|
||||
"""Validates Databento market data quality."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
schema: str,
|
||||
max_gap_minutes: int = 60,
|
||||
price_outlier_std: float = 10.0
|
||||
):
|
||||
"""
|
||||
Initialize validator.
|
||||
|
||||
Args:
|
||||
schema: Data schema (ohlcv-1h, trades, mbp-1, etc.)
|
||||
max_gap_minutes: Maximum acceptable gap in minutes
|
||||
price_outlier_std: Standard deviations for outlier detection
|
||||
"""
|
||||
self.schema = schema
|
||||
self.max_gap_seconds = max_gap_minutes * 60
|
||||
self.price_outlier_std = price_outlier_std
|
||||
self.issues: List[Dict[str, Any]] = []
|
||||
|
||||
def validate(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Run all validation checks on data.
|
||||
|
||||
Args:
|
||||
data: List of records to validate
|
||||
|
||||
Returns:
|
||||
Validation report
|
||||
"""
|
||||
print(f"[VALIDATION] Running quality checks on {len(data)} records...")
|
||||
|
||||
report = {
|
||||
"total_records": len(data),
|
||||
"valid": True,
|
||||
"checks": {}
|
||||
}
|
||||
|
||||
if not data:
|
||||
print("[WARNING] No data to validate!")
|
||||
report["valid"] = False
|
||||
return report
|
||||
|
||||
# Run all validation checks
|
||||
report["checks"]["timestamp_gaps"] = self.check_timestamp_gaps(data)
|
||||
report["checks"]["duplicates"] = self.check_duplicates(data)
|
||||
report["checks"]["price_range"] = self.check_price_range(data)
|
||||
report["checks"]["record_count"] = self.check_record_count(data)
|
||||
report["checks"]["data_completeness"] = self.check_completeness(data)
|
||||
|
||||
# Overall validity
|
||||
report["valid"] = all(
|
||||
check.get("valid", True)
|
||||
for check in report["checks"].values()
|
||||
)
|
||||
|
||||
report["issues"] = self.issues
|
||||
|
||||
return report
|
||||
|
||||
def check_timestamp_gaps(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Check for unexpected gaps in timestamps.
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
|
||||
Returns:
|
||||
Gap check report
|
||||
"""
|
||||
print("[CHECK] Checking for timestamp gaps...")
|
||||
|
||||
gaps = []
|
||||
timestamps = self._extract_timestamps(data)
|
||||
|
||||
if len(timestamps) < 2:
|
||||
return {"valid": True, "gaps": [], "note": "Insufficient data for gap detection"}
|
||||
|
||||
# Sort timestamps
|
||||
sorted_ts = sorted(timestamps)
|
||||
|
||||
# Check gaps between consecutive timestamps
|
||||
for i in range(len(sorted_ts) - 1):
|
||||
gap_ns = sorted_ts[i + 1] - sorted_ts[i]
|
||||
gap_seconds = gap_ns / 1_000_000_000
|
||||
|
||||
if gap_seconds > self.max_gap_seconds:
|
||||
gap_info = {
|
||||
"index": i,
|
||||
"gap_seconds": gap_seconds,
|
||||
"gap_minutes": gap_seconds / 60,
|
||||
"before": self._format_timestamp(sorted_ts[i]),
|
||||
"after": self._format_timestamp(sorted_ts[i + 1])
|
||||
}
|
||||
gaps.append(gap_info)
|
||||
|
||||
self.issues.append({
|
||||
"type": "timestamp_gap",
|
||||
"severity": "warning",
|
||||
"message": f"Gap of {gap_seconds / 60:.1f} minutes detected",
|
||||
**gap_info
|
||||
})
|
||||
|
||||
valid = len(gaps) == 0
|
||||
print(f"[CHECK] Found {len(gaps)} gaps > {self.max_gap_seconds / 60} minutes")
|
||||
|
||||
return {
|
||||
"valid": valid,
|
||||
"gaps_found": len(gaps),
|
||||
"gaps": gaps[:10] if gaps else [], # Limit to first 10 for report
|
||||
"total_gaps": len(gaps)
|
||||
}
|
||||
|
||||
def check_duplicates(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Check for duplicate timestamps.
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
|
||||
Returns:
|
||||
Duplicate check report
|
||||
"""
|
||||
print("[CHECK] Checking for duplicate timestamps...")
|
||||
|
||||
timestamps = self._extract_timestamps(data)
|
||||
timestamp_counts = defaultdict(int)
|
||||
|
||||
for ts in timestamps:
|
||||
timestamp_counts[ts] += 1
|
||||
|
||||
duplicates = {ts: count for ts, count in timestamp_counts.items() if count > 1}
|
||||
|
||||
if duplicates:
|
||||
for ts, count in list(duplicates.items())[:10]: # Limit to first 10
|
||||
self.issues.append({
|
||||
"type": "duplicate_timestamp",
|
||||
"severity": "error",
|
||||
"timestamp": self._format_timestamp(ts),
|
||||
"count": count,
|
||||
"message": f"Timestamp appears {count} times"
|
||||
})
|
||||
|
||||
valid = len(duplicates) == 0
|
||||
print(f"[CHECK] Found {len(duplicates)} duplicate timestamps")
|
||||
|
||||
return {
|
||||
"valid": valid,
|
||||
"duplicates_found": len(duplicates),
|
||||
"duplicate_timestamps": len(duplicates)
|
||||
}
|
||||
|
||||
def check_price_range(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Check for invalid or outlier prices.
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
|
||||
Returns:
|
||||
Price range check report
|
||||
"""
|
||||
print("[CHECK] Checking price ranges...")
|
||||
|
||||
prices = self._extract_prices(data)
|
||||
|
||||
if not prices:
|
||||
return {"valid": True, "note": "No price data to validate"}
|
||||
|
||||
# Check for negative prices
|
||||
negative_prices = [p for p in prices if p < 0]
|
||||
|
||||
# Check for zero prices (unusual for ES/NQ)
|
||||
zero_prices = [p for p in prices if p == 0]
|
||||
|
||||
# Calculate statistics for outlier detection
|
||||
if len(prices) > 1:
|
||||
mean_price = sum(prices) / len(prices)
|
||||
variance = sum((p - mean_price) ** 2 for p in prices) / len(prices)
|
||||
std_dev = variance ** 0.5
|
||||
|
||||
# Detect outliers (> N standard deviations from mean)
|
||||
outliers = []
|
||||
for p in prices:
|
||||
if abs(p - mean_price) > (self.price_outlier_std * std_dev):
|
||||
outliers.append(p)
|
||||
if len(outliers) <= 10: # Limit issues
|
||||
self.issues.append({
|
||||
"type": "price_outlier",
|
||||
"severity": "warning",
|
||||
"price": p,
|
||||
"mean": mean_price,
|
||||
"std_dev": std_dev,
|
||||
"message": f"Price {p:.2f} is {abs(p - mean_price) / std_dev:.1f} std devs from mean"
|
||||
})
|
||||
else:
|
||||
outliers = []
|
||||
mean_price = prices[0] if prices else 0
|
||||
std_dev = 0
|
||||
|
||||
# Report negative prices as errors
|
||||
for p in negative_prices[:10]: # Limit to first 10
|
||||
self.issues.append({
|
||||
"type": "negative_price",
|
||||
"severity": "error",
|
||||
"price": p,
|
||||
"message": f"Negative price detected: {p}"
|
||||
})
|
||||
|
||||
valid = len(negative_prices) == 0 and len(zero_prices) == 0
|
||||
|
||||
print(f"[CHECK] Price range: {min(prices):.2f} to {max(prices):.2f}")
|
||||
print(f"[CHECK] Negative prices: {len(negative_prices)}, Zero prices: {len(zero_prices)}, Outliers: {len(outliers)}")
|
||||
|
||||
return {
|
||||
"valid": valid,
|
||||
"min_price": min(prices),
|
||||
"max_price": max(prices),
|
||||
"mean_price": mean_price,
|
||||
"std_dev": std_dev,
|
||||
"negative_prices": len(negative_prices),
|
||||
"zero_prices": len(zero_prices),
|
||||
"outliers": len(outliers)
|
||||
}
|
||||
|
||||
def check_record_count(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Verify expected record count.
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
|
||||
Returns:
|
||||
Record count check report
|
||||
"""
|
||||
print(f"[CHECK] Verifying record count: {len(data)} records")
|
||||
|
||||
# For OHLCV data, can estimate expected count based on timeframe
|
||||
expected_count = self._estimate_expected_count(data)
|
||||
|
||||
valid = True
|
||||
if expected_count and abs(len(data) - expected_count) > (expected_count * 0.1):
|
||||
# More than 10% deviation
|
||||
valid = False
|
||||
self.issues.append({
|
||||
"type": "unexpected_record_count",
|
||||
"severity": "warning",
|
||||
"actual": len(data),
|
||||
"expected": expected_count,
|
||||
"message": f"Expected ~{expected_count} records, got {len(data)}"
|
||||
})
|
||||
|
||||
return {
|
||||
"valid": valid,
|
||||
"actual_count": len(data),
|
||||
"expected_count": expected_count,
|
||||
"note": "Expected count is estimated based on schema and date range"
|
||||
}
|
||||
|
||||
def check_completeness(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""
|
||||
Check data completeness (required fields present).
|
||||
|
||||
Args:
|
||||
data: List of records
|
||||
|
||||
Returns:
|
||||
Completeness check report
|
||||
"""
|
||||
print("[CHECK] Checking data completeness...")
|
||||
|
||||
if not data:
|
||||
return {"valid": False, "note": "No data"}
|
||||
|
||||
# Check required fields based on schema
|
||||
required_fields = self._get_required_fields()
|
||||
|
||||
missing_fields = defaultdict(int)
|
||||
for record in data[:100]: # Sample first 100 records
|
||||
for field in required_fields:
|
||||
if field not in record or record[field] is None:
|
||||
missing_fields[field] += 1
|
||||
|
||||
if missing_fields:
|
||||
for field, count in missing_fields.items():
|
||||
self.issues.append({
|
||||
"type": "missing_field",
|
||||
"severity": "error",
|
||||
"field": field,
|
||||
"missing_count": count,
|
||||
"message": f"Field '{field}' missing in {count} records (sampled)"
|
||||
})
|
||||
|
||||
valid = len(missing_fields) == 0
|
||||
|
||||
return {
|
||||
"valid": valid,
|
||||
"missing_fields": dict(missing_fields) if missing_fields else {}
|
||||
}
|
||||
|
||||
def _extract_timestamps(self, data: List[Dict[str, Any]]) -> List[int]:
|
||||
"""Extract timestamps from records."""
|
||||
timestamps = []
|
||||
for record in data:
|
||||
# Try different timestamp field names
|
||||
ts = record.get("ts_event") or record.get("ts_recv") or record.get("timestamp")
|
||||
if ts:
|
||||
timestamps.append(int(ts))
|
||||
return timestamps
|
||||
|
||||
def _extract_prices(self, data: List[Dict[str, Any]]) -> List[float]:
|
||||
"""Extract prices from records."""
|
||||
prices = []
|
||||
for record in data:
|
||||
# For OHLCV, use close price
|
||||
if "close" in record:
|
||||
# Convert from fixed-point if needed
|
||||
price = record["close"]
|
||||
if isinstance(price, int) and price > 1_000_000:
|
||||
price = price / 1_000_000_000 # Fixed-point conversion
|
||||
prices.append(float(price))
|
||||
# For trades/mbp, use price field
|
||||
elif "price" in record:
|
||||
price = record["price"]
|
||||
if isinstance(price, int) and price > 1_000_000:
|
||||
price = price / 1_000_000_000
|
||||
prices.append(float(price))
|
||||
return prices
|
||||
|
||||
def _format_timestamp(self, ts_ns: int) -> str:
|
||||
"""Format nanosecond timestamp to readable string."""
|
||||
ts_seconds = ts_ns / 1_000_000_000
|
||||
dt = datetime.fromtimestamp(ts_seconds)
|
||||
return dt.strftime("%Y-%m-%d %H:%M:%S")
|
||||
|
||||
def _estimate_expected_count(self, data: List[Dict[str, Any]]) -> Optional[int]:
|
||||
"""Estimate expected record count based on schema and date range."""
|
||||
# This is a simplified estimation
|
||||
# In practice, would calculate based on actual date range
|
||||
if "ohlcv" in self.schema:
|
||||
if "1h" in self.schema:
|
||||
return None # ~24 records per day per symbol
|
||||
elif "1d" in self.schema:
|
||||
return None # ~1 record per day per symbol
|
||||
return None
|
||||
|
||||
def _get_required_fields(self) -> List[str]:
|
||||
"""Get required fields for schema."""
|
||||
base_fields = ["ts_event", "ts_recv"]
|
||||
|
||||
if "ohlcv" in self.schema:
|
||||
return base_fields + ["open", "high", "low", "close", "volume"]
|
||||
elif self.schema == "trades":
|
||||
return base_fields + ["price", "size"]
|
||||
elif "mbp" in self.schema:
|
||||
return base_fields + ["bid_px_00", "ask_px_00", "bid_sz_00", "ask_sz_00"]
|
||||
else:
|
||||
return base_fields
|
||||
|
||||
def print_report(self, report: Dict[str, Any]):
|
||||
"""Print validation report to console."""
|
||||
print("\n" + "=" * 60)
|
||||
print("DATA VALIDATION REPORT")
|
||||
print("=" * 60)
|
||||
|
||||
print(f"\nTotal Records: {report['total_records']}")
|
||||
print(f"Overall Valid: {'✓ YES' if report['valid'] else '✗ NO'}")
|
||||
|
||||
print("\n" + "-" * 60)
|
||||
print("CHECK RESULTS")
|
||||
print("-" * 60)
|
||||
|
||||
for check_name, check_result in report["checks"].items():
|
||||
status = "✓" if check_result.get("valid", True) else "✗"
|
||||
print(f"\n{status} {check_name.replace('_', ' ').title()}")
|
||||
for key, value in check_result.items():
|
||||
if key != "valid" and key != "gaps":
|
||||
print(f" {key}: {value}")
|
||||
|
||||
if report["issues"]:
|
||||
print("\n" + "-" * 60)
|
||||
print(f"ISSUES FOUND ({len(report['issues'])})")
|
||||
print("-" * 60)
|
||||
for i, issue in enumerate(report["issues"][:20], 1): # Limit to 20
|
||||
print(f"\n{i}. [{issue['severity'].upper()}] {issue['type']}")
|
||||
print(f" {issue['message']}")
|
||||
|
||||
if len(report["issues"]) > 20:
|
||||
print(f"\n... and {len(report['issues']) - 20} more issues")
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point for CLI usage."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Validate Databento market data quality"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--input",
|
||||
"-i",
|
||||
required=True,
|
||||
help="Input data file (JSON or CSV)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--schema",
|
||||
default="ohlcv-1h",
|
||||
help="Data schema (default: ohlcv-1h)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--max-gap-minutes",
|
||||
type=int,
|
||||
default=60,
|
||||
help="Maximum acceptable gap in minutes (default: 60)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--price-outlier-std",
|
||||
type=float,
|
||||
default=10.0,
|
||||
help="Standard deviations for outlier detection (default: 10.0)"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--report",
|
||||
"-r",
|
||||
help="Save report to JSON file"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load data
|
||||
print(f"[LOAD] Loading data from {args.input}...")
|
||||
with open(args.input, 'r') as f:
|
||||
data = json.load(f)
|
||||
|
||||
# Handle different data formats
|
||||
if isinstance(data, dict) and "data" in data:
|
||||
data = data["data"]
|
||||
|
||||
# Create validator
|
||||
validator = DataValidator(
|
||||
schema=args.schema,
|
||||
max_gap_minutes=args.max_gap_minutes,
|
||||
price_outlier_std=args.price_outlier_std
|
||||
)
|
||||
|
||||
# Run validation
|
||||
report = validator.validate(data)
|
||||
|
||||
# Print report
|
||||
validator.print_report(report)
|
||||
|
||||
# Save report if requested
|
||||
if args.report:
|
||||
print(f"\n[SAVE] Saving report to {args.report}...")
|
||||
with open(args.report, 'w') as f:
|
||||
json.dump(report, f, indent=2)
|
||||
print(f"[SUCCESS] Report saved!")
|
||||
|
||||
# Exit with appropriate code
|
||||
sys.exit(0 if report["valid"] else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Reference in New Issue
Block a user