Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:43:40 +08:00
commit d6cdda3f30
13 changed files with 4664 additions and 0 deletions

View File

@@ -0,0 +1,372 @@
# Databento Schema Reference
Comprehensive documentation of Databento schemas with field-level details, data types, and usage guidance.
## Schema Overview
Databento provides 12+ schema types representing different granularity levels of market data. All schemas share common timestamp fields for consistency.
## Common Fields (All Schemas)
Every schema includes these timestamp fields:
| Field | Type | Description | Unit |
|-------|------|-------------|------|
| `ts_event` | uint64 | Event timestamp from venue | Nanoseconds (Unix epoch) |
| `ts_recv` | uint64 | Databento gateway receipt time | Nanoseconds (Unix epoch) |
**Important:** Databento provides up to 4 timestamps per event for sub-microsecond accuracy.
## OHLCV Schemas
Candlestick/bar data at various time intervals.
### ohlcv-1s (1 Second Bars)
### ohlcv-1m (1 Minute Bars)
### ohlcv-1h (1 Hour Bars)
### ohlcv-1d (Daily Bars)
### ohlcv-eod (End of Day)
**Common OHLCV Fields:**
| Field | Type | Description | Unit |
|-------|------|-------------|------|
| `open` | int64 | Opening price | Fixed-point (divide by 1e9 for decimal) |
| `high` | int64 | Highest price | Fixed-point (divide by 1e9 for decimal) |
| `low` | int64 | Lowest price | Fixed-point (divide by 1e9 for decimal) |
| `close` | int64 | Closing price | Fixed-point (divide by 1e9 for decimal) |
| `volume` | uint64 | Total volume | Contracts/shares |
**When to Use:**
- **1h/1d**: Historical backtesting, multi-day analysis
- **1m**: Intraday strategy development
- **1s**: High-frequency analysis (use batch for large ranges)
- **eod**: Long-term investment analysis
**Pricing Format:**
Prices are in fixed-point notation. To convert to decimal:
```
decimal_price = int64_price / 1_000_000_000
```
For ES futures at 4500.00, the value would be stored as `4500000000000`.
## Trades Schema
Individual trade executions with price, size, and side information.
| Field | Type | Description | Values |
|-------|------|-------------|--------|
| `price` | int64 | Trade execution price | Fixed-point (÷ 1e9) |
| `size` | uint32 | Trade size | Contracts/shares |
| `action` | char | Trade action | 'T' = trade, 'C' = cancel |
| `side` | char | Aggressor side | 'B' = buy, 'S' = sell, 'N' = none |
| `flags` | uint8 | Trade flags | Bitmask |
| `depth` | uint8 | Depth level | Usually 0 |
| `ts_in_delta` | int32 | Time delta | Nanoseconds |
| `sequence` | uint32 | Sequence number | Venue-specific |
**When to Use:**
- Intraday order flow analysis
- Tick-by-tick backtesting
- Market microstructure research
- Volume profile analysis
**Aggressor Side:**
- `B` = Buy-side aggressor (market buy hit the ask)
- `S` = Sell-side aggressor (market sell hit the bid)
- `N` = Cannot be determined or not applicable
**Important:** For multi-day tick data, use batch downloads. Trades can generate millions of records per day.
## MBP-1 Schema (Market By Price - Top of Book)
Level 1 order book data showing best bid and ask.
| Field | Type | Description | Values |
|-------|------|-------------|--------|
| `price` | int64 | Reference price (usually last trade) | Fixed-point (÷ 1e9) |
| `size` | uint32 | Reference size | Contracts/shares |
| `action` | char | Book action | 'A' = add, 'C' = cancel, 'M' = modify, 'T' = trade |
| `side` | char | Order side | 'B' = bid, 'A' = ask, 'N' = none |
| `flags` | uint8 | Flags | Bitmask |
| `depth` | uint8 | Depth level | Always 0 for MBP-1 |
| `ts_in_delta` | int32 | Time delta | Nanoseconds |
| `sequence` | uint32 | Sequence number | Venue-specific |
| `bid_px_00` | int64 | Best bid price | Fixed-point (÷ 1e9) |
| `ask_px_00` | int64 | Best ask price | Fixed-point (÷ 1e9) |
| `bid_sz_00` | uint32 | Best bid size | Contracts/shares |
| `ask_sz_00` | uint32 | Best ask size | Contracts/shares |
| `bid_ct_00` | uint32 | Bid order count | Number of orders |
| `ask_ct_00` | uint32 | Ask order count | Number of orders |
**When to Use:**
- Bid/ask spread analysis
- Liquidity analysis
- Market microstructure studies
- Quote-based strategies
**Key Metrics:**
```
spread = ask_px_00 - bid_px_00
mid_price = (bid_px_00 + ask_px_00) / 2
bid_ask_imbalance = (bid_sz_00 - ask_sz_00) / (bid_sz_00 + ask_sz_00)
```
## MBP-10 Schema (Market By Price - 10 Levels)
Level 2 order book data showing 10 levels of depth.
**Fields:** Same as MBP-1, plus 9 additional levels:
- `bid_px_01` through `bid_px_09` (10 bid levels)
- `ask_px_01` through `ask_px_09` (10 ask levels)
- `bid_sz_01` through `bid_sz_09`
- `ask_sz_01` through `ask_sz_09`
- `bid_ct_01` through `bid_ct_09`
- `ask_ct_01` through `ask_ct_09`
**When to Use:**
- Order book depth analysis
- Liquidity beyond top of book
- Order flow imbalance at multiple levels
- Market impact modeling
**Important:** MBP-10 generates significantly more data than MBP-1. Use batch downloads for multi-day requests.
## MBO Schema (Market By Order)
Level 3 order-level data with individual order IDs - most granular.
| Field | Type | Description | Values |
|-------|------|-------------|--------|
| `order_id` | uint64 | Unique order ID | Venue-specific |
| `price` | int64 | Order price | Fixed-point (÷ 1e9) |
| `size` | uint32 | Order size | Contracts/shares |
| `flags` | uint8 | Flags | Bitmask |
| `channel_id` | uint8 | Channel ID | Venue-specific |
| `action` | char | Order action | 'A' = add, 'C' = cancel, 'M' = modify, 'F' = fill, 'T' = trade |
| `side` | char | Order side | 'B' = bid, 'A' = ask, 'N' = none |
| `ts_in_delta` | int32 | Time delta | Nanoseconds |
| `sequence` | uint32 | Sequence number | Venue-specific |
**When to Use:**
- Highest granularity order flow analysis
- Order-level reconstructions
- Advanced market microstructure research
- Queue position analysis
**Important:** MBO data is extremely granular and generates massive datasets. Always use batch downloads and carefully check costs.
## Definition Schema
Instrument metadata and definitions.
| Field | Type | Description |
|-------|------|-------------|
| `ts_recv` | uint64 | Receipt timestamp |
| `min_price_increment` | int64 | Minimum tick size |
| `display_factor` | int64 | Display factor for prices |
| `expiration` | uint64 | Contract expiration timestamp |
| `activation` | uint64 | Contract activation timestamp |
| `high_limit_price` | int64 | Upper price limit |
| `low_limit_price` | int64 | Lower price limit |
| `max_price_variation` | int64 | Maximum price move |
| `trading_reference_price` | int64 | Reference price |
| `unit_of_measure_qty` | int64 | Contract size |
| `min_price_increment_amount` | int64 | Tick value |
| `price_ratio` | int64 | Price ratio |
| `inst_attrib_value` | int32 | Instrument attributes |
| `underlying_id` | uint32 | Underlying instrument ID |
| `raw_instrument_id` | uint32 | Raw instrument ID |
| `market_depth_implied` | int32 | Implied depth |
| `market_depth` | int32 | Market depth |
| `market_segment_id` | uint32 | Market segment |
| `max_trade_vol` | uint32 | Maximum trade volume |
| `min_lot_size` | int32 | Minimum lot size |
| `min_lot_size_block` | int32 | Block trade minimum |
| `min_lot_size_round_lot` | int32 | Round lot minimum |
| `min_trade_vol` | uint32 | Minimum trade volume |
| `contract_multiplier` | int32 | Contract multiplier |
| `decay_quantity` | int32 | Decay quantity |
| `original_contract_size` | int32 | Original size |
| `trading_reference_date` | uint16 | Reference date |
| `appl_id` | int16 | Application ID |
| `maturity_year` | uint16 | Year |
| `decay_start_date` | uint16 | Decay start |
| `channel_id` | uint16 | Channel |
| `currency` | string | Currency code |
| `settl_currency` | string | Settlement currency |
| `secsubtype` | string | Security subtype |
| `raw_symbol` | string | Raw symbol |
| `group` | string | Instrument group |
| `exchange` | string | Exchange code |
| `asset` | string | Asset class |
| `cfi` | string | CFI code |
| `security_type` | string | Security type |
| `unit_of_measure` | string | Unit of measure |
| `underlying` | string | Underlying symbol |
| `strike_price_currency` | string | Strike currency |
| `instrument_class` | char | Class |
| `strike_price` | int64 | Strike price (options) |
| `match_algorithm` | char | Matching algorithm |
| `md_security_trading_status` | uint8 | Trading status |
| `main_fraction` | uint8 | Main fraction |
| `price_display_format` | uint8 | Display format |
| `settl_price_type` | uint8 | Settlement type |
| `sub_fraction` | uint8 | Sub fraction |
| `underlying_product` | uint8 | Underlying product |
| `security_update_action` | char | Update action |
| `maturity_month` | uint8 | Month |
| `maturity_day` | uint8 | Day |
| `maturity_week` | uint8 | Week |
| `user_defined_instrument` | char | User-defined |
| `contract_multiplier_unit` | int8 | Multiplier unit |
| `flow_schedule_type` | int8 | Flow schedule |
| `tick_rule` | uint8 | Tick rule |
**When to Use:**
- Understanding instrument specifications
- Calculating tick values
- Contract expiration management
- Symbol resolution and mapping
**Key Fields for ES/NQ:**
- `min_price_increment`: Tick size (0.25 for ES, 0.25 for NQ)
- `expiration`: Contract expiration timestamp
- `raw_symbol`: Exchange symbol
- `contract_multiplier`: Usually 50 for ES, 20 for NQ
## Statistics Schema
Market statistics and calculated metrics.
| Field | Type | Description |
|-------|------|-------------|
| `ts_recv` | uint64 | Receipt timestamp |
| `ts_ref` | uint64 | Reference timestamp |
| `price` | int64 | Reference price |
| `quantity` | int64 | Reference quantity |
| `sequence` | uint32 | Sequence number |
| `ts_in_delta` | int32 | Time delta |
| `stat_type` | uint16 | Statistic type |
| `channel_id` | uint16 | Channel ID |
| `update_action` | uint8 | Update action |
| `stat_flags` | uint8 | Statistic flags |
**Common Statistic Types:**
- Opening price
- Settlement price
- High/low prices
- Trading volume
- Open interest
**When to Use:**
- Official settlement prices
- Open interest analysis
- Exchange-calculated statistics
## Status Schema
Instrument trading status and state changes.
| Field | Type | Description |
|-------|------|-------------|
| `ts_recv` | uint64 | Receipt timestamp |
| `ts_event` | uint64 | Event timestamp |
| `action` | uint16 | Status action |
| `reason` | uint16 | Status reason |
| `trading_event` | uint16 | Trading event |
| `is_trading` | int8 | Trading flag (1 = trading, 0 = not trading) |
| `is_quoting` | int8 | Quoting flag |
| `is_short_sell_restricted` | int8 | Short sell flag |
**When to Use:**
- Detecting trading halts
- Understanding market status changes
- Filtering data by trading status
## Imbalance Schema
Order imbalance data for auctions and closes.
| Field | Type | Description |
|-------|------|-------------|
| `ts_recv` | uint64 | Receipt timestamp |
| `ts_event` | uint64 | Event timestamp |
| `ref_price` | int64 | Reference price |
| `auction_time` | uint64 | Auction timestamp |
| `cont_book_clr_price` | int64 | Continuous book clearing price |
| `auct_interest_clr_price` | int64 | Auction interest clearing price |
| `paired_qty` | uint64 | Paired quantity |
| `total_imbalance_qty` | uint64 | Total imbalance |
| `side` | char | Imbalance side ('B' or 'A') |
| `significant_imbalance` | char | Significance flag |
**When to Use:**
- Opening/closing auction analysis
- Imbalance trading strategies
- End-of-day positioning
## Schema Selection Decision Matrix
| Analysis Type | Recommended Schema | Alternative |
|---------------|-------------------|-------------|
| Daily backtesting | ohlcv-1d | ohlcv-1h |
| Intraday backtesting | ohlcv-1h, ohlcv-1m | trades |
| Spread analysis | mbp-1 | trades |
| Order flow | trades | mbp-1 |
| Market depth | mbp-10 | mbo |
| Tick-by-tick | trades | mbo |
| Liquidity analysis | mbp-1, mbp-10 | mbo |
| Contract specifications | definition | - |
| Settlement prices | statistics | definition |
| Trading halts | status | - |
| Auction analysis | imbalance | trades |
## Data Type Reference
### Fixed-Point Prices
All price fields are stored as int64 in fixed-point notation with 9 decimal places of precision.
**Conversion:**
```python
decimal_price = int64_price / 1_000_000_000
```
**Example:**
- ES at 4500.25 → stored as 4500250000000
- NQ at 15000.50 → stored as 15000500000000
### Timestamps
All timestamps are uint64 nanoseconds since Unix epoch (1970-01-01 00:00:00 UTC).
**Conversion to datetime:**
```python
import datetime
dt = datetime.datetime.fromtimestamp(ts_event / 1_000_000_000, tz=datetime.timezone.utc)
```
### Character Fields
Single-character fields (char) represent enums:
- Action: 'A' (add), 'C' (cancel), 'M' (modify), 'T' (trade), 'F' (fill)
- Side: 'B' (bid), 'A' (ask), 'N' (none/unknown)
## Performance Considerations
### Schema Size (Approximate bytes per record)
| Schema | Size | Records/GB |
|--------|------|------------|
| ohlcv-1d | ~100 | ~10M |
| ohlcv-1h | ~100 | ~10M |
| trades | ~50 | ~20M |
| mbp-1 | ~150 | ~6.7M |
| mbp-10 | ~500 | ~2M |
| mbo | ~80 | ~12.5M |
**Planning requests:**
- 1 day of ES trades ≈ 100K-500K records ≈ 5-25 MB
- 1 day of ES mbp-1 ≈ 1M-5M records ≈ 150-750 MB
- 1 year of ES ohlcv-1h ≈ 6K records ≈ 600 KB
Use these estimates to decide between timeseries (< 5GB) and batch downloads (> 5GB).