Design: data schema for opts ts storage #24

Open
opened 2025-02-06 18:02:20 +00:00 by ntorres · 5 comments
Collaborator

The fundamental “source” data schema we need to store for computing max pain at this moment:

```python
{
    'timestamp': int,
    'strike_price': int,
    'option_type': str,
    'open_interest': Decimal,
}
```

Maybe we need to evaluate which data we want to add to the above data schema.
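For context, here's a minimal sketch of how max pain could be computed from rows of exactly this schema (the `max_pain()` helper and the `'call'`/`'put'` labels are hypothetical, not existing piker code):

```python
from decimal import Decimal

def max_pain(rows: list[dict]) -> int:
    '''
    Return the strike at which total payout to option holders
    is minimized, given one row per contract-mkt with its
    latest `open_interest`.

    '''
    strikes: set[int] = {r['strike_price'] for r in rows}

    def payout_at(settle: int) -> Decimal:
        total = Decimal(0)
        for r in rows:
            k: int = r['strike_price']
            oi: Decimal = r['open_interest']
            if r['option_type'] == 'call':
                total += oi * max(0, settle - k)
            else:  # 'put'
                total += oi * max(0, k - settle)
        return total

    return min(strikes, key=payout_at)
```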

For the storage we need to add methods analogous to write_ohlcv() and read_ohlcv() for the new data schema: write_oi() and read_oi() (of course, this can be changed)

Also, add a method like mk_ohlcv_shm_keyed_filepath() for building the open interest shm keyed filepath: mk_oi_shm_keyed_filepath()
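A rough signature sketch of those additions, assuming a simple polars-on-parquet backing (the names just mirror the ohlcv variants; the actual storage internals are an assumption here):

```python
from pathlib import Path

import polars as pl

def mk_oi_shm_keyed_filepath(
    fqme: str,
    datadir: Path,
) -> Path:
    # mirror mk_ohlcv_shm_keyed_filepath() but for OI series
    return datadir / f'{fqme}.oi.parquet'

def write_oi(
    fqme: str,
    oi: pl.DataFrame,
    datadir: Path,
) -> Path:
    path: Path = mk_oi_shm_keyed_filepath(fqme, datadir)
    oi.write_parquet(path)
    return path

def read_oi(
    fqme: str,
    datadir: Path,
) -> pl.DataFrame:
    return pl.read_parquet(mk_oi_shm_keyed_filepath(fqme, datadir))
```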

ntorres changed title from Add storage support for derivs to Design: data schema for opts ts storage 2025-02-18 16:39:50 +00:00

@ntorres one thing i can see immediately missing is an expiry field 😉


One in-depth question i have (and don’t have an answer for yet) is whether we want to orient the table schema such that all contracts can be interleaved in one big continuous table (for say a given strike/expiry/type) and then broken up for processing/viewing via a polars unpack, or whether we should keep all contracts in entirely separate files for long term storage and then expect the .parquet loader code to do the correct thing (parsing of filenames) to make it easy to load multiple contract-mkts at once?


Ok, yeah after a little thought i figured i’d make a table diagram (see attached) of what i envision being the most useful joined end-table for analysis and processing (like max-pain).

I’m hoping to re-render this using a d2 table (maybe inside a larger diagram) and/or possibly using a real screen shot from

summary of approaches


granular per-mkt named by FQME files

the most file-granular approach would be to keep the table fields relatively simple, with a minimum of 2 and possibly an optional 3rd:

  • time[_ns]: int|float, the timestamp of the OI update
  • oi: float|int, the actual purported open interest of the contract as reported by a provider (feed)
  • oi_calc: float|int, the OI calculated by us (the piker options sys) based on a mkt’s prior state (initial OI) and an ongoing (real-time) vlm feed (likely ohlcv of some sort)

this would mean we leverage a unique fqme (pikers/piker#467) schema to index mkts in the file sys very distinctly, in such a way that,

  • it’s easy to tell which .parquet file contains which deriv mkt’s data by glancing at the filesys
    • this also makes it easy to use tools like visidata (which yes, supports parquet ;) from the console
  • we can be more flexible with how backends provide more than the minimum required field set, just like we do currently for the ohlcv time series historical trades data B)
    • means the downstream piker.fsp and piker.ui consumers can be more flexible with logic around certain fields existing optionally from various providers and can then dynamically process/display different output
  • the 2nd approach below (the fully joined table from all provider data feeds) can be rendered using an aggregation operation in polars when needed (see the loader sketch below), with the further flexibility that not all mkts for a given “super scope” need to be loaded into mem if the user doesn’t require it
    • this comes obvi at the higher cost of conducting more frequent load and aggregate ops when a user dynamically wants to change the data set they’re using
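As a concrete example of that loader-side flexibility, a hedged sketch (the load_contract_mkts() name, file naming, and glob pattern are all assumptions) which lazily scans a set of per-mkt files and only collects the aggregate when asked:

```python
from pathlib import Path

import polars as pl

def load_contract_mkts(
    datadir: Path,
    pattern: str = '*.deribit.parquet',
) -> pl.DataFrame:
    '''
    Lazily scan every per-mkt `.parquet` file matching `pattern`,
    tag each frame with the fqme parsed from its filename, then
    vertically concat into a single table on demand.

    '''
    frames: list[pl.LazyFrame] = [
        pl.scan_parquet(path).with_columns(
            pl.lit(path.name.removesuffix('.parquet')).alias('fqme')
        )
        for path in datadir.glob(pattern)
    ]
    return pl.concat(frames).collect()
```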

all-in-one-table NON-granular, which would mean fewer .parquet files

not my preferred approach since it means constantly having to load a large table scoped by some grander mkt-property (like provider or expiry or settling asset etc.)

However, this would likely result in a much simpler .piker.storage implementation as well as improved loading performance for very large (multi asset) derivatives data sets, since all “sub contract mkts” could be allocated in a single table like the one originally put in the descr:


```python
{
    'time[_ns]': int|float,
    'expiry': Datetime|str,
    'strike_price': int|float,
    'option_type': str|OptType,
    'oi': Decimal,
    'oi_calc': Decimal,
}
```
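With that joined layout, breaking the table back up per-contract is a single polars op; a minimal sketch, assuming the column names above:

```python
import polars as pl

def unpack_contracts(
    df: pl.DataFrame,
) -> dict[tuple, pl.DataFrame]:
    # split the all-in-one table back into per-contract frames
    # keyed by (expiry, strike_price, option_type)
    return df.partition_by(
        ['expiry', 'strike_price', 'option_type'],
        as_dict=True,
    )
```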

Another impl detail wrt how a datad provider can offer OI info for a deriv..

Likely they provide it in one of two ways (both normalizable to a single update shape, as sketched after this list),

  • as an additional field oi/oi_calc updated with every clearing event (every tick that contains non-zero vlm) or,
  • as a separate API-feed endpoint (more commonly i expect to see this like with deribit)
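Either way the backend could normalize both delivery styles into one update record; a hypothetical sketch (the norm_oi_update() helper and msg field names are assumptions mirroring the schema above):

```python
import time

def norm_oi_update(msg: dict) -> dict:
    '''
    Normalize either an `oi` field piggybacked on a clearing
    tick, or a dedicated OI-feed msg, into one record matching
    the storage schema.

    '''
    return {
        'time_ns': msg.get('timestamp') or time.time_ns(),
        'oi': msg.get('open_interest'),  # provider reported, if any
        'oi_calc': None,  # filled in later by our own vlm-based calc
    }
```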

how we want to aggregate

given there are likely going to be 2 feeds for most providers,

  • a normal trades/book data feed both live and historical (as ohlcv)
  • a derivs info feed which delivers OI updates and any other contract-type measures

we need a way to merge these to enable getting a single table for processing such that you can easily get the normal trade event info (ticks) but with at least an added oi/oi_calc field included as a column.
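One way to get that merged table (assuming both series carry a time column in the same unit) is a polars asof join, taking the latest OI update at or before each clearing row’s timestamp:

```python
import polars as pl

def join_oi_onto_ohlcv(
    ohlcv: pl.DataFrame,  # time, open, high, low, close, volume
    oi: pl.DataFrame,     # time, oi, oi_calc
) -> pl.DataFrame:
    # for each ohlcv row, forward-fill the most recent
    # OI update by timestamp
    return ohlcv.sort('time').join_asof(
        oi.sort('time'),
        on='time',
        strategy='backward',
    )
```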

Poster
Collaborator

I’m debating which “table storage” approach to take; if we have a solid fqme, I think it’s better to go for the granular one (granular per-mkt named by FQME files), and then we can load whatever fqme we want

Poster
Collaborator

for deribit a fqme looks like this: btc-26feb25-90k-c.reversed_option.deribit, we already have: currency, expiry_date, strike_price, instrument_kind, option_type and exchange, so we need to store for each fqme this:

```python
{
    'time[_ns]': int|float,
    'oi': Decimal,
    'oi_calc': Decimal,
}
```

And then write all the machinery necessary to handle the fqme files (something like write_derv, get_by_strike, get_by_expiry_date, etc)
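A hedged sketch of that machinery (the fqme field layout is inferred from the example above; helper names just follow the suggestions):

```python
from pathlib import Path

def parse_deribit_fqme(fqme: str) -> dict:
    '''
    Split eg. `btc-26feb25-90k-c.reversed_option.deribit`
    into its embedded mkt fields.

    '''
    symbol, kind, exchange = fqme.split('.')
    currency, expiry, strike, otype = symbol.split('-')
    return {
        'currency': currency,
        'expiry_date': expiry,
        'strike_price': strike,  # eg. '90k'
        'option_type': 'call' if otype == 'c' else 'put',
        'instrument_kind': kind,
        'exchange': exchange,
    }

def get_by_expiry_date(
    datadir: Path,
    expiry: str,  # eg. '26feb25'
) -> list[Path]:
    # pick out the per-mkt files whose fqme embeds this expiry
    return [
        path for path in datadir.glob('*.deribit.parquet')
        if parse_deribit_fqme(
            path.name.removesuffix('.parquet')
        )['expiry_date'] == expiry
    ]
```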
