Design: data schema for opts ts storage #24

Open
opened 2025-02-06 18:02:20 +00:00 by ntorres · 5 comments
Collaborator

The fundamental “source” data schema we need to store for computing max pain at this moment:

```python
{
    'timestamp': int,
    'strike_price': int,
    'option_type': str,
    'open_interest': Decimal,
}
```

Maybe we need to evaluate which data we want to add to the above data schema.
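For context, here's a minimal sketch of how max pain could be computed from rows of exactly this schema (the `max_pain()` helper and the `'call'`/`'put'` labels are hypothetical, not existing piker code):

```python
from decimal import Decimal

def max_pain(rows: list[dict]) -> int:
    '''
    Return the strike at which total payout to option holders
    is minimized, given one row per contract-mkt with its
    latest `open_interest`.

    '''
    strikes: set[int] = {r['strike_price'] for r in rows}

    def payout_at(settle: int) -> Decimal:
        total = Decimal(0)
        for r in rows:
            k: int = r['strike_price']
            oi: Decimal = r['open_interest']
            if r['option_type'] == 'call':
                total += oi * max(0, settle - k)
            else:  # 'put'
                total += oi * max(0, k - settle)
        return total

    return min(strikes, key=payout_at)
```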

For the storage we need to add methods analogous to write_ohlcv() and read_ohlcv() for the new data schema: write_oi() and read_oi() (of course, this can be changed)

Also, add a method like mk_ohlcv_shm_keyed_filepath() for building the open interest shm keyed filepath: mk_oi_shm_keyed_filepath()
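A rough signature sketch of those additions, assuming a simple polars-on-parquet backing (the names just mirror the ohlcv variants; the actual storage internals are an assumption here):

```python
from pathlib import Path

import polars as pl

def mk_oi_shm_keyed_filepath(
    fqme: str,
    datadir: Path,
) -> Path:
    # mirror mk_ohlcv_shm_keyed_filepath() but for OI series
    return datadir / f'{fqme}.oi.parquet'

def write_oi(
    fqme: str,
    oi: pl.DataFrame,
    datadir: Path,
) -> Path:
    path: Path = mk_oi_shm_keyed_filepath(fqme, datadir)
    oi.write_parquet(path)
    return path

def read_oi(
    fqme: str,
    datadir: Path,
) -> pl.DataFrame:
    return pl.read_parquet(mk_oi_shm_keyed_filepath(fqme, datadir))
```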

ntorres changed title from Add storage support for derivs to Design: data schema for opts ts storage 2025-02-18 16:39:50 +00:00

@ntorres one thing i can see immediately missing is an expiry field 😉


One in-depth question i have (and don’t have an answer for yet) is whether we want to orient the table schema such that all contracts can be interleaved in one big continuous table (for say a given strike/expiry/type) and then broken up for processing/viewing via a polars unpack, or whether we should keep all contracts in entirely separate files for long term storage and then expect the .parquet loader code to do the correct thing (parsing of filenames) to make it easy to load multiple contract-mkts at once?


Ok, yeah after a little thought i figured i’d make a table diagram (see attached) of what i envision being the most useful joined end-table for analysis and processing (like max-pain).

I’m hoping to re-render this using a d2 table (maybe inside a larger diagram) and/or possibly using a real screen shot from

summary of approaches


granular per-mkt named by FQME files

the most file-granular approach would be to keep the table fields relatively simple, with a minimum of 2 and possibly an optional 3rd:

  • time[_ns]: int|float, the timestamp of the OI update
  • oi: float|int, the actual purported open interest of the contract as reported by a provider (feed)
  • oi_calc: float|int, the OI calculated by us (the piker options sys) based on a mkt’s prior state (initial OI) and an ongoing (real-time) vlm feed (likely ohlcv of some sort)

this would mean we leverage a unique fqme (pikers/piker#467) schema to index mkts in the file sys very distinctly, in such a way that,

  • it’s easy to tell which .parquet file contains which deriv mkt’s data by glancing at the filesys
    • this also makes it easy to use tools like visidata (which yes, supports parquet ;) from the console
  • we can be more flexible with how backends provide more than the minimum required field set, just like we do currently for the ohlcv time series historical trades data B)
    • means the downstream piker.fsp and piker.ui consumers can be more flexible with logic around certain fields existing optionally from various providers and can then dynamically process/display different output
  • the 2nd approach below (the fully joined table from all provider data feeds) can be rendered using an aggregation operation in polars when needed (see the loader sketch below), with the further flexibility that not all mkts for a given “super scope” need to be loaded into mem if the user doesn’t require it
    • this comes obvi at the higher cost of conducting more frequent load and aggregate ops when a user dynamically wants to change the data set they’re using
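As a concrete example of that loader-side flexibility, a hedged sketch (the load_contract_mkts() name, file naming, and glob pattern are all assumptions) which lazily scans a set of per-mkt files and only collects the aggregate when asked:

```python
from pathlib import Path

import polars as pl

def load_contract_mkts(
    datadir: Path,
    pattern: str = '*.deribit.parquet',
) -> pl.DataFrame:
    '''
    Lazily scan every per-mkt `.parquet` file matching `pattern`,
    tag each frame with the fqme parsed from its filename, then
    vertically concat into a single table on demand.

    '''
    frames: list[pl.LazyFrame] = [
        pl.scan_parquet(path).with_columns(
            pl.lit(path.name.removesuffix('.parquet')).alias('fqme')
        )
        for path in datadir.glob(pattern)
    ]
    return pl.concat(frames).collect()
```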

all-in-one-table NON-granular, which would mean fewer .parquet files

not my preferred approach since it means constantly having to load a large table scoped by some grander mkt-property (like provider or expiry or settling asset etc.)

However, this would likely result in a much simpler .piker.storage implementation as well as improved loading performance for very large (multi asset) derivatives data sets, since all “sub contract mkts” could be allocated in a single table like the one originally put in the descr:


```python
{
    'time[_ns]': int|float,
    'expiry': Datetime|str,
    'strike_price': int|float,
    'option_type': str|OptType,
    'oi': Decimal,
    'oi_calc': Decimal,
}
```
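With that joined layout, breaking the table back up per-contract is a single polars op; a minimal sketch, assuming the column names above:

```python
import polars as pl

def unpack_contracts(
    df: pl.DataFrame,
) -> dict[tuple, pl.DataFrame]:
    # split the all-in-one table back into per-contract frames
    # keyed by (expiry, strike_price, option_type)
    return df.partition_by(
        ['expiry', 'strike_price', 'option_type'],
        as_dict=True,
    )
```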

Another impl detail wrt how a datad provider can offer OI info for a deriv..

Likely they provide it in one of two ways (both normalizable to a single update shape, as sketched after this list),

  • as an additional field oi/oi_calc updated with every clearing event (every tick that contains non-zero vlm) or,
  • as a separate API-feed endpoint (more commonly i expect to see this like with deribit)
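Either way the backend could normalize both delivery styles into one update record; a hypothetical sketch (the norm_oi_update() helper and msg field names are assumptions mirroring the schema above):

```python
import time

def norm_oi_update(msg: dict) -> dict:
    '''
    Normalize either an `oi` field piggybacked on a clearing
    tick, or a dedicated OI-feed msg, into one record matching
    the storage schema.

    '''
    return {
        'time_ns': msg.get('timestamp') or time.time_ns(),
        'oi': msg.get('open_interest'),  # provider reported, if any
        'oi_calc': None,  # filled in later by our own vlm-based calc
    }
```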

how we want to aggregate

given there are likely going to be 2 feeds for most providers,

  • a normal trades/book data feed both live and historical (as ohlcv)
  • a derivs info feed which delivers OI updates and any other contract-type measures

we need a way to merge these to enable getting a single table for processing such that you can easily get the normal trade event info (ticks) but with at least an added oi/oi_calc field included as a column.
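One way to get that merged table (assuming both series carry a time column in the same unit) is a polars asof join, taking the latest OI update at or before each clearing row’s timestamp:

```python
import polars as pl

def join_oi_onto_ohlcv(
    ohlcv: pl.DataFrame,  # time, open, high, low, close, volume
    oi: pl.DataFrame,     # time, oi, oi_calc
) -> pl.DataFrame:
    # for each ohlcv row, forward-fill the most recent
    # OI update by timestamp
    return ohlcv.sort('time').join_asof(
        oi.sort('time'),
        on='time',
        strategy='backward',
    )
```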

Poster
Collaborator

I’m debating which “table storage” approach to take; if we have a solid fqme, I think it’s better to go for the granular one (granular per-mkt named by FQME files), and then we can load whatever fqme we want

Poster
Collaborator

for deribit a fqme looks like this: btc-26feb25-90k-c.reversed_option.deribit, we already have: currency, expiry_date, strike_price, instrument_kind, option_type and exchange, so we need to store for each fqme this:

```python
{
    'time[_ns]': int|float,
    'oi': Decimal,
    'oi_calc': Decimal,
}
```

And then write all the machinery necessary to handle the fqme files (something like write_derv, get_by_strike, get_by_expiry_date, etc)
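A hedged sketch of that machinery (the fqme field layout is inferred from the example above; helper names just follow the suggestions):

```python
from pathlib import Path

def parse_deribit_fqme(fqme: str) -> dict:
    '''
    Split eg. `btc-26feb25-90k-c.reversed_option.deribit`
    into its embedded mkt fields.

    '''
    symbol, kind, exchange = fqme.split('.')
    currency, expiry, strike, otype = symbol.split('-')
    return {
        'currency': currency,
        'expiry_date': expiry,
        'strike_price': strike,  # eg. '90k'
        'option_type': 'call' if otype == 'c' else 'put',
        'instrument_kind': kind,
        'exchange': exchange,
    }

def get_by_expiry_date(
    datadir: Path,
    expiry: str,  # eg. '26feb25'
) -> list[Path]:
    # pick out the per-mkt files whose fqme embeds this expiry
    return [
        path for path in datadir.glob('*.deribit.parquet')
        if parse_deribit_fqme(
            path.name.removesuffix('.parquet')
        )['expiry_date'] == expiry
    ]
```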
