hist_backfill_fixes: solving conc issues in the tsdb backfiller #62

Open
goodboy wants to merge 44 commits from hist_backfill_fixes into main

Long lingering issues from various GH changesets which seem to have been introduced as part of moving to our .parquet-file backed (colloquially dubbed nativedb) default/builtin tsdb per,


Before landing,

  • document the piker store [ls/delete/ldshm] CLI
  • UX to handle the diff between valid venue closure time-gaps and bad(ly written) data gaps.
  • open PR for our new gap-annotator + remote-ctl API
    • new .ui._annotate.GapAnnotations(GraphicsObject): and surrounding serialization-wrapping impl by claudy which can aid in follow up to solve core conc-logic issues in the backfiller. (i.e. to avoid out-of-order/invalid tsdb writes at the outset instead of the current workarounds using de-duplication/null-segment-filling helpers and other surrounding immediate hacks..)
      • will require pinning to upstream pyqtgraph!
        • do we keep maintaining our fork still tho?
        • do a diff on what’s there i guess ^?
      • ideally bring in the skills stuff in #69 beforehand!
  • follow up todos for integrating said gap-detector/checker into the chart actor’s core runtime and UX.
  • follow up for a UX to re-fill missing/bad ts data?

Follow up todos from GH,

Those that would be nice to knock out here but it’s fine if we just start tracking them throughout all follow up PRs.

Not sure how many are practical to (mark) solve(d) immediately but at least to get my head back in the problem set.

  • landing orig GH pr (as a formality),
    • https://github.com/pikers/piker/pull/486
  • pull out issues from ^ (and any others) ideally using `claude` GH integration to summarize all the follow-up bugs solved here!
    • `/install-github-app`,
  • storage layer draft PR?
    • https://github.com/pikers/piker/pull/446
  • various outstanding `tsdb` tagged stuffs,
    • https://github.com/pikers/piker/issues?q=is%3Aissue%20state%3Aopen%20label%3Atsdb

goodboy added 4 commits 2026-01-07 19:36:17 +00:00
3f674481d3 Factor to a new `.tsp._history` sub-mod
Cleaning out the `piker.tsp` pkg-mod to be only the (re)exports needed
for `._anal`/`._history` refs-use elsewhere!
a1b8554dfd `.tsp._history`: drop `feed_is_live` syncing, another seg flag
The `await feed_is_live.wait()` is more or less pointless and would only
cause slower startup afaig (as-far-as-i-grok) so i'm masking it here.
This also removes the final `strict_exception_groups=False` use from the
non-tests code base, flipping to the `tractor.trionics` collapser once
and for all!
goodboy force-pushed hist_backfill_fixes from 79eb8a1684 to d147bfe8c4 2026-01-16 02:27:31 +00:00 Compare
goodboy added 3 commits 2026-01-19 23:12:06 +00:00
2d4d7cca57 Fix polars 1.36.0 duration API
Polars tightened type safety for `.dt` accessor methods requiring
`total_*` methods for duration types vs datetime component accessors
like `day()` which now only work on datetime dtypes.

`detect_time_gaps()` in `.tsp._anal` was calling `.dt.day()`
on `dt_diff` column (a duration from `.diff()`) which throws
`InvalidOperationError` on modern polars.

Changes:
- use f-string to add pluralization to map time unit strings to
  `total_<unit>s` form for the new duration API.
- Handle singular/plural forms: 'day' -> 'days' -> 'total_days'
- Ensure trailing 's' before applying 'total_' prefix

Also updates inline comments explaining the polars type distinction
between datetime components vs duration totals.

Fixes `piker store ldshm` crashes on datasets with time gaps.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
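
A minimal sketch of the unit-string mapping described in the change list above, not the actual `detect_time_gaps()` code. The helper name `duration_method_name()` is hypothetical; it only shows the 'day' -> 'days' -> 'total_days' normalization.

```python
def duration_method_name(unit: str) -> str:
    '''
    Map a time-unit string to the `total_<unit>s` method name
    used for duration dtypes (hypothetical helper).

    '''
    # ensure a trailing 's': 'day' -> 'days', 'days' stays as is
    plural: str = unit if unit.endswith('s') else f'{unit}s'
    # then apply the 'total_' prefix: 'days' -> 'total_days'
    return f'total_{plural}'


for unit in ('day', 'days', 'hour', 'seconds'):
    print(unit, '->', duration_method_name(unit))
```

With polars installed, the resulting name could presumably be resolved on a duration expression with something like `getattr(pl.col('dt_diff').dt, name)()`, since the `.dt` namespace provides `total_days()`, `total_hours()`, etc.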
f3530b2f6b Add `pexpect`-based `pdbp`-REPL offline helper
Add a new `snippets/claude_debug_helper.py` to
provide a programmatic interface to `tractor.pause()` debugger
sessions for incremental data inspection matching the interactive UX
but able to be run by `claude` "offline" since it can't seem to feed
stdin (so it claims) to the `pdb` instance due to lack of ability to
allocate a tty internally.

The script-wrapper is based on `tractor`'s `tests/devx/` suite's use of
`pexpect` patterns for driving `pdbp` prompts and thus enables
automated-offline execution of REPL-inspection commands **without**
using incremental-realtime output capture (like a human would use it).

Features:
- `run_pdb_commands()`: batch command execution
- `InteractivePdbSession`: context manager for step-by-step REPL interaction
- `expect()` wrapper: timeout handling with buffer display
- Proper stdin/stdout handling via `pexpect.spawn()`

Example usage:
```python
from debug_helper import InteractivePdbSession

with InteractivePdbSession(
    cmd='piker store ldshm zecusdt.usdtm.perp.binance'
) as session:
    session.run('deduped.shape')
    session.run('step_gaps.shape')
```

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
176090b234 Add vlm-based "smart" OHLCV de-duping & bar validation
Using `claude`, add a `.tsp._dedupe_smart` module that attempts "smarter"
de-duplication of bars by distinguishing between erroneous bars
partially written during concurrent backfill race conditions vs.
**actual** data quality issues from historical providers.

Problem:
--------
Concurrent writes (live updates vs. backfilling) can create
duplicate-timestamped ohlcv bars with different values. Some
potential scenarios include,

- a market live feed is cancelled during live update resulting in the
  "last" datum being only partially updated, without all the ticks
  for the time step.
- when the feed is rebooted during charting, the backfiller will not
  finalize this bar since rn it presumes it should only fill data for
  time steps not already in the tsdb storage.

Our current naive `.unique()` approach obvi keeps the incomplete bar
and a "smarter" approach is to compare the provider's final vlm
amount vs. the maybe-cancelled tsdb's bar; a higher vlm value from
the provider likely indicates the cancelled-during-live-write and
**not** a datum discrepancy from said data provider.

Analysis (with `claude`) of `zecusdt` data revealed:
- 1000 duplicate timestamps
- 999 identical bars (pure duplicates from 2022 backfill overlap)
- 1 volume-monotonic conflict (live partial vs backfill complete)

A soln from `claude` -> `tsp._dedupe_smart.dedupe_ohlcv_smart()`
which:
- sorts by vlm **before** deduplication and keeps the most complete
  bar based on vlm monotonicity as well as the following OHLCV
  validation assumptions:
  * volume should always increase
  * high should be non-decreasing,
  * low should be non-increasing
  * open should be identical
- Separates valid race conditions from provider data quality issues
  and reports and returns both dfs.

Change summary by `claude`:
- `.tsp._dedupe_smart`: new module with validation logic
- `.tsp.__init__`: expose `dedupe_ohlcv_smart()`
- `.storage.cli`: integrate smart dedupe, add logging for:
  * duplicate counts (identical vs monotonic races)
  * data quality violations (non-monotonic, invalid OHLC ranges)
  * warnings for provider data issues
- Remove `assert not diff` (duplicates are valid now)

Verified on `zecusdt`: correctly keeps index 3143645
(volume=287.777) over 3143644 (volume=140.299) for
conflicting 2026-01-16 18:54 UTC bar.

`claude`'s Summary of reasoning
-------------------------------
- volume monotonicity is critical: a bar's volume only increases
  during its time window.
- a backfilled bar should always have volume >= live updated.
- violations indicate any of:
  * Provider data corruption
  * Non-OHLCV aggregation semantics
  * Timestamp misalignment

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
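
A rough stdlib-only sketch of the volume-ordered dedupe idea described above, under stated assumptions. It is not the `dedupe_ohlcv_smart()` implementation (which works on dataframes); the function name `dedupe_by_volume()`, the dict-per-bar layout and the sample values are all made up for illustration.

```python
from collections import defaultdict


def dedupe_by_volume(bars: list[dict]) -> tuple[list[dict], list[int]]:
    '''
    Keep the highest-volume bar per timestamp and report the
    timestamps whose duplicates break the assumed invariants.

    '''
    by_time: dict[int, list[dict]] = defaultdict(list)
    for bar in bars:
        by_time[bar['time']].append(bar)

    kept: list[dict] = []
    violations: list[int] = []
    for ts in sorted(by_time):
        # sort by vlm *before* deduping so the last entry is
        # the "most complete" bar.
        group = sorted(by_time[ts], key=lambda b: b['volume'])
        for prev, nxt in zip(group, group[1:]):
            ok: bool = (
                nxt['open'] == prev['open']  # open identical
                and nxt['high'] >= prev['high']  # high non-decreasing
                and nxt['low'] <= prev['low']  # low non-increasing
            )
            if not ok:
                violations.append(ts)
                break
        kept.append(group[-1])

    return kept, violations


# made-up sample bars, NOT real `zecusdt` data
bars = [
    # partial live write vs. complete backfilled bar
    {'time': 60, 'open': 10.0, 'high': 11.0, 'low': 9.5, 'volume': 140.0},
    {'time': 60, 'open': 10.0, 'high': 11.5, 'low': 9.0, 'volume': 288.0},
    # pure duplicate
    {'time': 120, 'open': 11.0, 'high': 12.0, 'low': 10.8, 'volume': 50.0},
    {'time': 120, 'open': 11.0, 'high': 12.0, 'low': 10.8, 'volume': 50.0},
    # conflicting open: treated as a data quality issue
    {'time': 180, 'open': 11.5, 'high': 12.0, 'low': 11.0, 'volume': 30.0},
    {'time': 180, 'open': 11.6, 'high': 11.9, 'low': 11.2, 'volume': 40.0},
]
kept, violations = dedupe_by_volume(bars)
print([b['volume'] for b in kept], violations)
```

Note that this sketch still keeps the highest-volume bar for a violating timestamp; whether such bars should be kept, dropped or only reported is a policy choice the commit message leaves to the real module.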
goodboy added 4 commits 2026-01-22 02:44:21 +00:00
0d76323a90 binance: mk `AggTrade.nq` optional..
Oof! my bad.
Turns out spot pairs don't provide the `.nq` field looks like..
i guess i should not just test `.perp.` pairs all the time!

Bp
9ebb977731 Tolerate various "bad data" cases in `markup_gaps()`
Namely such that when the previous-df-row by our shm-abs-'index' doesn't
exist we ignore certain cases which are likely due to borked-but-benign
samples written to the tsdb or rt shm buffers prior.

Particularly we now ignore,
- any `dt`/`prev_dt` values which are UNIX-epoch timestamped (val of 0).
- any row-is-first-row in the df; there is no previous.
- any missing previous datum by 'index', in which case we lookup the
  `wdts` prior row and use that instead.
  * this would indicate a missing sample for the time-step but we can
    still detect a "gap" by looking at the prior row, by df-abs-index
    `i`, and use its timestamp to determine the period/size of missing
    samples (which need to likely still be retrieved).
  * in this case i'm leaving in a pause-point for introspecting these
    rarer cases when `--pdb` is passed via CLI.

Relatedly in the `piker store` CLI ep,
- add `--pdb` flag to `piker store`, pass it verbatim as `debug_mode`.
- when `times` has only a single row, don't calc a `period_s` median.
- only trace `null_segs` when in debug mode.
- always markup/dedupe gaps for `period_s==60`
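
To make the ignore-cases above concrete, here is a simplified stdlib sketch of gap detection that skips the first row and any UNIX-epoch (0) timestamps. It is an assumption-laden toy, not `markup_gaps()`: the name `find_gaps()` and the `(index, time)` row layout are hypothetical, and it always uses the positionally-prior row rather than doing an 'index' lookup first.

```python
def find_gaps(
    rows: list[tuple[int, int]],
    period_s: int,
) -> list[tuple[int, int, int]]:
    '''
    Scan time-sorted `(index, time)` rows and return
    `(prev_time, time, n_missing)` for each detected gap.

    '''
    gaps: list[tuple[int, int, int]] = []
    for i, (_, t) in enumerate(rows):
        # row-is-first-row: there is no previous
        if i == 0:
            continue

        _, prev_t = rows[i - 1]

        # UNIX-epoch stamped samples are likely borked-but-benign
        if t == 0 or prev_t == 0:
            continue

        # number of samples missing between the two rows
        missing: int = (t - prev_t) // period_s - 1
        if missing > 0:
            gaps.append((prev_t, t, missing))

    return gaps


# made-up rows: an epoch-stamped first sample and a 2-sample gap
rows = [(0, 0), (1, 60), (2, 120), (5, 300), (6, 360)]
gaps = find_gaps(rows, period_s=60)
print(gaps)
```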
goodboy force-pushed hist_backfill_fixes from 9ebb977731 to cd6bc105de 2026-01-22 03:39:11 +00:00 Compare
goodboy added 1 commit 2026-01-22 04:52:20 +00:00
goodboy added 2 commits 2026-01-26 03:18:56 +00:00
809ec6accb Arrow editor refinements in prep for gap checker
Namely exposing `ArrowEditor.add()` params to provide access to
coloring/transparency settings over the remote-ctl annotation API and
also adding a new `.remove_all()` to easily clear all arrows from
a single call. Also add `.remove()` compat methods to the other editors
(i.e. for lines, rects).
e77bec203d Add arrow indicators to time gaps
Such that they're easier to spot when zoomed out, a similar color to the
`RectItem`s and also remote-controlled via the `AnnotCtl` api.

Deats,
- request an arrow per gap from `markup_gaps()` using a new
  `.add_arrow()` meth, set the color, direction and alpha with
  position always as the `iend`/close of the last valid bar.
- extend the `.ui._remote_ctl` subsys to support the above,
  * add a new `AnnotCtl.add_arrow()`.
  * add the service-side IPC endpoint for a 'cmd': 'ArrowEditor'.
- add a new `rm_annot()` helper to ensure the right graphics removal
  API is used by annotation type:
  * `pg.ArrowItem` looks up the `ArrowEditor` and calls `.remove(annot)`.
  * `pg.SelectRect` keeps with calling `.delete()`.
- global-ize an `_editors` table to enable the prior.
- add an explicit RTE for races on the chart-actor's `_dss` init.
goodboy added 1 commit 2026-01-26 16:45:00 +00:00
goodboy added 4 commits 2026-01-27 19:16:34 +00:00
4081336bd3 Catch too-early ib hist frames
For now by REPLing them and raising an RTE inside `.ib.feed` as well as
tracing any such cases that make it (from other providers) up to the
`.tsp._history` layer during null-segment backfilling.
goodboy added 1 commit 2026-01-27 19:19:12 +00:00
8701b517e7 Add `pexpect`, `xonsh`@github:main to deps
The former bc `claude` needs it for its new "offline" REPL simulator
script `snippets/claude_debug_helper.py` and pin to `xonsh` git mainline
to get the fancy new next cmd/suggestion prompt feats (which @goodboy is
using from `modden` already). Bump lock file to match.

Ah right, and for now while hackin pin to a local `tractor` Bp
goodboy added 5 commits 2026-01-28 01:52:05 +00:00
de5b1737b4 Add humanized duration labels to gap annotations
Introduce `humanize_duration()` helper in `.tsp._annotate` to
convert seconds to short human-readable format (d/h/m/s). Extend
annot-ctl API with `add_text()` method for placing `pg.TextItem`
labels on charts.

Also,
- add duration labels on RHS of gap arrows in `markup_gaps()`
- handle text item removal in `rm_annot()` match block
- expose `TextItem` cmd in `serve_rc_annots()` IPC handler
- use `hcolor()` for named-to-hex color conversion
- set anchor positioning for up vs down gaps

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 1 commit 2026-01-28 01:53:56 +00:00
1fb0fe3a04 Add `font_size` param to `AnnotCtl.add_text()` API
Expose font sizing control for `pg.TextItem` annotations thru the
annot-ctl API. Default to `_font.font.pixelSize() - 3` when no
size provided.

Also,
- thread `font_size` param thru IPC handler in `serve_rc_annots()`
- apply font via `QFont.setPixelSize()` on text item creation
- add `?TODO` note in `markup_gaps()` re using `conf.toml` value
- update `add_text()` docstring with font_size param desc

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 1 commit 2026-01-28 02:10:12 +00:00
4e3cd7f986 Drop decimal points for whole-number durations
Adjust `humanize_duration()` to show "3h" instead of "3.0h" when the
duration value is a whole number, making labels cleaner.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
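
For reference, a hedged sketch of what a d/h/m/s duration formatter with the whole-number rule might look like. This is not the code from `.tsp._annotate`; the name `humanize_duration_sketch()`, the unit thresholds and the one-decimal rounding are assumptions.

```python
def humanize_duration_sketch(seconds: float) -> str:
    '''
    Convert seconds to a short human-readable label using the
    largest fitting unit of d/h/m/s (hypothetical impl).

    '''
    units: tuple[tuple[str, int], ...] = (
        ('d', 86400),
        ('h', 3600),
        ('m', 60),
    )
    for suffix, size in units:
        if seconds >= size:
            value: float = round(seconds / size, 1)
            break
    else:
        # smaller than a minute: report plain seconds
        suffix, value = 's', round(seconds, 1)

    # drop the decimal point for whole numbers: "3h" not "3.0h"
    if float(value).is_integer():
        return f'{int(value)}{suffix}'

    return f'{value:.1f}{suffix}'


for secs in (45, 90, 5400, 10800, 86400):
    print(secs, '->', humanize_duration_sketch(secs))
```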
goodboy added 1 commit 2026-01-28 04:52:19 +00:00
76f199df3b Add buffer capacity checks to backfill loop
Prevent `ValueError` from negative prepend index in
`start_backfill()` by checking buffer space before push
attempts. Truncate incoming frame if needed and stop gracefully
when buffer full.

Also,
- add pre-push capacity check with frame truncation logic
- stop backfill when `next_prepend_index <= 0`
- log warnings for capacity exceeded and buffer-full conditions

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
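
The capacity check can be pictured with a toy loop over plain lists. This is only a sketch under assumptions: `truncate_for_prepend()` is a hypothetical name, the real `start_backfill()` pushes to a shm array, and keeping the newest (tail) rows on truncation is my guess at the sensible side to keep.

```python
def truncate_for_prepend(
    frame: list,
    next_prepend_index: int,
) -> list:
    '''
    Return the part of `frame` which still fits in front of the
    existing data, given the remaining prepend slots.

    '''
    # buffer full: nothing can be pushed
    if next_prepend_index <= 0:
        return []

    # frame too large: keep only the most recent rows (assumption)
    if len(frame) > next_prepend_index:
        return frame[-next_prepend_index:]

    return frame


# simulate a backfill loop with 5 free slots and 3-row frames
free: int = 5
frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
pushed: list[list] = []
for frame in frames:
    # stop gracefully when the buffer is full
    if free <= 0:
        print('buffer full, stopping backfill')
        break

    chunk = truncate_for_prepend(frame, free)
    if len(chunk) < len(frame):
        print(f'capacity exceeded, truncated to {len(chunk)} rows')

    pushed.append(chunk)
    free -= len(chunk)

print(pushed)
```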
goodboy added 1 commit 2026-01-28 17:52:33 +00:00
51d109f7e7 Do time-based shm-index lookup for annots on server
Fix annotation misalignment during backfill by switching from
client-computed indices to server-side timestamp lookups against
current shm state. Store absolute coords on annotations and
reposition on viz redraws.

Lowlevel impl deats,
- add `time` param to `.add_arrow()`, `.add_text()`, `.add_rect()`
- lookup indices from shm via timestamp matching in IPC handlers
- force chart redraw before `markup_gaps()` annotation creation
- wrap IPC send/receive in `trio.fail_after(3)` for timeout when
  server fails to respond, likely hangs on no-case-match/error.
- cache `_meth`/`_kwargs` on rects, `_abs_x`/`_abs_y` on arrows
- auto-reposition all annotations after viz reset in redraw cmd

Also,
- handle `KeyError` for missing timeframes in chart lookup
- return `-1` aid on annotation creation failures (lol oh `claude`..)
- reconstruct rect positions from timestamps + BGM offset logic
- log repositioned annotation counts on viz redraw

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 1 commit 2026-01-28 19:45:18 +00:00
858cfce958 Relay annot creation failures with err-dict resps
Change annot-ctl APIs to return `None` on failure instead of invalid
`aid`s. Server now sends `{'error': msg}` dict on failures, client
match-blocks handle gracefully.

Also,
- update return types: `.add_rect()`, `.add_arrow()`, `.add_text()`
  now return `int|None`
- match on `{'error': str(msg)}` in client IPC receive blocks
- send error dicts from server on timestamp lookup failures
- add failure handling in `markup_gaps()` to skip bad rects

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 2 commits 2026-01-28 21:32:47 +00:00
88732a67d5 Add `get_fonts()` API and fix `.px_size` for non-Qt ctxs
Add a public `.ui._style.get_fonts()` helper to retrieve the
`_font[_small]: DpiAwareFont` singleton pair. Adjust
`DpiAwareFont.px_size` to return `conf.toml` value when Qt returns `-1`
(no active Qt app).

Also,
- raise `ValueError` with detailed msg if both Qt and a conf-lookup fail
- add some more type union whitespace cleanups: `int | None` -> `int|None`

(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
3a515afccd Use `get_fonts()`, add `show_txt` flag to gap annots
Switch `.tsp._annotate.markup_gaps()` to use new
`.ui._style.get_fonts()` API for font size calc on client side and add
optional `show_txt: bool` flag to toggle gap duration labels (with
default `False`).

Also,
- replace `sgn` checks with named bools: `up_gap`, `down_gap`
- use `small_font.px_size - 1` for gap label font sizing
- wrap text creation in `if show_txt:` block
- update IPC handler to use `get_fonts()` vs direct `_font` import

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 5 commits 2026-01-30 19:53:15 +00:00
205058de21 Always overwrite tsdb duplicates found during backfill
Enable the previously commented-out dedupe-and-write logic in
`start_backfill()` to ensure tsdb stays clean of duplicate
entries.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
ec4e6ec742 ib.feed: drop legacy "quote-with-vlm" polling
Since we now explicitly check each mkt's venue hours we don't need
this mega hacky "waiting on a quote with real vlm" stuff to determine
whether historical data should be loaded immediately. This approach also
had the added complexity that we needed to handle edge cases for tickers
(like xauusd.cmdty) which never have vlm.. so it's nice to be rid of it
all ;p
goodboy added 2 commits 2026-01-30 20:40:10 +00:00
bac8317a4a Add `get_godw()` singleton getter for `GodWidget`
Expose `get_godw()` helper to retrieve the central `GodWidget`
instance from anywhere in the UI code. Set the singleton in
`_async_main()` on startup.

Also,
- add docstring to `run_qtractor()` explaining trio guest mode
- type annotate `instance: GodWidget` in `run_qtractor()`
- import reorg in `._app` for cleaner grouping
- whitespace cleanup: `Type | None` -> `Type|None` throughout
- fix bitwise-or alignment: `Flag | Other` -> `Flag|Other`

(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 2 commits 2026-01-30 23:48:34 +00:00
d5edd3484f Clarify `register_with_sampler()` started type and vars
Markup `ctx.started()` type-sig as `set[int]`, rename binding var
`first` to `shm_periods` and add type hints for clarity on context mgr
unpacking.

Also,
- whitespace cleanup: `Type | None` -> `Type|None` throughout
- format long lines: `.setdefault()`, `await ctx.started()`
- fix backtick style in docstrings for consistency
- add placeholder TODO comment for `feed_is_live` check; it might be
  more rigorous to pass the syncing state down thru all this?

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
f73b981173 Only register shms w sampler when `feed_is_live`
Add timeout-gated wait for `feed_is_live: trio.Event` before passing shm
tokens to `open_sample_stream()`; skip registering shm-buffers with the
sampler if the feed doesn't "go live" within a new timeout.

The main motivation here is to avoid the sampler incrementing shm-array
bufs when the mkt-venue is closed so that a trailing "same price"
line/bars isn't updated/rendered in the chart's view when unnecessary.

Deats,
- add `wait_for_live_timeout: float = 0.5` param to `manage_history()`
- warn-log the fqme when timeout triggers
- add error log for invalid `frame_start_dt` comparisons to
  `maybe_fill_null_segments()`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
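
The timeout-gated wait can be sketched as below. Note the swap: the code base uses `trio` (where this would likely be a `trio.move_on_after()` block around `feed_is_live.wait()`), but to stay stdlib-only this example uses `asyncio` instead. The helper name `wait_for_live()` is hypothetical; the `0.5` default mirrors the `wait_for_live_timeout` param mentioned above.

```python
import asyncio


async def wait_for_live(
    feed_is_live: asyncio.Event,
    timeout: float = 0.5,
) -> bool:
    '''
    Wait up to `timeout` seconds for the feed to "go live";
    return whether shm registration with the sampler should
    proceed.

    '''
    try:
        await asyncio.wait_for(feed_is_live.wait(), timeout)
    except asyncio.TimeoutError:
        # venue likely closed: skip sampler registration
        return False

    return True


async def main() -> tuple[bool, bool]:
    closed_feed = asyncio.Event()  # never set
    live_feed = asyncio.Event()
    live_feed.set()  # as if a first quote arrived
    return (
        await wait_for_live(closed_feed, timeout=0.05),
        await wait_for_live(live_feed, timeout=0.05),
    )


results = asyncio.run(main())
print(results)
```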
goodboy added 1 commit 2026-01-30 23:51:03 +00:00
48493e50b0 .ib.feed: only set `feed_is_live` after first quote
Move `feed_is_live.set()` to after receiving the first valid
quote instead of setting early on venue-closed path. Prevents
sampler registration when no live data expected.

Also,
- drop redundant `.set()` call in quote iteration loop
- add TODO note about sleeping until venue opens vs forever
- init `first_quote: dict` early for consistency

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 1 commit 2026-01-31 00:21:44 +00:00
2d678e1582 Guard against `None` chart in `ArrowEditor.remove()`
Add null check for `linked.chart` before calling
`.plotItem.removeItem()` to prevent `AttributeError` when chart
is `None`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 1 commit 2026-02-02 00:39:38 +00:00
6f8a361e80 Cleanups and doc tweaks to `.ui._fsp`
Expand read-race warning log for clarity, add TODO for reading
`tractor` transport config from `conf.toml`, and reflow docstring
in `open_vlm_displays()`.

Also,
- whitespace cleanup: `Type | None` -> `Type|None`
- clarify "Volume" -> "Vlm (volume)" in docstr

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
This Pull Request doesn't have enough approvals yet. 0 of 1 approvals granted.
Reference: pikers/piker#62