Rework low-level-runtime to enforce a `msgspec`-defined, SC-supervision-protocol for IPC `Context`s #7
Now rebase-synced and up to date / ready for merge!
Synopsis
This is a heavy rework and refinement of the internal runtime to use a new `msgspec.Struct`-derived message set to encapsulate all RPC dialogs, i.e. `trio.Task`-level IPC messaging, in a rigorous "SC-transitive supervision protocol"; a strict set of message types now enforces that RPC transactions/dialogs between parallel executing tasks adhere to a WIP SC-applied-to-RPC protocol specification. Such a thing was previously desired/described in prior issues.
High level concept(s) summary
- a new IPC msg set defined in `.msg.types` which is required to encapsulate any actor-runtime-related IPC msg-packet exchange between actor tasks.
  - the type hierarchy is rooted in a `MsgType[msgspec.Struct]` and a subsequently derived `PayloadMsg[PayloadT]` for msgs which encapsulate a "payload" field, `PayloadMsg.pld: msgspec.Raw`
  - use of `Raw` is so that `tractor` application code can declare a type-spec for various RPC payload msgs which then limits the allowed types for `.pld: MyType1|MyType2` (see the `msgspec` sketch following this list)
- IPC ctx endpoints can now declare a new `@tractor.context(pld_spec=MyType1|MyType2)` which then configures an underlying `tractor.msg._ops.PldRx`: a so called "payload receiver" which is bound to the `tractor.Context` currently in use/scope (read via `tractor.current_ipc_ctx()`)
  - the reason to have a separate `PldRx.pld_dec: MsgDec` is that it has the job of decoding the "payloads from user code" as defined by any `pld_spec: Union[Type]` declared on the ctx endpoint (as above).
  - further, in the future this allows us to let user code define a msg-state-machine as needed throughout the lifetime of a `Portal.open_context()` call such that you can write very pedantic and dynamic IPC msg-flows if needed, say within a `Context.open_stream()` phase: user code can define their sub-msg-spec throughout the IPC dialog.
- a fairly serious rework of the core RPC-msg-processing loop to only expect and allow this `MsgType` set on the wire, guaranteeing that any IPC `Context` adheres to SC from the initial `Start`/`Started[PayloadT]` request to the final `Error[Exception|ContextCancelled]|Return[PayloadT]`
- as part of this, the core RPC machinery moves to a new `tractor._rpc`, isolating it from the `._runtime.Actor`-primitive code which is its main task-parent and caller.
- integration of the various IPC msgs with our `._exceptions.RemoteActorError` such that remote errors and "inceptions" are relayed in a consistent, boxed approach via `._exceptions.[un]pack_error()`.
- a new `MsgTypeError` (MTE) such that the `._ipc.Channel`/`MsgTransport` layer can block non-`MsgType` msgs before they reach user code, instead raising an MTE indicating the violation in the `.msg.types` sense.
  - similarly a `MsgTypeError` will be raised by the currently configured `PldRx` based on its pld-spec, further preventing bad IPC-delivered `PayloadT` input from ever reaching user (app) code.

TODOs: docs, diagrams and follow-ups detailing the mechanics of the new `msgspec` set; `# TODO`s throughout the latest code (likely taken from #19) to be broken into follow-up issues!
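To ground the `Raw`-payload idea above, a minimal `tractor`-free sketch using just `msgspec`; the exact `Started` field layout here is a simplified assumption:

```python
import msgspec
from msgspec import msgpack

class Started(msgspec.Struct, tag=True):
    cid: str
    pld: msgspec.Raw  # opaque to the runtime/transport layer

# an app-declared pld-spec which limits the allowed payload types:
pld_dec = msgpack.Decoder(int | str)

wire: bytes = msgpack.encode(
    Started(cid='1', pld=msgspec.Raw(msgpack.encode('hello')))
)
msg: Started = msgpack.Decoder(Started).decode(wire)
assert pld_dec.decode(msg.pld) == 'hello'  # passes the spec

# a non-spec payload only fails at the pld-decode step:
bad_wire: bytes = msgpack.encode(
    Started(cid='2', pld=msgspec.Raw(msgpack.encode([1, 2])))
)
try:
    pld_dec.decode(msgpack.Decoder(Started).decode(bad_wire).pld)
except msgspec.ValidationError as verr:
    print(f'pld-spec violation: {verr}')
```

Note how the shuttle-msg decode always succeeds; only the separately configured payload decoder enforces the app's spec, which is exactly the `PldRx` split described above.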
f2ce4a3: timeout bump

Set a diff `Msg.pld` spec per test and then send multiple types to a child actor, making sure the child can only send certain types over a stream and fails with validation or decode errors otherwise. The test is also param-ed both with and without hooks demonstrating how a custom type, `NamespacePath`, needs them for effective use. The subactor IPC context child is passed an `expect_ipc_send: dict` which relays the values along with their expected `.send()`-ability.

Deats on technical refinements:
------
- added an `iter_maybe_sends()` send-value-as-msg-auditor and predicate generator (literally) so as to be able to pre-determine, given the current codec and `send_values`, which values are expected to be IPC transmittable.
- as per ^, the diff value-msgs are first round-tripped inside a `Started` msg using the configured codec in the parent/root actor before bothering with using IPC primitives + a subactor; this is how the `expect_ipc_send` table is generated initially.
- for serializing the specs (`Union[Type]`s as required by `msgspec`), added a pair of codec hooks: `enc/dec_type_union()` (that ideally we move into a `.msg` submod eventually) which code the type-values as a `list[str]` of names; see the sketch below these notes.
- the `dec_` hook had to be modified to NOT raise an error when an invalid/unhandled value arrives; this is because we do NOT want the RPC msg handling loop to raise on the `async for msg in chan:` and instead prefer to ignore and warn (for now, but eventually respond with an error msg - see notes in the hook body) on these msgs when sent during a streaming phase; `Context.started()` will however error on a bad input for the current msg-spec since it is part of the "cheap" dialog (again see notes in `._context`) wherein the `Started` msg is always roundtripped prior to `Channel.send()` to guarantee the child adheres to its own spec.
- tossed in lotsa `print()`s for console groking of the run progress.

Further notes on typed-msging breaking cancellation:
------
- turns out that since the runtime's cancellation implementation is done with `Actor.cancel()` methods and friends, it will actually break when a stringent spec is applied (eg. a single type-spec) since the return values from said methods are generally `bool`s..
- this means we do indeed need special handling of "runtime RPC method invocations" since ideally a user's msg-spec choices do not break core functionality on them XD
  => The obvi solution is to add a/some special sub-`Msg` types for such cases, possibly just a `RuntimeReturn(Return)` type that will always include a `.pld: bool` for these cancel methods such that their results are always handled without msg type errors.

More to come on a (hopefully) elegant solution to that last bit!
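A sketch of the `Union[Type]`-as-`list[str]` hook idea using plain `msgspec`; the name-mapping table and hook names are illustrative assumptions (the real hooks live in the test suite for now):

```python
import types
from msgspec import msgpack

_str2type = {'int': int, 'str': str, 'float': float}

def enc_type_union(obj: object) -> list[str]:
    # encode a runtime type-union (eg. `int | str`) as its member names
    if isinstance(obj, types.UnionType):
        return [typ.__name__ for typ in obj.__args__]
    raise NotImplementedError(f'no enc hook for {obj!r}')

enc = msgpack.Encoder(enc_hook=enc_type_union)
wire: bytes = enc.encode(int | str)  # -> ['int', 'str'] on the wire

# the matching dec-side logic, shown inline; per the notes above the
# real `dec_` hook warns-and-ignores instead of raising on unhandled
# input so the RPC recv loop never crashes mid-stream.
names: list[str] = msgpack.decode(wire)
spec = _str2type[names[0]]
for name in names[1:]:
    spec |= _str2type[name]
assert spec == (int | str)
```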
As detailed in the surrounding notes, it's pretty advantageous to always have the child context task ensure the first msg it relays back is msg-type checked against the current spec and thus `MsgCodec`. Implement the check via a simple codec-roundtrip of the `Started` msg such that the `.pld` payload is always validated before transit. This ensures the child will fail early and notify the parent before any streaming takes place (i.e. the "nasty" dialog protocol phase).

The main motivation here is to avoid inter-actor task syncing bugs that are hard(er) to recover from, such as when an invalid typed msg is sent to the parent, who then ignores it (depending on config), and then the child thinks the parent is in some presumed state while the parent is still thinking a first msg has yet to arrive. Doing the stringent check on the sender side (i.e. the child sending the "first" application msg via `.started()`) avoids/sidesteps dealing with such syncing/coordinated-state problems by keeping the entire IPC dialog in a "cheap" or "control" style transaction up until a stream is opened. Iow, the parent task's `.open_context()` block entry can't occur until the child side is definitely (as much as is possible with IPC msg type checking) in a correct state spec-wise. During any streaming phase in the dialog the msg-type-checking is NOT done for performance (the "nasty" protocol phase) and instead any type errors are relayed back from the receiving side. I'm still unsure whether to take the same approach on the `Return` msg, since at that point erroring early doesn't benefit the parent task if/when a msg-type error occurs? Definitely more to ponder and tinker out here..

Impl notes (see the roundtrip sketch below):
- a gotcha with the roundtrip-codec-ed msg is that it often won't match the input `value` bc in the `msgpack` case many native python sequence/collection types will map to a common array type due to the surjection that `msgpack`'s type-sys imposes.
- so we can't assert that `started == rt_started` but it may be useful to at least report the diff of the type-reduced payload so that the caller can at least be notified how the input `value` might be better type-casted prior to the call, for ex. pre-casting to `list`s.
- added a `._strict_started: bool` that could provide the stringent checking if desired in the future.
- on any validation error raise our `MsgTypeError` from it.
- ALSO change over the lingering `.send_yield()` deprecated meth body to use a `Yield()`.
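A sketch of the send-side codec-roundtrip check under the same simplified `Started` shape as earlier (in `tractor` proper the validation error becomes a `MsgTypeError`):

```python
import msgspec
from msgspec import msgpack

class Started(msgspec.Struct, tag=True):
    cid: str
    pld: int | str  # the pld-spec applied for this example

_codec = msgpack.Decoder(Started)

def roundtrip_validate(msg: Started) -> Started:
    # encode-then-decode under the applied spec so any `.pld`
    # violation raises *before* the msg ever hits `Channel.send()`;
    # NOTE the gotcha from above: the result may not `==` the input
    # since eg. tuples surject to lists under `msgpack`'s type-sys.
    try:
        return _codec.decode(msgpack.encode(msg))
    except msgspec.ValidationError as src_err:
        raise TypeError(
            f'`Started.pld` violates the applied spec: {src_err}'
        ) from src_err
```

So something like `roundtrip_validate(Started(cid='1', pld=0.5))` fails in the child, early, instead of leaving the parent in a presumed-synced state.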
These are likely temporary changes but still needed to actually see the desired/correct failures (of which 5 of 6 tests are supposed to fail rn) mostly to do with `Start` and `Return` msgs which are invalid under each test's applied msg-spec.

Tweak set here:
- bit more `print()`s in root and sub for grokin test flow.
- never use `pytest.fail()` in a subactor.. should know this by now XD
- comment out some bits that can't ever pass rn and make the underlying expected failures harder to grok:
  - the sub's child-side-of-ctx task doing sends should only fail for certain msg types like `Started` + `Return`; `Yield`s are processed receiver/parent side.
  - don't expect the `sent` list to match the predicate set for the same reason as the last bullet.

The outstanding msg-type-semantic validation questions are:
- how to handle `.open_context()` with an input `kwargs` set that doesn't adhere to the currently applied msg-spec?
  - should the initial `@acm` entry fail before sending to the child side?
- where should received `MsgTypeError`s be raised, at the `MsgStream` `.receive()` or lower in the stack?
  - i'm thinking we should mk `MsgTypeError` derive from `RemoteActorError` and then have it be delivered as an error to the `Context`/`MsgStream` for per-ctx-task handling; would lead to more flexible/modular policy overrides in user code outside any defaults we provide.

Such that `Channel.recv()` + `MsgpackTCPStream.recv()` originating msg-type-errors are not raised at the IPC transport layer but instead relayed up the runtime stack for eventual handling by user-app code via the `Context`/`MsgStream` layer APIs. This design choice leads to a substantial amount of flexibility and modularity, and avoids `MsgTypeError` handling policies from being coupled to a particular backend IPC transport layer:
- receive-side msg-type errors, as can be raised and handled in the `.open_stream()` "nasty" phase of a ctx, whilst being packed at the `MsgCodec`/transport layer (keeping the underlying src decode error coupled to the specific transport + interchange lib) and then relayed upward to app code for custom handling like a normal `Error` msg.
- the policy options for handling such cases could be implemented as `@acm` wrappers around `.open_context()`/`.open_stream()` blocks (and their respective delivered primitives) OR just plain old async generators around `MsgStream.receive()` such that both built-in policy handling and custom user-app solutions can be swapped without touching any `tractor` internals or providing specialized "registry APIs"; see the wrapper sketch below.
  -> eg. the ignore-and-relay-invalid-msg-to-sender approach can be more easily implemented as embedded `try: except MsgTypeError:` blocks around `MsgStream.receive()`, possibly applied as either an injected wrapper type around a stream or an async gen that `async for`s from the stream.
- any performance based AOT-lang extensions used to implement a policy for handling recv-side errors can avoid knowledge of the lower level IPC `Channel` (and-downward) primitives.
- `Context` consuming code can choose to let all msg-type-errs bubble and handle them manually (like any other remote `Error` shuttled exception).
- we can keep (as before) send-side msg type checks which are raised locally and cause offending senders to error and adjust before the streaming phase of an IPC ctx.

Impl (related) deats:
- obvi make `MsgpackTCPStream.recv()` yield up any `MsgTypeError` constructed by `_mk_msg_type_err()` such that the exception will eventually be relayed up to `._rpc.process_messages()` and from there delivered to the corresponding ctx-task.
- in support of ^, make `Channel.recv()` detect said MTEs and use the new `pack_from_raise()` to inject the far end `Actor.uid` for the `Error.src_uid`.
- keep raising the send-side equivalent (when strict enabled) errors inline immediately with no upward `Error` packing or relay.
- improve `_mk_msg_type_err()` case handling with far more detailed `MsgTypeError` "message" contents pertaining to `msgspec`-specific failure-fixing-tips and type-spec mismatch info:
  * use the `.from_decode()` constructor in the recv-side case to inject the non-spec decoded `msg_dict: dict` and use the new `MsgCodec.pld_spec_str: str` when clarifying the type discrepancy with the offending field.
  * on the send-side, if we detect that an unsupported field type was described in the original `src_type_error`, AND there is no `msgpack.Encoder.enc_hook()` set, then the real issue is likely that the user needs to extend the codec to support the non-std/custom type with a hook, so link to the `msgspec` docs.
  * if one of a `src_type/validation_error` is provided, set that error as the `.__cause__` in the new mte.
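A sketch of the "plain old async generator" policy style mentioned above; `MsgTypeError`'s import path follows this PR's layout and end-of-stream handling is elided:

```python
import logging

from tractor._exceptions import MsgTypeError

log = logging.getLogger(__name__)

async def ignore_bad_msgs(stream):
    # lenient recv-side policy: skip-and-log any `MsgTypeError`
    # instead of letting it bubble to the app task; a swap-in for
    # a bare `async for msg in stream:` loop.
    while True:
        try:
            yield await stream.receive()
        except MsgTypeError as mte:
            log.warning(f'ignoring invalid msg,\n{mte}')
```

Consuming code then just does `async for msg in ignore_bad_msgs(stream): ...`, with no `tractor`-internal registry API needed to apply the policy.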
322e015d: Fix `mk_codec()` input arg

Display the new `MsgCodec.pld_spec_str` and format the incorrect field value to be placed entirely (txt-block wise) right of the "type annot" part of the line. Iow if you had a bad `dict` value where something else should be, it'd look something like this:

    <Started(
     |_pld: NamespacePath = {'cid': '3e0ca00c-7d32-4d2a-a0c2-ac2e12453871', 'locked': True, 'msg_type': 'LockStatus', 'subactor_uid': ['sub', 'af7ccb69-1dab-491f-84f7-2ec42c32d137']}

Since it's totes possible to have a spec applied that won't permit `str`s, might as well formalize a small msg set for subactors to request the tree-wide TTY `Lock`. BTW, I'm prolly not going into every single change here in this first WIP since there's still a variety of broken stuff mostly to do with races on the codec apply being done in a `trio.lowlevel.RunVar`; it should be re-done with a `ContextVar` such that each task does NOT mutate the global setting..

New msg set and usage is simply (see the sketch below):
- `LockStatus` which is the response msg delivered from `lock_tty_for_child()`
- `LockRelease` a one-off request msg from the subactor to drop the `Lock` from a `MsgStream.send()`.
- use these msgs throughout the root and sub sides of the locking ctx funcs: `lock_tty_for_child()` & `wait_for_parent_stdin_hijack()`

The codec is now applied in both the root and sub `Lock` request tasks:
- for root, inside `lock_tty_for_child()` before the `.started()`.
- for subs, inside `wait_for_parent_stdin_hijack()` since we only want to affect the codec *for the locking task*.
  - (hence the need for a ctx-var as mentioned above, but currently this can cause races which will break against other app tasks competing for the codec setting).
- add an `apply_debug_codec()` helper for use in both cases.
- add more detailed logging to both the root and sub side of the `Lock` requesting funcs including requiring that the sub-side task "uid" (a `tuple[str, int] = (trio.Task.name, id(trio.Task))`) be provided (more on this later).

A main issue discovered while proto-testing all this was the ability of a sub to "double lock" (leading to self-deadlock) via an error in `wait_for_parent_stdin_hijack()` which, for ex., can happen in debug mode via crash handling of a `MsgTypeError` received from the root during a codec-applied msg-spec race! Originally I was attempting to solve this by making the SIGINT override handler more resilient, but this case is somewhat impossible to detect by an external root task other than by checking for duplicate ownership via the new `subactor_task_uid`.
=> SO NOW, we always stick the current task uid in the `Lock._blocked: set` and raise an rte on a double request by the same remote task.

Included is a variety of small refinements:
- finally figured out how to mark a variety of `.__exit__()` frames with `pdbp.hideframe()` to actually hide them B)
- add cls methods around managing `Lock._locking_task_cs` from root only.
- re-org all the `Lock` attrs into those only used in root vs. subactors and proto-prep a new `DebugStatus` actor-singleton to be used in subs.
- add a `Lock.repr()` to contextually print the current conc primitives.
- rename our `Pdb`-subtype to `PdbREPL`.
- rigor out the SIGINT handler a bit, originally to try and hack-solve the double-lock issue mentioned above, but now just with better logging and logic for most (all?) possible hang cases that should be hang-recoverable after enough ctrl-c mashing by the user.. well hopefully:
  - using `Lock.repr()` for both root and sub cases.
  - lots more `log.warn()`s and handler reversions on stale lock or cs detection.
- factor the `._pause()` impl a little better, moving the actual repl entry to a new `_enter_repl_sync()` (originally for easier wrapping in the sub case with `apply_codec()`).
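A sketch of the small TTY-`Lock` msg set; the field names are inferred from the example `LockStatus` output shown just above and may not match the final impl exactly:

```python
import msgspec

class LockStatus(msgspec.Struct, tag=True):
    # response msg sent from the root's `lock_tty_for_child()`
    subactor_uid: tuple[str, str]
    cid: str
    locked: bool

class LockRelease(msgspec.Struct, tag=True):
    # one-off request from a subactor to drop the `Lock`
    subactor_uid: tuple[str, str]
    cid: str

# the pld-spec applied *only* for the locking ctx/task:
__pld_spec__ = LockStatus | LockRelease
```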
Breaks out all the (sub)actor-local conc primitives from `Lock` (which is now only used in and by the root actor) such that there's an explicit distinction between a task that's "consuming" the `Lock` (remotely) vs. the root-side service tasks which do the actual acquire on behalf of the requesters.

`DebugStatus` changeover deats:
------
- move all the actor-local vars over to `DebugStatus` including:
  - move `_trio_handler` and `_orig_sigint_handler`
  - `local_task_in_debug` now `repl_task`
  - `_debugger_request_cs` now `req_cs`
  - `local_pdb_complete` now `repl_release`
- drop all ^ fields from `Lock.repr()` obvi..
- move over the `.[un]shield_sigint()` and `.is_main_trio_thread()` methods.
- add some new attrs/meths:
  - `DebugStatus.repl` for the currently running `Pdb` in-actor singleton.
  - `.repr()` for pprint of state (like `Lock`).
- Note: even when a root-actor task is in REPL, the `DebugStatus` is still used for certain actor-local state mgmt, such as SIGINT handler shielding.
- obvi change all lock-requester code bits to now use a `DebugStatus` in their local actor-state instead of `Lock`, i.e. change usage from `Lock` in `._runtime` and `._root`.
- use the new `Lock.get_locking_task_cs()` API when checking for a sub-in-debug from `._runtime.Actor._stream_handler()`.

Unrelated-to-topic-at-hand tweaks:
------
- drop the commented bits about hiding `@[a]cm` stack frames from `_debug.pause()` and simplify to only one block with the `shield` passthrough since we already solved the issue with cancel-scopes using `@pdbp.hideframe` B)
  - this includes all the extra logging about the extra frame for the user (good thing i put in that wasted effort back then eh..)
- put the `try/except BaseException` with `log.exception()` around the whole of `._pause()` to ensure we don't miss in-func errors which can cause hangs..
- allow passing in `portal: Portal` to `Actor.start_remote_task()` such that `Portal` task spawning methods are always denoted correctly in terms of `Context.side`.
- lotsa logging tweaks, decreasing a bit of noise from `.runtime()`s.

77a15eb: use `DebugStatus` in `._rpc`

As per some newly added features and APIs:
- pass `portal: Portal` to `Actor.start_remote_task()` from `open_context_from_portal()` marking `Portal.open_context()` as always being the "parent" task side.
- add caller tracing via `.devx._code.CallerInfo/.find_caller_info()` called in `mk_context()` and (for now) a `__runtimeframe__: int = 2` inside `open_context_from_portal()` such that any enter-er of `Portal.open_context()` will be reported; a stdlib-only sketch of the idea follows below.
- pass in a new `._caller_info` attr which is used in 2 new meths:
  - `.repr_caller: str` for showing the name of the app-code-func.
  - `.repr_api: str` for showing the API ep, which for now we just hardcode to `Portal.open_context()` since ow it's gonna show the mod func name `open_context_from_portal()`.
- use those new props ^ in the `._deliver_msg()` flow body log msg content for much clearer msg-flow tracing Bo
- add `Context._cancel_on_msgerr: bool` to toggle whether a delivered `MsgTypeError` should trigger a `._scope.cancel()` call.
  - also (temporarily) add separate `.cancel()` emissions for both cases as i work through hacking out the maybe `MsgType.pld: Raw` support.
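A stdlib-only sketch of the caller-tracing idea (the real impl is `.devx._code.CallerInfo`/`.find_caller_info()`; the names and frames-up count here are illustrative):

```python
import inspect
from types import FrameType

def find_caller_info(frames_up: int = 2) -> str:
    # walk `frames_up` frames back from the API endpoint to report
    # "which and where" app code entered it, mimicking the
    # `__runtimeframe__: int = 2` idea.
    frame: FrameType | None = inspect.currentframe()
    for _ in range(frames_up):
        if frame is None:
            return '<unknown>'
        frame = frame.f_back
    if frame is None:
        return '<unknown>'
    code = frame.f_code
    # `.co_qualname` (py3.11+) also shows `Cls.meth` style names
    return f'{code.co_qualname}() @ {code.co_filename}:{frame.f_lineno}'

def open_context_like_ep() -> str:
    # an API ep reporting its immediate app-code caller
    return find_caller_info(frames_up=2)

def app_code() -> str:
    return open_context_like_ep()

print(app_code())  # -> app_code() @ <this file>:<its line no>
```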
As per much tinkering, re-designs and preceding rubber-ducking via many "commit msg novelas", **finally** this adds the (hopefully) final missing layer for typed msg safety: `tractor.msg._ops.PldRx` (or `PayloadReceiver`? haven't decided how verbose to go..)

Design justification summary:
------
- need a way to be as-close-as-possible to the `tractor`-application such that when `MsgType.pld: PayloadT` validation takes place, it is straightforward and obvious how user code can decide to handle any resulting `MsgTypeError`.
- there should be a common and optional-yet-modular way to modify **how** data delivered via IPC (possibly embedded as user defined, type-constrained `.pld: msgspec.Struct`s) can be handled and processed during fault conditions and/or IPC "msg attacks".
- support for nested type constraints within a `MsgType.pld` field should be simple to define, implement and understand at runtime.
- a layer between the app-level IPC primitive APIs (`Context`/`MsgStream`) and application-task code (consumer code of those APIs) should be easily customized and prove-to-be-as-such through demonstrably rigorous internal (sub-sys) use!
  -> eg. via seamless runtime RPC eps support like `Actor.cancel()`
  -> by correctly implementing our `.devx._debug.Lock` REPL TTY mgmt dialog prot, via a dead simple payload-as-ctl-msg-spec.

There are some fairly detailed doc strings included so I won't duplicate that content; the majority of the work here is actually somewhat of a factoring of many similar blocks that are doing more or less the same `msg = await Context._rx_chan.receive()` with boilerplate for `Error`/`Stop` handling via `_raise_from_no_key_in_msg()`. The new `PldRx` basically provides a shim layer for this common "receive msg, decode its payload, yield it up to the consuming app task" by pairing the RPC feeder mem-chan with a msg-payload decoder and expecting IPC API internals to use **one** API instead of re-implementing the same pattern all over the place XD (a toy rendition of the pattern is sketched below).

`PldRx` breakdown
------
- for now only expects a `._msgdec: MsgDec` which allows for override-able `MsgType.pld` validation and is most obviously used in the impl of `.dec_msg()`, the decode-message method.
- provides multiple mem-chan receive options including:
  |_ `.recv_pld()` which does the e2e operation of receiving a payload item.
  |_ a sync `.recv_pld_nowait()` version.
  |_ a `.recv_msg_w_pld()` which optionally allows retrieving both the shuttling `MsgType` as well as its `.pld` body for use cases where info on both is important (eg. draining a `MsgStream`).

Dirty internal changeover/implementation deatz:
------
- obvi move over all the IPC "primitives" that previously had the duplicate recv-n-yield logic:
  - `MsgStream.receive[_nowait]()` delegating instead to the equivalent `PldRx.recv_pld[_nowait]()`.
- add `Context._pld_rx: PldRx`, created and passed in by `mk_context()`; use it for the `.started()` -> `first: Started` retrieval inside `open_context_from_portal()`.
- all the relevant `Portal` invocation methods: `.result()`, `.run_from_ns()`, `.run()`; also allows for dropping `_unwrap_msg()` and `Portal._return_once()` outright Bo
- rename `Context._recv_chan` -> `._rx_chan`.
- add detailed `Context._scope` info for logging whether or not it's cancelled inside `_maybe_cancel_and_set_remote_error()`.
- move `._context._drain_to_final_msg()` -> `._ops.drain_to_final_msg()` since it's really not necessarily ctx specific per se, and it does kinda fit with "msg operations" more abstractly ;)
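A toy rendition of the factored-out pattern, assuming a shuttle msg with a `.pld: Raw` field as sketched earlier (the real `PldRx` also handles `Error`/`Stop` interleaves and per-ctx spec overrides):

```python
from msgspec import msgpack
import trio

class PldRx:
    # pair the RPC feeder mem-chan with a payload decoder so all IPC
    # primitives share *one* recv-then-decode API.
    def __init__(
        self,
        rx_chan: trio.MemoryReceiveChannel,
        pld_spec: type,
    ):
        self._rx_chan = rx_chan
        self._pld_dec = msgpack.Decoder(pld_spec)

    async def recv_pld(self):
        # e2e op: receive the shuttle msg then apply the pld-spec
        msg = await self._rx_chan.receive()
        return self._pld_dec.decode(msg.pld)

    def recv_pld_nowait(self):
        msg = self._rx_chan.receive_nowait()
        return self._pld_dec.decode(msg.pld)

    async def recv_msg_w_pld(self):
        # deliver both the shuttling `MsgType` and its decoded `.pld`
        msg = await self._rx_chan.receive()
        return msg, self._pld_dec.decode(msg.pld)
```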
Since i was running into them (internal errors) during lock request machinery dev and was getting all sorts of difficult-to-understand hangs whenever i intro-ed a bug to either side of the ipc ctx; this all while trying to get the msg-spec working for `Lock` requesting subactors..

Deats:
- hideframes for `@acm`s and `trio.Event.wait()`, `Lock.release()`.
- better detail out the `Lock.acquire/release()` impls.
- drop `Lock.remote_task_in_debug`, use the new `.ctx_in_debug`.
- add a `Lock.release(force: bool)`.
- move most of what was `_acquire_debug_lock_from_root_task()` and some of the `lock_tty_for_child().__a[enter/exit]()` logic into `Lock.[acquire/release]()` including a bunch more logging.
- move `lock_tty_for_child()` up in the module to below `Lock`, with some rework:
  - drop the `subactor_uid: tuple` arg since we can just use the `ctx`..
  - add exception handler blocks for reporting internal (impl) errors and always force release the lock in such cases.
- extend `DebugStatus` (prolly will rename to `DebugRequest` btw):
  - add `.req_ctx: Context` for the subactor side.
  - add `.req_finished: trio.Event` to the sub to signal request task exit.
  - extend the `.shield_sigint()` doc-str.
  - add `.release()` to encaps all the state mgmt previously strewn about inside `._pause()`..
- use the new `DebugStatus.release()` to replace all the duplication:
  - inside `PdbREPL.set_[continue/quit]()`.
  - inside `._pause()` for the subactor branch on internal repl-invocation error cases,
  - in the `_enter_repl_sync()` closure on error,
- replace `apply_debug_codec()` -> `apply_debug_pldec()` in tandem with the new `PldRx` sub-sys which handles the new `__pld_spec__`.
- add a new `pformat_cs()` helper orig to help debug cs stack corruption; going to move to `.devx.pformat` obvi.
- rename `wait_for_parent_stdin_hijack()` -> `request_root_stdio_lock()` with improvements:
  - better doc-str and added todos,
  - use `DebugStatus` more stringently to encaps all subactor req state.
  - error handling blocks for cancellation and straight up impl errors directly around the `.open_context()` block with the latter doing a `ctx.cancel()` to avoid hanging in the shielded `.req_cs` scope.
  - similar exc blocks for the func's overall body with explicit `log.exception()` reporting.
  - only set the new `DebugStatus.req_finished: trio.Event` in `finally`.
- rename `mk_mpdb()` -> `mk_pdb()` and don't call `.shield_sigint()` implicitly since the caller usage does matter for this.
- factor out `any_connected_locker_child()` from the SIGINT handler.
- rework the SIGINT handler to better handle any stale-lock/hang cases:
  - use the new `Lock.ctx_in_debug: Context` to detect a subactor-in-debug and use it to cancel any lock request instead of the lower level cs.
  - use a `problem: str` summary approach to log emissions.
- rework `_pause()` given all of the above, stuff not yet mentioned:
  - don't take `shield: bool` input and proxy to `debug_func()` (for now).
  - drop `extra_frames_up_when_async: int` usage, expect `**debug_func_kwargs` to passthrough an `api_frame: FrameType` (more on this later).
  - lotsa asserts around the request ctx vs. task-in-debug ctx using the new `current_ipc_ctx()`.
  - asserts around `DebugStatus` state.
- rework and simplify the `debug_func` hooks, `_set_trace()`/`_post_mortem()`:
  - make them accept a non-optional `repl: PdbREPL` and `api_frame: FrameType` which should be used to set the current frame when the REPL engages.
  - always hide the hook frames.
  - always accept a `tb: TracebackType` to `_post_mortem()`.
    |_ copy and re-impl what was the delegation to `pdbp.xpm()`/`pdbp.post_mortem()` and instead call the underlying `Pdb.interaction()` ourselves with a `caller_frame` and tb instance.
- adjust the public `.pause()` impl:
  - accept optional `hide_tb` and `api_frame` inputs.
  - mask opening a cancel-scope for now (can cause `trio` stack corruption, see notes) and thus don't use the `shield` input other than to eventually passthrough to `_post_mortem()`?
    |_ thus drop `task_status` support for now as well.
    |_ pretty sure the correct soln is a debug-nursery around `._invoke()`.
- since no longer using `extra_frames_up_when_async` inside `debug_func()`s, ensure all public apis pass an `api_frame`.
- re-impl our `tractor.post_mortem()` to directly call into `._pause()` instead of binding in via `partial` and mk it take similar input as `.pause()`.
- drop `Lock.release()` from `_maybe_enter_pm()`, expose and pass the expected frame and tb.
- use the necessary changes from all the above within `maybe_wait_for_debugger()` and `acquire_debug_lock()`.

Lel, sorry thought that would be shorter.. There's still a lot more re-org to do particularly with `DebugStatus` encapsulation but it's coming in follow up.

- set `._state._ctxvar_Context` just after `StartAck` inside `open_context_from_portal()` so that `current_ipc_ctx()` always works on the 'parent' side (see the `ContextVar` sketch below).
- always set `.canceller` to any `MsgTypeError.src_uid` and otherwise to any maybe-detected `.src_uid` (i.e. for RAEs).
- always set `.canceller` to us when we rx a ctxc which reports us as its canceller; this is a sanity check on definite "self cancellation".
- adjust `._is_self_cancelled()` logic to only be `True` when `._remote_error` is both a ctxc with a `.canceller` set to us AND when `Context.canceller` is also set to us (since the change above) as a little bit of extra rigor.
- fill-in/fix some `.repr_state` edge cases:
  - merge self-vs.-peer ctxc cases into one block and distinguish via a nested `._is_self_cancelled()` check.
  - set 'errored' for all exception matched cases despite `.canceller`.
  - add pre-`Return` phase statuses:
    |_ 'pre-started' and 'syncing-to-child' depending on side and when `._stream` has not (yet) been set.
    |_ 'streaming' and 'streaming-finished' depending on side when `._stream` is set and whether it was stopped/closed.
- tweak the drainage log-message to use "outcome" instead of "result".
- use the new `.devx.pformat.pformat_cs()` inside `_maybe_cancel_and_set_remote_error()` but IFF the log level is at least 'cancel'.
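A sketch of the `ContextVar`-based ctx tracking named above; `_ctxvar_Context`/`current_ipc_ctx()` are this PR's names while the setting-site helper is made up for illustration:

```python
from contextvars import ContextVar

# task-local (under `trio`) storage for "the ctx in scope":
_ctxvar_Context: ContextVar = ContextVar('ipc_context', default=None)

def current_ipc_ctx():
    # what app/endpoint code calls to get its in-scope `Context`
    return _ctxvar_Context.get()

def _on_start_ack(ctx) -> None:
    # runtime side: set just after the `StartAck` sync point inside
    # `open_context_from_portal()` so the parent side always resolves.
    _ctxvar_Context.set(ctx)
```

Since `contextvars` values are per-`trio.Task`, no task ever mutates a global setting, which is also the fix direction mentioned earlier for the codec-apply races.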
Proto-ing a little suite of call-stack-frame annotation-for-scanning sub-systems for the purposes of both:
- the `.devx._debug`er and its traceback and frame introspection needs when entering the REPL,
- detailed trace-style logging such that we can explicitly report on "which and where" `tractor`'s APIs are used in the "app" code.

Deats:
- change the mod name obvi from `._code` and adjust client mod imports.
- using `wrapt` (for perf) implement an `@api_frame` annot decorator which both stashes per-call-stack-frame instances of `CallerInfo` in a table and marks the function such that API endpoints can be easily found via runtime stack scanning despite any internal impl changes.
- add a global `_frame2callerinfo_cache: dict[FrameType, CallerInfo]` table for providing the per func-frame info caching.
- re-implement `CallerInfo` to require less (types of) inputs:
  |_ `_api_func: Callable`, a ref to the (singleton) func def.
  |_ `_api_frame: FrameType` taken from the `@api_frame` marked `tractor`-API func's runtime call-stack, from which we can determine the app code's `.caller_frame`.
  |_ `_caller_frames_up: int|None` allowing the specific `@api_frame` to determine "how many frames up" the application/calling code is.
  And, a better set of derived attrs:
  |_ `caller_frame: FrameType` which finds and caches the API-ep's calling frame.
- add a new attempt at "getting a method ref from its runtime frame" with `get_ns_and_func_from_frame()` using a heuristic that the `CodeType.co_qualname: str` should have a "." in it for methods.
  - the main issue is still that the func-ref lookup will require searching for the method's instance type by name, and that name isn't guaranteed to be defined in any particular ns..
    |_ rn we try to read it from the `FrameType.f_locals` but that is going to obvi fail any time the method is called in a module where its type is not also defined/imported.
  - returns both the ns and the func ref FYI.

Mostly the result of the `RemoteActorError.pformat()` and our new `_pause/crash_msg: str`s which include the `trio.Task.__repr__()` in the `log.pdb()` message. Obvi use `in_prompt_msg()` to accomplish this where not used prior.

ToDo later:
- [ ] still some outstanding questions on how detailed inceptions should look, eg. in `test_multi_nested_subactors_error_through_nurseries()`
  |_ maybe we should be more pedantic at checking `.src_uid` vs. `.relay_uid` fields?
- [ ] staged a placeholder test for verifying the correct call-stack frame on crash handler REPL entry.
- [ ] also need a test to verify that you can't pause from an already paused actor task such as can happen if you try to step through runtime code that has a recurrent entry to `._debug.pause()`.

Related to the prior patch, re the new `with_type_header: bool`:
- in the `with_type_header == True` use case make sure we keep the first `._message: str` line non-indented since it'll show just after the header-line's type path with ':'.
- when `False`, drop the `)>` `repr()`-instance style as well so that we just get the ascii boxed traceback as though it's the error message-`str`, not the `repr()` of the error obj.

Other:
- hide the `pack_from_raise()` call frame since it'll show in debug mode crash handling..
- mk `MsgTypeError.from_decode()` explicitly accept and proxy an optional `ipc_msg` and change `msgdict` to also be optional, only reading out the `**extra_msgdata` when provided.
- expose a `_mk_msg_type_err(src_err_msg: Error|None = None,)` for callers who wish to inject a `._ipc_msg: MsgType` to the MTE.
  |_ add a note how we can't use it due to a causality-dilemma when pld-validating `Started` on the send side..

Such that if caught by user code and/or the runtime we can introspect the original msg which caused the type error. Previously this was kinda half-baked with a `.msg_dict` which was delivered from an `Any`-decode of the shuttle msg in `_mk_msg_type_err()`, but now this more explicitly refines the API and supports both `PayloadMsg`-instance or the msg-dict style injection:
- allow passing either of `bad_msg: PayloadMsg|None` or `bad_msg_as_dict: dict|None` to `MsgTypeError.from_decode()`.
- expose public props for both ^ whilst dropping the prior `.msgdict`.
- rework `.from_decode()` to explicitly accept `**extra_msgdata: dict`
  |_ only overriding it from any `bad_msg_as_dict` if the keys are found in `_ipcmsg_keys`, **except** for `_bad_msg` when `bad_msg` is passed.
  |_ drop the `.ipc_msg` passthrough.
  |_ drop the `msgdict` input.
- adjust `.cid` to only pull from the `.bad_msg` if set.

Related fixes/adjustments:
- `pack_from_raise()` should pull `boxed_type_str` from `boxed_type.__name__`, not the `type()` of it.. also add a `hide_tb: bool` flag.
- don't include `_msg_dict` and `_bad_msg` in the `_body_fields` set.
- allow more granular boxed traceback-str controls:
  |_ allow passing a `tb_str: str` explicitly in which case we use it verbatim and presume the caller knows what they're doing.
  |_ when not provided, use the more explicit `traceback.format_exception(exc)` since the error instance is a required input (we still fall back to the old `.format_exc()` call if for some reason the caller passes `None`; but that should be a bug right?).
  |_ if a `tb: TracebackType` and a `tb_str` is passed, concat them.
- in `RemoteActorError.pformat()` don't indent the `._message` part used for the `body` when `with_type_header == False`.
- update `_mk_msg_type_err()` to use `bad_msg`/`bad_msg_as_dict` appropriately and drop passing `ipc_msg`.

Originally discovered while using `tractor.pause_from_sync()` from the `i3ipc` client running in a bg-thread that uses `asyncio` inside `modden`. Turns out we definitely aren't correctly handling `.pause_from_sync()` from the root actor when called from a `trio.to_thread.run_sync()` bg thread:
- root-actor bg threads can't `Lock._debug_lock.acquire()` since they aren't in `trio.Task`s.
- even if scheduled via `.to_thread.run_sync(_debug._pause)` the acquirer won't be the task/thread which calls `Lock.release()` from `PdbREPL` hooks; this results in an RTE raised by `trio`..
- multiple threads will step on each other's stdio since cpython's GIL seems to ctx-switch threads on every input from the user to the REPL loop..

Reproduce via reworking our example and test so that they catch and fail for all edge cases:
- rework the `/examples/debugging/sync_bp.py` example to demonstrate the above issues, namely the stdio clobbering in the REPL when multiple threads and/or a subactor try to debug simultaneously.
  |_ run one thread using a task nursery to ensure it runs conc with the nursery's parent task.
  |_ ensure the bg threads run conc with a subactor usage of `.pause_from_sync()`.
  |_ gravely detail all the special cases inside a TODO comment.
  |_ add some control flags to the `sync_pause()` helper and don't use `breakpoint()` by default.
- extend and adjust `test_debugger.test_pause_from_sync` to match (and thus currently fail) by ensuring exclusive `PdbREPL` attachment when the 2 bg root-actor threads are concurrently interacting alongside the subactor:
  |_ should only see one of the `_pause_msg` logs at a time for either one of the threads or the subactor.
  |_ ensure each attaches (in no particular order) before expecting the script to exit.

Impl adjustments to `.devx._debug`:
- drop `Lock.repl`, no longer used.
- add `Lock._owned_by_root: bool` for the `.ctx_in_debug == None` root-actor-task active case.
- always `log.exception()` for any `._debug_lock.release()` ownership RTE emitted by `trio`, like we used to..
- add a special `Lock.release()` log message for the stale-lock but `._owned_by_root == True` case; oh yeah and actually `log.devx(message)`..
- rename `Lock.acquire()` -> `.acquire_for_ctx()` since it's only ever used from subactor IPC usage; well that and for local root-task usage we should prolly add a `.acquire_from_root_task()`?
- buncha `._pause()` impl improvements:
  |_ type `._pause()`'s `debug_func` as a `partial` as well.
  |_ offer `called_from_sync: bool` and `called_from_bg_thread: bool` for the special case handling when called from `.pause_from_sync()`
  |_ only set `DebugStatus.repl/repl_task` when `debug_func != None` (OW ensure the `.repl_task` is not the current one).
  |_ handle error logging even when `debug_func is None`..
  |_ lotsa detailed commentary around root-actor-bg-thread special cases.
- when `._set_trace(hide_tb=False)` do `pdbp.set_trace(frame=currentframe())` so the `._debug` internal frames are always included.
- by default always hide tracebacks for `.pause[_from_sync]()` internals.
- improve `.pause_from_sync()` to avoid root-bg-thread crashes:
  |_ pass the new `called_from_xxx_` flags and ensure `DebugStatus.repl_task` is actually set to the `threading.current_thread()` when needed.
  |_ manually call `Lock._debug_lock.acquire_nowait()` for the non-bg thread case.
  |_ TODO: still need to implement the bg-thread case using a bg `trio.Task`-in-thread with a `trio.Event` set by thread REPL exit (see the scheduling sketch below).
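The core scheduling hop those bg-thread cases require, sketched with only public `trio` APIs (no `tractor` machinery):

```python
import threading
import trio

async def main():
    async def acquire_in_trio_main_thread() -> None:
        # stand-in for task-only ops like `Lock._debug_lock.acquire()`
        print(f'scheduled from {threading.current_thread().name}')

    def sync_code_in_bg_thread() -> None:
        # what a `.pause_from_sync()` style caller must do: hop back
        # into the `trio` run-loop since bg threads aren't `trio.Task`s.
        trio.from_thread.run(acquire_in_trio_main_thread)

    await trio.to_thread.run_sync(sync_code_in_bg_thread)

trio.run(main)
```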
Functionally working for multi-threaded (via cpython threads spawned from `trio.to_thread.run_sync()`) alongside subactors, tested (for now) only with threads started inside the root actor (which seemed to have the most issues in terms of the impl and special cases..) using the new `tractor.pause_from_sync()` API!

Main implementation changes to `.pause_from_sync()`
------
- from the root actor, we need to ensure the bg thread case is handled *specially* since no IPC is used to request the TTY stdio mutex and `Lock` (API) usage is conducted entirely from a local task or thread; dedicated `Lock` usage for the root-actor already is branched inside `._pause()` and needs similar handling from a root bg-thread:
  |_ for the special case of a root bg thread we need to `trio`-main-thread schedule a bg task inside a new `_pause_from_bg_root_thread()`. The new task needs to implement most of what is handled inside `._pause()` manually, mostly because in this root-actor-bg-thread case we have 2 constraints:
    1. to enter `PdbREPL.interaction()` **from the bg thread** directly,
    2. the task that `Lock._debug_lock.acquire()`s has to be the same one that calls `.release()` (a `trio.FIFOLock` constraint, demo-ed below).
  |_ impl deats of this `_pause_from_bg_root_thread()` include:
    - (for now) calling `._pause()` to acquire the `Lock._debug_lock`.
    - setting its own `DebugStatus.repl_release`.
    - calling `DebugStatus.shield_sigint()` to ensure the root's main thread uses the right handler when the bg one is REPL-ing.
    - waiting manually on the `.repl_release()` to be set by the thread's dedicated `PdbREPL` exit.
    - manually calling `Lock.release()` from the **same task** that acquired it.
- expect calls to `._pause()` to deliver a `tuple[Task, PdbREPL]` such that we always get a handle both to any newly created REPL instance and (maybe) the scheduled bg task within which it runs.
- add a single `message: str` style to `log.devx()` based on branching style for logging.
- ensure both `DebugStatus.repl` and `.repl_task` are set **just before** calling `._set_trace()` to ensure the correct `Task|Thread` is set when the REPL is finally entered from sync code.
- add a wrapping caller `_sync_pause_from_builtin()` which passes in the new `called_from_builtin=True` to indicate `breakpoint()` caller usage, obvi pass in the `api_frame`.

Changes to `._pause()` in support of ^
------
- `TaskStatus.started()` and return the `tuple[Task, PdbREPL]` to callers/starters.
- only call `DebugStatus.shield_sigint()` when no `repl` is passed bc some callers (like bg threads) may need to apply it at some specific point themselves.
- tweak some asserts for the `debug_func == None` / non-`trio`-thread case.
- add a mod-level `_repl_fail_msg: str` to be used when there's an internal `._pause()` failure, for testing; easier to pexpect match.
- more comprehensive logging for the root-actor branched case to (attempt to) indicate any of the 3 cases:
  - a remote ctx from a subactor has the `Lock`,
  - an already existing root task or thread has it or,
  - some kinda stale `.locked()` situation where the root has the lock but we don't know why.
- for root usage, revert to always `await Lock._debug_lock.acquire()`-ing despite `called_from_sync` since `.pause_from_sync()` was reworked to instead handle the special bg thread case in the new `_pause_from_bg_root_thread()` task.
- always do `return _enter_repl_sync(debug_func)`.
- try to report any `repl_task: Task|Thread` set by the caller (particularly for the bg thread cases) as being the thread or task `._pause()` was called "on behalf of".

Changes to `DebugStatus`/`Lock` in support of ^
------
- only call `Lock.release()` from `DebugStatus.set_[quit/continue]()` when called from the main `trio` thread and always call `DebugStatus.release()` **after** to ensure `.repl_released()` is set **after** `._debug_lock.release()`.
- only call `.repl_release.set()` from the `trio` thread, otherwise use `.from_thread.run()`.
- much more refinement in `Lock.release()` for the threading cases:
  - return a `bool` to indicate whether the lock was released by the caller.
  - mask (in prep to drop) `_pause()` usage of `Lock.release(force=True)` since forcing a release can't ever avoid the RTE from `trio`.. the same task **must** acquire/release.
  - don't allow usage from non-`trio`-main-threads, ever; there's no point given the same-task-needs-to-manage-`FIFOLock` constraint.
  - much more detailed logging using the `message`-building-style for all caller (edge) cases.
    |_ use a `we_released: bool` to determine failed-to-release edge cases which can happen if called from bg threads; ensure we `log.exception()` on any incorrect usage resulting in release failure.
    |_ complain loudly if the release fails and some other task/thread still holds the lock.
    |_ be explicit about "who" (which task or thread) the release is "on behalf of" by reading `DebugStatus.repl_task` since the caller isn't the REPL operator in many sync cases.
  - more or less drop `force` support, as mentioned above.
  - ensure we unset `._owned_by_root` if the caller is a root task.

Other misc
------
- rename `lock_tty_for_child()` -> `lock_stdio_for_peer()`.
- rejig `Lock.repr()` to show lock and event stats.
- stage `Lock.stats` and `.owner` methods in prep for doing a singleton instance and `@property`s.
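The `trio` same-task acquire/release constraint driving much of the above, demonstrated standalone:

```python
import trio

async def main():
    lock = trio.Lock()

    async def acquirer():
        await lock.acquire()  # NOTE: never released by this task

    async with trio.open_nursery() as tn:
        tn.start_soon(acquirer)
        await trio.sleep(0.1)
        try:
            lock.release()  # we are not the owning task..
        except RuntimeError as rte:
            print(f'trio enforces same-task release: {rte}')

trio.run(main)
```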
The tests only use one input spec (conveniently) so there's not much to change in the logic:
- only pass the `maybe_msg_spec` to the child-side decorator (see the sketch after these notes) and obvi drop the surrounding `msgops.limit_plds()` block in the child.
- tweak a few `MsgDec` asserts, mostly dropping the `msg._ops._def_any_spec` state checks since the child-side won't have any pre-pld-spec state given the runtime now applies the `pld_spec` before running the task's func body.
- also allowed dropping the `finally:` which did a similar check outside the `.limit_plds()` block.
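For reference, a sketch of the decorator-declared spec style these tests moved to, per the `@tractor.context(pld_spec=...)` API described in the synopsis; the payload type here is made up:

```python
import msgspec
import tractor

class Measurement(msgspec.Struct):
    value: float

@tractor.context(pld_spec=Measurement | None)
async def child_ep(ctx: tractor.Context) -> None:
    # `.started()`'s pld is validated against the decorator's spec by
    # the runtime-applied `PldRx`; no explicit `msgops.limit_plds()`
    # block is needed in the child body anymore.
    await ctx.started(Measurement(value=0.1))
```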
Well if @guille approves..