Turns out we weren't using it despite the optional
`stream_handler_nursery` input to `Server.listen_on()`; fail over to the
`Server._stream_handler_tn` allocated during server setup in those cases.
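A hedged sketch of the fallback (the `resolve_handler_tn` name and
duck-typing are illustrative, not the actual `Server` impl):

```python
import trio

def resolve_handler_tn(
    server,  # `.ipc._server.Server`-like, duck-typed for this sketch
    stream_handler_nursery: trio.Nursery|None = None,
) -> trio.Nursery:
    '''
    Fall back to the server's own handler tn (allocated during
    server setup) when the caller didn't pass one to `.listen_on()`.

    '''
    return (
        stream_handler_nursery
        or
        server._stream_handler_tn
    )
```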
Turns out I didn't read my own internals docs/comments; despite it
not being used previously, this adds the real use case: a root,
per-actor scope which ensures parent comms are the last conc-thing to
be cancelled.
Also, the impl changes here make the test from 6410e45 (or whatever
it's rebased to) pass, i.e. we can support crash handling in the root
actor despite the root-tn having been (self-)cancelled.
Superficial adjustments,
- rename `Actor._service_n` -> `._service_tn` everywhere.
- add asserts to `._runtime.async_main()` which ensure that any
`.trionics.maybe_open_nursery()` calls against an optionally passed
`._[root/service]_tn` allocate-if-not-provided (the
`._service_tn`-case being an i-guess-prep-for-the-future-anti-pattern
Bp); see the sketch after this list.
- obvi adjust all internal usage to match new naming.
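A hedged sketch of that allocated-if-not-provided assert pattern
(`passed_tn` is a hypothetical stand-in for the optionally passed
`._[root/service]_tn`):

```python
import trio
from tractor.trionics import maybe_open_nursery

async def main():
    passed_tn: trio.Nursery|None = None  # or a caller-provided tn

    async with maybe_open_nursery(passed_tn) as tn:
        # the allocated-if-not-provided semantics,
        if passed_tn is not None:
            assert tn is passed_tn
        else:
            assert isinstance(tn, trio.Nursery)

trio.run(main)
```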
Serious/real-use-case changes,
- add (back) an `Actor._root_tn` which sits a scope "above" the
service-tn and is either,
+ assigned in `._runtime.async_main()` for sub-actors OR,
+ assigned in `._root.open_root_actor()` for the root actor.
**THE primary reason** to keep this "upper" tn is that during
a full-`Actor`-cancellation condition (more details below) we want to
ensure that the IPC connection with a sub-actor's parent is **the last
thing to be cancelled**; this is most simply implemented by ensuring
that the `Actor._parent_chan: .ipc.Channel` is handled in an upper
scope in `_rpc.process_messages()`-subtask-terms.
- for the root actor this `root_tn` is allocated in `.open_root_actor()`
body and assigned as such.
- extend `Actor.cancel_soon()` to be cohesive with this entire teardown
"policy" (sketched below) by scheduling a task in the `._root_tn` which,
* waits for the `._service_tn` to complete and then,
* cancels the `._root_tn.cancel_scope`,
* includes "sclangy" console logging throughout.
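A hedged, self-contained sketch of that 2-layer teardown policy (a
stand-in class, not the real `._runtime.Actor`; the event-based wait
for service-tn completion is illustrative):

```python
import trio

class ActorSketch:
    '''
    Stand-in demoing only the root-tn/service-tn layering and the
    `cancel_soon()` ordering described above.

    '''
    def __init__(self):
        self._root_tn: trio.Nursery|None = None
        self._service_tn: trio.Nursery|None = None
        self._service_exited = trio.Event()

    def cancel_soon(self) -> None:
        # schedule the policy task in the *root* tn so it outlives
        # the service scope being cancelled.
        self._root_tn.start_soon(self._wait_then_cancel_root)

    async def _wait_then_cancel_root(self) -> None:
        self._service_tn.cancel_scope.cancel()
        # wait for the service tn to complete and then..
        await self._service_exited.wait()
        # ..cancel the upper scope; parent comms (handled in
        # root-tn terms) are thus the last thing to go.
        self._root_tn.cancel_scope.cancel()

    async def async_main(self) -> None:
        async with trio.open_nursery() as root_tn:
            self._root_tn = root_tn
            async with trio.open_nursery() as service_tn:
                self._service_tn = service_tn
                await trio.sleep_forever()  # rpc/service tasks live here
            self._service_exited.set()

async def main():
    actor = ActorSketch()
    async with trio.open_nursery() as tn:
        tn.start_soon(actor.async_main)
        await trio.sleep(0.1)  # let the "actor" run a bit
        actor.cancel_soon()

trio.run(main)
```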
Such that we audit the `shield=root_tn.cancel_scope.cancel_called`
passed to `await debug._maybe_enter_pm()` in the `open_root_actor()`
exit handler block.
I masked it because it doesn't seem to actually work for the case I was
testing (`emsd` clobbering a `paperboi` in `piker`..) but figured I'd
leave it as a reminder for solving this problem more generally (#320)
since this is likely the place in the code for a soln.
When I tested it in my case it just resulted in a hang around the `with
debug.acquire_debug_lock()` for some reason? Can't remember if the child
ended up being able to REPL without issue though..
Such that we handle them despite a cancellation condition. It is almost
always the case that `root_tn.cancel_scope.cancel_called` is set by the
time `debug._maybe_enter_pm()` hits. Previously I guess we just weren't
ever actually REPL-debugging such cases?
TODO, still needs a test obvi!
That is, the `target` can declare a `chan: LinkedTaskChannel` instead
of `to_trio`/`from_aio`.
To support it,
- change `.started()` -> the more appropriate `.started_nowait()` which
can be called sync from the aio child task.
- adjust the `provide_channels` assert to accept either fn sig
declaration (for now).
Still needs test(s) obvi..
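A hedged sketch of the new-style `target` sig (the handshake value is
illustrative and the spawn plumbing, i.e. running inside an
`infect_asyncio=True` actor, is presumed):

```python
import asyncio
from tractor.to_asyncio import (
    LinkedTaskChannel,
    open_channel_from,
)

async def aio_child(
    chan: LinkedTaskChannel,  # new style: one param, not `to_trio`/`from_aio`
) -> None:
    # sync-callable from the aio child task, no `await` needed
    chan.started_nowait('ready')
    await asyncio.sleep(0.1)

async def trio_parent() -> None:
    # NOTE: presumes an actor spawned with `infect_asyncio=True`
    async with open_channel_from(aio_child) as (first, chan):
        assert first == 'ready'
```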
Makes the newly added `test_aio_side_raises_before_started` test pass by
ensuring errors raised by any `.to_asyncio.open_channel_from()`-spawned
child-`asyncio.Task` are relayed on any caught `trio.EndOfChannel` by
checking a new `LinkedTaskChannel._closed_by_aio_task: bool` flag.
Impl deats,
- obvi add `LinkedTaskChannel._closed_by_aio_task: bool = False`
- in `translate_aio_errors()` always check for the new flag on EoC
conditions and in such cases set `chan._trio_to_raise = aio_err` such
that the `trio`-parent-task always raises the child's exception
directly, OW keep the original EoC passthrough in place.
- include *very* detailed per-case comments around the extended handler.
- adjust the re-raising logic with a new `raise_from` where we only give
the `aio_err` priority if it's not already set as the `trio_to_raise`.
Also,
- hide the `_run_asyncio_task()` frame by def.
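A hedged sketch of just the flag-check decision (duck-typed `chan`;
the real logic is embedded inside `translate_aio_errors()`):

```python
import trio

def resolve_eoc_to_raise(
    chan,  # `LinkedTaskChannel`-like, duck-typed for this sketch
    eoc: trio.EndOfChannel,
    aio_err: BaseException|None,
) -> BaseException:
    '''
    Decide what the trio-parent task should raise on an EoC
    condition from the linked aio child task.

    '''
    if (
        chan._closed_by_aio_task
        and aio_err is not None
    ):
        # EoC was due to the aio child erroring; relay the
        # child's exc directly to the trio-parent task.
        chan._trio_to_raise = aio_err
        return aio_err

    # OW, original EoC passthrough
    return eoc
```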
Verifying that if any exc is raised pre `chan.send_nowait()` (our
currently shite version of a `chan.started()`) then that exc is indeed
raised through to the `trio`-parent task side. This case was reproduced
from a `piker.brokers.ib` issue with a similar embedded
`.trionics.maybe_open_context()` call.
Deats,
- call the suite `test_aio_side_raises_before_started`.
- mk the `@context` simply `maybe_open_context(acm_func=open_channel_from)`
with a `target=raise_before_started` which,
* simply sleeps then immediately raises a RTE.
- expect the RTE from the aio-child-side to propagate all the way up to
the root-actor's task right up through the `trio.run()`.
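A hedged reconstruction of the suite's shape (the real test lives in
the repo; `pytest` plumbing and actor spawning are elided/presumed):

```python
import asyncio
import tractor
from tractor import to_asyncio
from tractor.trionics import maybe_open_context

async def raise_before_started(
    chan: to_asyncio.LinkedTaskChannel,
) -> None:
    # sleep then raise *before* any started-handshake send
    await asyncio.sleep(0.2)
    raise RuntimeError('raised before started!')

@tractor.context
async def caller(ctx: tractor.Context) -> None:
    async with maybe_open_context(
        acm_func=to_asyncio.open_channel_from,
        kwargs={'target': raise_before_started},
    ) as (cache_hit, (first, chan)):
        raise AssertionError('should never get here!')

# the aio-child's RTE is expected to propagate all the way up,
# e.g. `with pytest.raises(RuntimeError): trio.run(main)`
```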
Been meaning to do this forever and a recent test hang finally drove me
to it Bp
Like it sounds, adopt on `ActorNursery` the "cancel-status" properties
already used on our `Context` and derived from `trio.CancelScope`:
- add new private `._cancel_called` (set in the head of `.cancel()`)
& `._cancelled_caught` (set in the tail) instance vars with matching
read-only `@properties`.
- drop the instance-var and instead delegate a `.cancelled: bool`
property to `._cancel_called` and add a usage deprecation warning
(since removing it breaks a buncha tests).
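A hedged stand-in showing just that new surface (not the real
`ActorNursery` from `._supervise`):

```python
import warnings

class ActorNurserySketch:
    _cancel_called: bool = False
    _cancelled_caught: bool = False

    @property
    def cancel_called(self) -> bool:
        return self._cancel_called

    @property
    def cancelled_caught(self) -> bool:
        return self._cancelled_caught

    @property
    def cancelled(self) -> bool:
        # deprecated alias kept (for now) since removing it
        # breaks a buncha tests..
        warnings.warn(
            '`ActorNursery.cancelled` is deprecated, '
            'use `.cancel_called` instead!',
            DeprecationWarning,
            stacklevel=2,
        )
        return self._cancel_called

    async def cancel(self) -> None:
        self._cancel_called = True  # set in the head
        # ... the actual multi-(sub)actor cancellation ...
        self._cancelled_caught = True  # set in the tail
```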
So I need to either adjust the tests or figure out if/why this is needed
to avoid the crashing in `pikerd` I found when killing the chart during
a long backfill with the `binance` backend..
Just like in the BRE (`BrokenResourceError`) case for UDS, it seems when
a peer closes the (UDS?) socket `trio` instead raises
a `ClosedResourceError`, which we now catch and re-raise as
a `TransportClosed`. This again results in `tpt.send()` calls from the
rpc-runtime **not** raising when it's known that the IPC channel is
disconnected.
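A hedged sketch of the catch-and-re-raise (the exception type is a
stand-in for `tractor`'s graceful `TransportClosed`):

```python
import trio

class TransportClosed(Exception):
    # stand-in; the graceful "tpt closed" error type
    ...

async def tpt_send(stream: trio.abc.SendStream, data: bytes) -> None:
    try:
        await stream.send_all(data)
    except trio.ClosedResourceError as cre:
        # peer closed the (UDS) socket; relay as a graceful
        # transport-closed error which the rpc-runtime can then
        # suppress on known-disconnected chans.
        raise TransportClosed(
            'Peer closed the transport!?'
        ) from cre
```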
Shift around comments and expressions for better reading, assign
`tpt_closed` for easier introspection from the REPL during debug, oh and
fix the `MsgpackTransport.pformat()` to render '|_peers: 1' .. XD
Similar to what was just changed for `Context.repr_state`, when the
child task is cancelled but by a different "layer" of the runtime (i.e.
a `Portal.cancel_actor()` / `SIGINT`-to-process canceller) we don't
dump a traceback, instead we just emit a `log.cancel()`.
Such that if the local task hasn't resolved but is `trio.Cancelled` and
a `.canceller` was set, we report an `'actor-cancelled'` from
`.repr_state: str`. Bit of formatting to avoid needless newlines too!
Since it's merely a local-file-sys subdirectory, there should be no
reason file creation conflicts with other bind spaces.
Also add 2 test suites to match,
- `tests/ipc/test_each_tpt::test_uds_bindspace_created_implicitly` to
verify the dir creation when it DNE.
- `..test_uds_double_listen_raises_connerr` to ensure a double bind
raises a `ConnectionError` from the src `OSError`.
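A hedged sketch of the implicit bindspace creation (fn name/params are
illustrative):

```python
from pathlib import Path

def ensure_bindspace(bindspace: Path) -> Path:
    '''
    Implicitly create the UDS bindspace subdir when it DNE; it's
    merely a local-file-sys subdirectory after all.

    '''
    bindspace.mkdir(
        parents=True,
        exist_ok=True,  # no conflict with other bind spaces
    )
    return bindspace
```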
Using `.get_stream_addrs()` such that we always (*can*) assign the peer
end's PID in the `._raddr`.
Also factor common `ConnectionError` re-raising into
a `_reraise_as_connerr()`-@cm.
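A hedged sketch of the factored-out re-raiser (param names are
illustrative):

```python
from contextlib import contextmanager

@contextmanager
def _reraise_as_connerr(
    src_excs: tuple[type[Exception], ...] = (OSError,),
):
    try:
        yield
    except src_excs as src_err:
        raise ConnectionError(
            f'IPC transport connection failed!?\n'
            f'{src_err!r}'
        ) from src_err
```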
Such that the caller can be responsible for their own (nursery) scoping
as needed and, for the latter fn's case, with
a `trio.Nursery.CancelStatus.encloses()` check to ensure the `tn` is
a valid parent-ish.
Some deats,
- in `gather_contexts()`, mv the `try/finally` outside the nursery block
to ensure we always do the `parent_exit`.
- for `maybe_open_context()` we do a naive task-tree hierarchy audit
(sketched after this list) to ensure the provided scope is not *too*
child-ish (with what APIs `trio` gives us, see above), OW go with the
old approach of using the actor's private service nursery.
Also,
* better report `trio.Cancelled` around the cache-miss `yield`
cases and ensure we **never** unmask triggering key-errors.
* report on any stale-state with the mutex in the `finally` block.
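A hedged sketch of that hierarchy audit, leaning on (private!) `trio`
internals per the `CancelStatus.encloses()` mention above; the attr
names here are from `trio`'s internals and may vary by version:

```python
import trio

def tn_is_parentish(tn: trio.Nursery) -> bool:
    '''
    Naively audit that the provided tn isn't *too* child-ish
    relative to the current task, i.e. its cancel-status encloses
    the current task's.

    '''
    task = trio.lowlevel.current_task()
    return tn._cancel_status.encloses(task._cancel_status)
```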
Call it `test_lock_not_corrupted_on_fast_cancel()` and include
a detailed doc string to explain. Implemented it "cleverly" by having
the target `@acm` cancel its parent nursery after a peer, cache-hitting,
task is already waiting on the task-mutex release.
Since the `await service_n.start()` on key-err can be cancel-masked
(checkpoint interrupted before `_Cache.run_ctx` completes), we need to
always `lock.release()` in a `finally:` to avoid lock-owner-state
corruption and/or inf-hangs in peer cache-hitting tasks.
Deats,
- add a `try/except/finally` around the key-err triggered cache-miss
`service_n.start(_Cache.run_ctx, ..)` call, reporting on any taskc
and always `finally` unlocking (see the sketch after this list).
- fill out some log msg content and use `.debug()` level.
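A hedged sketch of that cache-miss guard (duck-typed args; the real
code logs at `.debug()` level instead of printing):

```python
import trio

async def run_cache_miss(
    service_tn: trio.Nursery,
    lock: trio.Lock,  # must already be acquired by this task
    run_ctx,  # `_Cache.run_ctx`-style fn (takes `task_status`), duck-typed
) -> None:
    try:
        await service_tn.start(run_ctx)
    except trio.Cancelled as taskc:
        # the `.start()` checkpoint can be interrupted before
        # `run_ctx` completes (cancel-masking); report it.
        print(f'Cache-miss run was cancelled:\n{taskc!r}')
        raise
    finally:
        # ALWAYS release to avoid lock-owner-state corruption
        # and/or inf-hangs in peer cache-hitting tasks.
        lock.release()
```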
Here I was thinking the bcaster (usage) maybe required a rework but,
NOPE, it's just bc a checkpoint was needed in the parent task owning the
`tn` which spawns `get_sub_and_pull()` tasks, to ensure the bg-allocated
`an`/portal is eventually cancel-called..
Ah well, at least I started a patch for `MsgStream.subscribe()` to make
it multicast-revertible.. XD
Anyway, I tossed in some checks & notes related to all that unnecessary
effort since I do think I'll move forward implementing it:
- for the `cache_hit` case always verify that the `bcast` clone is
unregistered from the common state subs after
`.subscribe().__aexit__()`.
- do a light check that the implicit `MsgStream._broadcaster` is always
the only bcrx instance left-leaked into that state.. that is, until
I get the proper de-allocation/reversion from multicast -> unicast
working.
- put in mega detailed note about the required parent-task checkpoint.
Since I recently discovered a very subtle race-case that can sometimes
cause the suite to hang, seemingly due to the `an: ActorNursery`
allocated *behind* the `.trionics.maybe_open_context()` usage; this can
result in never cancelling the 'streamer' subactor despite the `main()`
timeout-guard?
This led me to dig in and find that the underlying issue was 2-fold,
- our `BroadcastReceiver` termination-mgmt semantics in
`MsgStream.subscribe()` can result in the first subscribing task
always keeping the `MsgStream._broadcaster` instance allocated; it's
never `.aclose()`ed, which makes it tough to determine (and thus
trace) when all subscriber-tasks are actually complete and
exited-from-`.subscribe()`..
- I was shield-waiting `.ipc._server.Server.wait_for_no_more_peers()` in
`._runtime.async_main()`'s shutdown sequence which would then compound
the issue resulting in a SIGINT-shielded hang.. the worst kind XD
Actual changes here are just styling, printing, and some mucking with
passing the `an`-ref up to the parent task in the root-actor where I was
doing a conditional `ActorNursery.cancel()` to make sure that was
actually the problem. Presuming this is fixed, the `.pause()` I left
unmasked should never hit.
As mentioned in the prior testing commit, it can cause the worst kind of
hangs, the SIGINT-ignoring kind.. Pretty sure there was never any reason
for it outside some esoteric multi-actor debugging case, and pretty sure
that was already solved?