I masked it because it doesn't seem to actually work for the case I was
testing (`emsd` clobbering a `paperboi` in `piker`..) but figured I'd
leave it as a reminder for solving this problem more generally (#320),
since this is likely the place in the code for a solution.
When I tested it in my case it just resulted in a hang around the `with
debug.acquire_debug_lock()` block for some reason? Can't remember if the
child ended up being able to REPL without issue though..
Such that we handle them despite a cancellation condition. This is
almost always the case: `root_tn.cancel_scope.cancel_called` is already
set by the time `debug._maybe_enter_pm()` hits. Previously I guess we
just weren't ever actually REPL-debugging such cases?
TODO, still needs a test obviously!
That is, the `target` can declare a `chan: LinkedTaskChannel` instead of
the `to_trio`/`from_aio` pair (see the sketch below).
To support it,
- change `.started()` -> the more appropriate `.started_nowait()` which
  can be called synchronously from the aio child task.
- adjust the `provide_channels` assert to accept either fn-sig
  declaration (for now).
Still needs test(s) obviously..
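As a rough illustration (not the actual suite; only the `chan:` sig and
`.started_nowait()` come from this change, everything else here is an
assumed/minimal usage), a new-style `target` might look like:

```python
from tractor.to_asyncio import LinkedTaskChannel, open_channel_from


async def aio_child(chan: LinkedTaskChannel) -> None:
    # runs as an `asyncio.Task`; note the single `chan` param
    # replacing the prior `to_trio`/`from_aio` pair.
    chan.started_nowait('first-value')  # sync-callable from the aio side


async def trio_side() -> None:
    # NOTE: must run inside an actor spawned with `infect_asyncio=True`;
    # the sig-detection for `chan:`-style targets is assumed per above.
    async with open_channel_from(aio_child) as (first, chan):
        assert first == 'first-value'
```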
Makes the newly added `test_aio_side_raises_before_started` test pass by
ensuring errors raised by any `.to_asyncio.open_channel_from()`-spawned
child `asyncio.Task` are relayed on any caught `trio.EndOfChannel`, by
checking a new `LinkedTaskChannel._closed_by_aio_task: bool` flag.
Impl details,
- obviously add `LinkedTaskChannel._closed_by_aio_task: bool = False`.
- in `translate_aio_errors()` always check for the new flag on EoC
  conditions and in such cases set `chan._trio_to_raise = aio_err` such
  that the `trio`-parent-task always raises the child's exception
  directly; otherwise keep the original EoC passthrough in place (see
  the paraphrased handler sketch below).
- include *very* detailed per-case comments around the extended handler.
- adjust the re-raising logic with a new `raise_from` where we only give
  the `aio_err` priority if it's not already set as the `trio_to_raise`.
Also,
- hide the `_run_asyncio_task()` frame by default.
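That extended EoC branch, paraphrased (only `._closed_by_aio_task` and
`._trio_to_raise` come from this change; `._aio_err` as the slot holding
the child's captured exc is an assumption, and this is not the real
`translate_aio_errors()` body):

```python
from contextlib import asynccontextmanager as acm

import trio


@acm
async def relay_aio_err_on_eoc(chan):
    '''
    Core of the new EoC handling described above, not the verbatim
    impl.

    '''
    try:
        yield chan
    except trio.EndOfChannel:
        if (
            chan._closed_by_aio_task
            and (aio_err := chan._aio_err)
        ):
            # relay the aio child's exc directly to the trio parent
            # instead of a bare EoC.
            chan._trio_to_raise = aio_err
        else:
            # original EoC passthrough stays in place.
            raise
```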
Verifying that if any exc is raised pre `chan.send_nowait()` (our
currently shite version of a `chan.started()`) then that exc is indeed
raised through on the `trio`-parent-task side. This case was reproduced
from a `piker.brokers.ib` issue with a similar embedded
`.trionics.maybe_open_context()` call.
Details,
- call the suite `test_aio_side_raises_before_started`.
- make the `@context` simply `maybe_open_context(acm_func=open_channel_from)`
  with a `target=raise_before_started` which,
  - simply sleeps then immediately raises an RTE (sketched below).
- expect the RTE from the aio-child-side to propagate all the way up to
  the root-actor's task, right up through the `trio.run()`.
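The child-side `target` is about as simple as it sounds; roughly (exact
param names assumed):

```python
import asyncio


async def raise_before_started(to_trio, from_aio) -> None:
    # aio-child target: sleep, then raise *before* ever sending the
    # first ("started") value back to the trio parent.
    await asyncio.sleep(0.2)
    raise RuntimeError('raised before `started`!')
```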
Been meaning to do this forever and a recent test hang finally drove me
to it Bp
Like it sounds, adopt the "cancel-status" properties on `ActorNursery`
already used on our `Context` and derived from `trio.CancelScope` (see
the fragment below):
- add new private `._cancel_called` (set in the head of `.cancel()`)
  & `._cancelled_caught` (set in the tail) instance vars with matching
  read-only `@properties`.
- drop the old `.cancelled` instance-var and instead delegate a
  `.cancelled: bool` property to `._cancel_called`, adding a usage
  deprecation warning (since removing it outright breaks a bunch of
  tests).
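The property shape being adopted, as a paraphrased fragment (the real
class lives in `tractor._supervise` and differs in detail):

```python
import warnings


class ActorNursery:  # fragment, not the full class
    def __init__(self) -> None:
        self._cancel_called: bool = False
        self._cancelled_caught: bool = False

    @property
    def cancel_called(self) -> bool:
        '''Set in the head of `.cancel()`.'''
        return self._cancel_called

    @property
    def cancelled_caught(self) -> bool:
        '''Set in the tail of `.cancel()`.'''
        return self._cancelled_caught

    @property
    def cancelled(self) -> bool:
        # delegate + deprecation warning since a bunch of tests (and
        # user code) still read the old attr.
        warnings.warn(
            '`ActorNursery.cancelled` is deprecated, use '
            '`.cancel_called` and/or `.cancelled_caught` instead!',
            DeprecationWarning,
            stacklevel=2,
        )
        return self._cancel_called

    async def cancel(self) -> None:
        self._cancel_called = True
        # ... actual subactor cancellation elided ...
        self._cancelled_caught = True
```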
So I need to either adjust the tests or figure out if/why this is needed
to avoid the crashing in `pikerd` I found when killing the chart during
a long backfill with the `binance` backend..
Just like in the BRE case (for UDS), it seems that when a peer closes
the (UDS?) socket `trio` instead raises a `ClosedResourceError`, which
we now catch and re-raise as a `TransportClosed`. This again results in
`tpt.send()` calls from the rpc-runtime **not** raising when it's
already known that the IPC channel is disconnected.
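Sketch of that catch-and-translate in the transport's send path (not
the actual transport method; the exc-ctor details are assumed):

```python
import trio
from tractor._exceptions import TransportClosed


async def tpt_send(stream: trio.abc.SendStream, data: bytes) -> None:
    '''
    A peer-closed (UDS) socket now surfaces as a `TransportClosed`
    which the rpc-runtime treats as a "graceful" disconnect instead
    of a crash.

    '''
    try:
        await stream.send_all(data)
    except trio.ClosedResourceError as cre:
        raise TransportClosed(
            'peer closed the socket before/while sending',
        ) from cre
```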
Shift around comments and expressions for better reading, assign a
`tpt_closed` for easier introspection from the REPL during debug, oh and
fix `MsgpackTransport.pformat()` to render '|_peers: 1' .. XD
Similar to what was just changed for `Context.repr_state`, when the
child task is cancelled but by a different "layer" of the runtime (i.e.
a `Portal.cancel_actor()` / `SIGINT`-to-process canceller) we don't
dump a traceback and instead just emit a `log.cancel()`.
Such that if the local task hasn't resolved but is `trio.Cancelled` and
a `.canceller` was set, we report an `'actor-cancelled'` from
`.repr_state: str`. A bit of formatting to avoid needless newlines too!
Since it's merely a local-file-sys subdirectory and there should be no
reason its file creation conflicts with other bind spaces.
Also add 2 test suites to match,
- `tests/ipc/test_each_tpt::test_uds_bindspace_created_implicitly` to
  verify the dir gets created when it doesn't already exist.
- `..test_uds_double_listen_raises_connerr` to ensure a double bind
  raises a `ConnectionError` from the src `OSError` (see the raw-socket
  sketch below).
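The second suite's core assertion, sketched with raw `trio.socket`
calls (the real test of course goes through the runtime's UDS
transport; the re-raise here just mirrors what the transport layer now
does):

```python
import os
import socket
import tempfile

import pytest
import trio


async def bind_uds(path: str):
    sock = trio.socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    await sock.bind(path)
    sock.listen(1)
    return sock


def test_uds_double_listen_raises_connerr_sketch():
    async def main():
        path = os.path.join(tempfile.mkdtemp(), 'sock')
        first = await bind_uds(path)
        try:
            try:
                # a second bind on the same path -> EADDRINUSE
                await bind_uds(path)
            except OSError as oserr:
                # same re-raise-from-src-`OSError` as described above
                raise ConnectionError(path) from oserr
        finally:
            first.close()
            os.unlink(path)

    with pytest.raises(ConnectionError):
        trio.run(main)
```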
Use `.get_stream_addrs()` such that we always (*can*) assign the peer
end's PID in the `._raddr`.
Also factor the common `ConnectionError` re-raising into a
`_reraise_as_connerr()` @cm.
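Roughly (the helper's exact sig/params are paraphrased, only its name
and purpose come from this change):

```python
from contextlib import contextmanager as cm


@cm
def _reraise_as_connerr(src_excs: tuple[type[Exception], ...] = (OSError,)):
    '''
    Re-raise any of the given src excs as a `ConnectionError` chaining
    the original, so callers only need to handle one exc type for
    "couldn't connect/bind" conditions.

    '''
    try:
        yield
    except src_excs as src_err:
        raise ConnectionError(
            f'IPC connect/bind failed due to {src_err!r}'
        ) from src_err
```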
Such that the caller can be responsible for their own (nursery) scoping
as needed and, for the latter fn's case, with a
`trio.Nursery.CancelStatus.encloses()` check to ensure the `tn` is a
valid parent-ish scope.
Some details,
- in `gather_contexts()`, move the `try/finally` outside the nursery
  block to ensure we always do the `parent_exit`.
- for `maybe_open_context()` we do a naive task-tree hierarchy audit to
  ensure the provided scope is not *too* child-ish (with what APIs
  `trio` gives us, see above and the sketch below), otherwise go with
  the old approach of using the actor's private service nursery.
Also,
* better report `trio.Cancelled` around the cache-miss `yield` cases
  and ensure we **never** unmask the triggering key-errors.
* report on any stale mutex state in the `finally` block.
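A sketch of what that naive audit amounts to, using only public `trio`
task/nursery attrs (the real check pokes at the private
`CancelStatus.encloses()`, and both helper names plus the
service-nursery attr are assumptions here, not the actual impl):

```python
import trio
import tractor


def tn_encloses_current_task(tn: trio.Nursery) -> bool:
    # walk up the parent-nursery chain from the calling task and see
    # whether `tn` is an ancestor (i.e. a valid "parent-ish" scope).
    task = trio.lowlevel.current_task()
    while task is not None:
        nursery = task.parent_nursery
        if nursery is tn:
            return True
        task = nursery.parent_task if nursery else None

    return False


def choose_scope_tn(tn: trio.Nursery | None) -> trio.Nursery:
    # rough policy shape: accept a caller-provided `tn` only if it
    # encloses the calling task, otherwise fall back to the actor's
    # private service nursery (attr name assumed).
    if tn is not None:
        if not tn_encloses_current_task(tn):
            raise RuntimeError(
                'Provided `tn` does not enclose the calling task!'
            )
        return tn

    return tractor.current_actor()._service_n
```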
Call it `test_lock_not_corrupted_on_fast_cancel()` and include a
detailed doc string to explain. Implemented it "cleverly" by having the
target `@acm` cancel its parent nursery after a peer cache-hitting task
is already waiting on the task-mutex release.
Since the `await service_n.start()` on key-err can be cancel-masked
(its checkpoint interrupted before `_Cache.run_ctx` completes), we need
to always `lock.release()` in a `finally` to avoid lock-owner-state
corruption and/or inf-hangs in peer cache-hitting tasks.
Details,
- add a `try/except/finally` around the key-err-triggered cache-miss
  `service_n.start(_Cache.run_ctx, ..)` call, reporting on any taskc
  and always `finally` unlocking (roughly as sketched below).
- fill out some log msg content and use `.debug()` level.
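The shape of that guard, roughly (lock/`start()` plumbing simplified
and the wrapper fn is hypothetical; assumes the `lock` is already held
by the calling task):

```python
import trio
from tractor.log import get_logger

log = get_logger(__name__)


async def start_cache_miss_ctx(
    service_n: trio.Nursery,
    lock: trio.Lock,
    run_ctx,  # the `_Cache.run_ctx`-style target (sig elided)
    *args,
) -> None:
    try:
        await service_n.start(run_ctx, *args)
    except trio.Cancelled as taskc:
        # the `.start()` checkpoint can be interrupted before
        # `run_ctx` completes (cancel-masked), so report it..
        log.debug(f'Cache-miss `.start()` cancelled: {taskc!r}')
        raise
    finally:
        # ..and *always* release to avoid lock-owner-state corruption
        # and/or inf-hangs in peer cache-hitting tasks.
        lock.release()
```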
Here I was thinking the bcaster (usage) maybe required a rework but,
NOPE, it's just because a checkpoint was needed in the parent task
owning the `tn` which spawns `get_sub_and_pull()` tasks, to ensure the
bg-allocated `an`/portal is eventually cancel-called..
Ah well, at least I started a patch for `MsgStream.subscribe()` to make
it multicast-revertible.. XD
Anyway, I tossed in some checks & notes related to all that unnecessary
effort since I do think I'll move forward implementing it:
- for the `cache_hit` case always verify that the `bcast` clone is
unregistered from the common state subs after
`.subscribe().__aexit__()`.
- do a light check that the implicit `MsgStream._broadcaster` is always
  the only bcrx instance left leaked into that state.. that is, until
  I get the proper de-allocation/reversion from multicast -> unicast
  working.
- put in a mega detailed note about the required parent-task checkpoint
  (boiled down in the sketch below).
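That checkpoint note boils down to something like this in the parent
task (heavily trimmed from the actual suite; fn names are placeholders):

```python
import trio


async def parent_of_subscribers(get_sub_and_pull, n_tasks: int = 2) -> None:
    async with trio.open_nursery() as tn:
        for _ in range(n_tasks):
            tn.start_soon(get_sub_and_pull)

        # the required parent-task checkpoint: without it the
        # bg-allocated `an`/portal (behind `maybe_open_context()`)
        # may never get cancel-called on teardown.
        await trio.lowlevel.checkpoint()
```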
Since I recently discovered a very subtle race-case that can sometimes
cause the suite to hang, seemingly due to the `an: ActorNursery` being
allocated *behind* the `.trionics.maybe_open_context()` usage; this can
result in never cancelling the 'streamer' subactor despite the `main()`
timeout-guard?
This led me to dig in and find that the underlying issue was 2-fold,
- our `BroadcastReceiver` termination-mgmt semantics in
  `MsgStream.subscribe()` can result in the first subscribing task
  always keeping the `MsgStream._broadcaster` instance allocated; it's
  never `.aclose()`ed, which makes it tough to determine (and thus
  trace) when all subscriber-tasks are actually complete and
  exited-from-`.subscribe()`..
- I was shield-waiting on `.ipc._server.Server.wait_for_no_more_peers()`
  in `._runtime.async_main()`'s shutdown sequence, which would then
  compound the issue, resulting in a SIGINT-shielded hang.. the worst
  kind XD
Actual changes here are just styling, printing, and some mucking with
passing the `an`-ref up to the parent task in the root-actor where I was
doing a conditional `ActorNursery.cancel()` to make sure that was
actually the problem. Presuming this is fixed, the `.pause()` I left
unmasked should never hit.
As mentioned in the prior testing commit, it can cause the worst kind of
hangs, the SIGINT-ignoring kind.. Pretty sure there was never any reason
for it outside some esoteric multi-actor debugging case, and pretty sure
that was already solved?
Was failing due to the `.fail_after()` timeout being *too short* and
somehow the new interplay of that with strict exception groups resulting
in the `TooSlowError` never raising, but instead an eg with the embedded
`AssertionError`?? I still don't really get it honestly..
I've written up lengthy notes around the different `delay` settings that
can be used to see the different outcomes, the failing case being the
one I still don't really grok and think is justification for `trio` to
possibly bubble inner `Cancelled`s differently?
For now I've included the original failing case as an `xfail`
parametrization, which will hopefully drive a follow-up low-level `trio`
test in `test_trioisms`!
Namely `test_empty_mngrs_input_raises()` was failing due to
lazy-iterator use as input to `mngrs`, which I guess I added support for
a while back (by doing a `list(mngrs)` internally)? So just change it to
`gather_contexts(mngrs=())` and also tweak the `trio.fail_after(3)`
since it appears the prior 1s was causing a too-fast cancellation
(before the cluster fully spawned) and thus the expected `ValueError`
never showing (condensed sketch below)..
Also, mask the `tractor.trionics.collapse_eg()` usage (again?) in
`open_actor_cluster()` since it seems unnecessary.
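Condensed, the fixed suite looks something like this (the actor-cluster
spawning around it is elided, so treat it as a sketch not the verbatim
test):

```python
import pytest
import trio
from tractor.trionics import gather_contexts


def test_empty_mngrs_input_raises_sketch():
    async def main():
        # concrete empty tuple (not a lazy iterator) + a roomier
        # timeout so the `ValueError` can fire before any premature
        # cancellation.
        with trio.fail_after(3):
            async with gather_contexts(mngrs=()):
                pass

    with pytest.raises(ValueError):
        trio.run(main)
```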
Seems that the way the actor-nursery interacts with the
`.trionics.gather_contexts()` API on cancellation makes our
`.trionics.collapse_eg()` not work as intended?
I need to dig into how `ActorNursery.cancel()` and `.__aexit__()` might
be causing this discrepancy..
Consider this a commit-of-my-index type save for right now.