Originally discovered as while using `tractor.pause_from_sync()`
from the `i3ipc` client running in a bg-thread that uses `asyncio`
inside `modden`.
Turns out we definitely aren't correctly handling `.pause_from_sync()`
from the root actor when called from a `trio.to_thread.run_sync()`
bg thread:
- root-actor bg threads which can't `Lock._debug_lock.acquire()` since
they aren't in `trio.Task`s.
- even if scheduled via `.to_thread.run_sync(_debug._pause)` the
acquirer won't be the task/thread which calls `Lock.release()` from
`PdbREPL` hooks; this results in a RTE raised by `trio`..
- multiple threads will step on each other's stdio since cpython's GIL
seems to ctx switch threads on every input from the user to the REPL
loop..
Reproduce via reworking our example and test so that they catch and fail
for all edge cases:
- rework the `/examples/debugging/sync_bp.py` example to demonstrate the
above issues, namely the stdio clobbering in the REPL when multiple
threads and/or a subactor try to debug simultaneously.
|_ run one thread using a task nursery to ensure it runs conc with the
nursery's parent task.
|_ ensure the bg threads run conc a subactor usage of
`.pause_from_sync()`.
|_ gravely detail all the special cases inside a TODO comment.
|_ add some control flags to `sync_pause()` helper and don't use
`breakpoint()` by default.
- extend and adjust `test_debugger.test_pause_from_sync` to match (and
thus currently fail) by ensuring exclusive `PdbREPL` attachment when
the 2 bg root-actor threads are concurrently interacting alongside the
subactor:
|_ should only see one of the `_pause_msg` logs at a time for either
one of the threads or the subactor.
|_ ensure each attaches (in no particular order) before expecting the
script to exit.
Impl adjustments to `.devx._debug`:
- drop `Lock.repl`, no longer used.
- add `Lock._owned_by_root: bool` for the `.ctx_in_debug == None`
root-actor-task active case.
- always `log.exception()` for any `._debug_lock.release()` ownership
RTE emitted by `trio`, like we used to..
- add special `Lock.release()` log message for the stale lock but
`._owned_by_root == True` case; oh yeah and actually
`log.devx(message)`..
- rename `Lock.acquire()` -> `.acquire_for_ctx()` since it's only ever
used from subactor IPC usage; well that and for local root-task
usage we should prolly add a `.acquire_from_root_task()`?
- buncha `._pause()` impl improvements:
|_ type `._pause()`'s `debug_func` as a `partial` as well.
|_ offer `called_from_sync: bool` and `called_from_bg_thread: bool`
for the special case handling when called from `.pause_from_sync()`
|_ only set `DebugStatus.repl/repl_task` when `debug_func != None`
(OW ensure the `.repl_task` is not the current one).
|_ handle error logging even when `debug_func is None`..
|_ lotsa detailed commentary around root-actor-bg-thread special cases.
- when `._set_trace(hide_tb=False)` do `pdbp.set_trace(frame=currentframe())`
so the `._debug` internal frames are always included.
- by default always hide tracebacks for `.pause[_from_sync]()` internals.
- improve `.pause_from_sync()` to avoid root-bg-thread crashes:
|_ pass new `called_from_xxx_` flags and ensure `DebugStatus.repl_task`
is actually set to the `threading.current_thread()` when needed.
|_ manually call `Lock._debug_lock.acquire_nowait()` for the non-bg
thread case.
|_ TODO: still need to implement the bg-thread case using a bg
`trio.Task`-in-thread with an `trio.Event` set by thread REPL exit.
It's been a long time prepped and now finally implemented!
Offer a `shield: bool` argument from our async `._debug` APIs:
- `await tractor.pause(shield=True)`,
- `await tractor.post_mortem(shield=True)`
^-These-^ can now be used inside cancelled `trio.CancelScope`s,
something very handy when introspecting complex (distributed) system
tear/shut-downs particularly under remote error or (inter-peer)
cancellation conditions B)
Thanks to previous prepping in a prior attempt and various patches from
the rigorous rework of `.devx._debug` internals around typed msg specs,
there ain't much that was needed!
Impl deats
- obvi passthrough `shield` from the public API endpoints (was already
done from a prior attempt).
- put ad-hoc internal `with trio.CancelScope(shield=shield):` around all
checkpoints inside `._pause()` for both the root-process and subactor
case branches.
Add a fairly rigorous example, `examples/debugging/shielded_pause.py`
with a wrapping `pexpect` test, `test_debugger.test_shield_pause()` and
ensure it covers as many cases as i can think of offhand:
- multiple `.pause()` entries in a loop despite parent scope
cancellation in a subactor RPC task which itself spawns a sub-task.
- a `trio.Nursery.parent_task` which raises, is handled and
tries to enter and unshielded `.post_mortem()`, which of course
internally raises `Cancelled` in a `._pause()` checkpoint, so we catch
the `Cancelled` again and then debug the debugger's internal
cancellation with specific checks for the particular raising
checkpoint-LOC.
- do ^- the latter -^ for both subactor and root cases to ensure we
can debug `._pause()` itself when it tries to REPL engage from
a cancelled task scope Bo
Since turns out we didn't have a single example using that API Bo
The test granular-ly checks all use cases:
- `.post_mortem()` manual calls in both subactor and root.
- ensuring built-in RPC crash handling activates after each manual one
from ^.
- drafted some call-stack frame checking that i commented out for now
since we need to first do ANSI escape code removal due to the
colorization that `pdbp` does by default.
|_ added a TODO with SO link on `assert_before()`.
Also todo-staged a shielded-pause test to match with the already
existing-but-needs-refinement example B)
Mostly the result of the `RemoteActorError.pformat()` and our
new `_pause/crash_msg: str`s which include the `trio.Task.__repr__()`
in the `log.pdb()` message.
Obvi use the `in_prompt_msg()` to accomplish where not used prior.
ToDo later:
-[ ] still some outstanding questions on how detailed inceptions
should look, eg. in `test_multi_nested_subactors_error_through_nurseries()`
|_maybe we should be more pedantic at checking `.src_uid` vs.
`.relay_uid` fields?
-[ ] staged a placeholder test for verifying correct call-stack frame on
crash handler REPL entry.
-[ ] also need a test to verify that you can't pause from an already paused actor task
such as can happen if you try to step through runtime code that has
a recurrent entry to `._debug.pause()`.
Now supports use from any `trio` task, any sync thread started with
`trio.to_thread.run_sync()` AND also via `breakpoint()` builtin API!
The only bit missing now is support for `asyncio` tasks when in infected
mode.. Bo
`greenback` setup/API adjustments:
- move `._rpc.maybe_import_gb()` to -> `devx._debug` and factor out the cached
import checking into a sync func whilst placing the async `.ensure_portal()`
bootstrapping into a new async `maybe_init_greenback()`.
- use the new init-er func inside `open_root_actor()` with the output
predicating whether we override the `breakpoint()` hook.
core `devx._debug` implementation deatz:
- make `mk_mpdb()` only return the `pdp.Pdb` subtype instance since
the sigint unshielding func is now accessible from the `Lock`
singleton from anywhere.
- add non-main thread support (at least for `trio.to_thread` use cases)
to our `Lock` with a new `.is_trio_thread()` predicate that delegates
directly to `trio`'s internal version.
- do `Lock.is_trio_thread()` checks inside any methods which require
special provisions when invoked from a non-main `trio` thread:
- `.[un]shield_sigint()` methods since `signal.signal` usage is only
allowed from cpython's main thread.
- `.release()` since `trio.StrictFIFOLock` can only be called from
a `trio` task.
- rework `.pause_from_sync()` itself to directly call `._set_trace()`
and don't bother with `greenback._await()` when we're already calling
it from a `.to_thread.run_sync()` thread, oh and try to use the
thread/task name when setting `Lock.local_task_in_debug`.
- make it an RTE for now if you try to use `.pause_from_sync()` from any
infected-`asyncio` task, but support is (hopefully) coming soon!
For testing we add a new `test_debugger.py::test_pause_from_sync()`
which includes a ctrl-c parametrization around the
`examples/debugging/sync_bp.py` script which includes all currently
supported/working usages:
- `tractor.pause_from_sync()`.
- via `breakpoint()` overload.
- from a `trio.to_thread.run_sync()` spawn.
Since importing from our top level `conftest.py` is not scaleable
or as "future forward thinking" in terms of:
- LoC-wise (it's only one file),
- prevents "external" (aka non-test) example scripts from importing
content easily,
- seemingly(?) can't be used via abs-import if using
a `[tool.pytest.ini_options]` in a `pyproject.toml` vs.
a `pytest.ini`, see:
https://docs.pytest.org/en/8.0.x/reference/customize.html#pyproject-toml)
=> Go back to having an internal "testing" pkg like `trio` (kinda) does.
Deats:
- move generic top level helpers into pkg-mod including the new
`expect_ctxc()` (which i needed in the advanced faults testing script.
- move `@tractor_test` into `._testing.pytest` sub-mod.
- adjust all the helper imports to be a `from tractor._testing import <..>`
Rework `test_ipc_channel_break_during_stream()` and backing script:
- make test(s) pull `debug_mode` from new fixture (which is now
controlled manually from `--tpdb` flag) and drop the previous
parametrized input.
- update logic in ^ test for "which-side-fails" cases to better match
recently updated/stricter cancel/failure semantics in terms of
`ClosedResouruceError` vs. `EndOfChannel` expectations.
- handle `ExceptionGroup`s with expected embedded errors in test.
- better pendantics around whether to expect a user simulated KBI.
- for `examples/advanced_faults/ipc_failure_during_stream.py` script:
- generalize ipc breakage in new `break_ipc()` with support for diff
internal `trio` methods and a #TODO for future disti frameworks
- only make one sub-actor task break and the other just stream.
- use new `._testing.expect_ctxc()` around ctx block.
- add a bit of exception handling with `print()`s around ctxc (unused
except if 'msg' break method is set) and eoc cases.
- don't break parent side ipc in loop any more then once
after first break, checked via flag var.
- add a `pre_close: bool` flag to control whether
`MsgStreama.aclose()` is called *before* any ipc breakage method.
Still TODO:
- drop `pytest.ini` and add the alt section to `pyproject.py`.
-> currently can't get `--rootdir=` opt to work.. not showing in
console header.
-> ^ also breaks on 'tests' `enable_modules` imports in subactors
during discovery tests?
Since this was changed as part of overall project wide logging format
updates, and i ended up changing the both the crash and pause `.pdb()`
msgs to include some multi-line-ascii-"stuff", might as well make the
pre-prompt checks in the test suite more flexible to match.
As such, this exposes 2 new constants inside the `.devx._debug` mod:
- `._pause_msg: str` for the pre `tractor.pause()` header emitted via
`log.pdb()` and,
- `._crash_msg: str` for the pre `._post_mortem()` equiv when handling
errors in debug mode.
Adjust the test suite to use these values and thus make us more capable
to absorb changes in the future as well:
- add a new `in_prompt_msg()` predicate, very similar to `assert_before()`
but minus `assert`s which takes in a `parts: list[str]` to match
in the pre-prompt stdout.
- delegate to `in_prompt_msg()` in `assert_before()` since it was mostly
duplicate minus `assert`.
- adjust all previous `<patt> in before` asserts to instead use
`in_prompt_msg()` with separated pre-prompt-header vs. actor-name
`parts`.
- use new `._pause/crash_msg` values in all such calls including any
`assert_before()` cases.
After many tries I just don't think it's worth it to make the tests work
since the repl UX in `pdbpp` is so unreliable in the latest release and
honestly we're trying to go 3.10+ ASAP.
Further,
- entirely drop the pattern matching inside the `do_ctlc()` for now.
- add a `subactor_error` parametrization that catches a case that
previously caused a hang (when you use 'next' immediately after the
first crash/debug lock (the fix was pushed just before this commit).
If we make it too fast a nursery with debug mode children can cancel
too fast and causes some test failures. It's likely not a huge deal
anyway since the purpose of this poll/check is for human interaction
and the current delay isn't really that noticeable.
Decrease log levels in the debug module to avoid console noise when in
use. Toss in some more detailed comments around the new debugger lock
points.
It turns out recent improvements have made the debugger too good
so we need to just terminate the continue loop in this test when
we finally see the "spawn error" crash out because the breakpoint
forever case will literally, continue forever XD
With the new fixes to the trio spawner we can expect that both root
*and* depth > 1 nursery owning actors will now not clobber any children
that are in debug (either via breakpoint or through crashing). The tests
changed now include more checks which ensure the 2nd level parent-ish
actors also bubble up through into `pdb` and don't kill any of their
(crashed) children before they're done themselves debugging.