It was expecting `AssertionError` as a proceed-in-test signal (by
breaking from a continue loop), but `in_prompt_msg(raise_on_err=True)`
was changed to raise `ValueError`; so instead just use as a predicate
for the `break`.
Also rework `in_prompt_msg()` to accept the `child: BaseSpawn` as input
instead of `before: str` remove the casting boilerplate, and adjust all
usage to match.
By re-purposing our `pexpect`-based console matching with a new
`debugging/shield_hang_in_sub.py` example, this tests a few "hanging
actor" conditions more formally:
- that despite a hanging actor's task we can dump
a `stackscope.extract()` tree on relay of `SIGUSR1`.
- the actor tree will terminate despite a shielded forever-sleep by our
"T-800" zombie reaper machinery activating and hard killing the
underlying subprocess.
Some test deats:
- simulates the expect actions of a real user by manually using
`os.kill()` to send both signals to the actor-tree program.
- `pexpect`-matches against `log.devx()` emissions under normal
`debug_mode == True` usage.
- ensure we get the actual "T-800 deployed" `log.error()` msg and
that the actor tree eventually terminates!
Surrounding (re-org/impl/test-suite) changes:
- allow disabling usage via a `maybe_enable_greenback: bool` to
`open_root_actor()` but enable by def.
- pretty up the actual `.devx()` content from `.devx._stackscope`
including be extra pedantic about the conc-primitives for each signal
event.
- try to avoid double handles of `SIGUSR1` even though it seems the
original (what i thought was a) problem was actually just double
logging in the handler..
|_ avoid double applying the handler func via `signal.signal()`,
|_ use a global to avoid double handle func calls and,
|_ a `threading.RLock` around handling.
- move common fixtures and helper routines from `test_debugger` to
`tests/devx/conftest.py` and import them for use in both test mods.
The final issue was making sure we do the same thing on ctl-c/SIGINT
from the user. That is, if there's already a bg-thread in REPL, we
`log.pdb()` about SIGINT shielding and re-draw the prompt; the same UX
as normal actor-runtime-task behaviour.
Reasons this wasn't workin.. and the fix:
- `.pause_from_sync()` was overriding the local `repl` var with `None`
delivered by (transitive) calls to `_pause(debug_func=None)`.. so
remove all that and only assign it OAOO prior to thread-type case
branching.
- always call `DebugStatus.shield_sigint()` as needed from all requesting
threads/tasks:
- in `_pause_from_bg_root_thread()` BEFORE calling `._pause()` AND BEFORE
yielding back to the bg-thread via `.started(out)` to ensure we're
definitely overriding the handler in the `trio`-main-thread task
before unblocking the requesting bg-thread.
- from any requesting bg-thread in the root actor such that both its
main-`trio`-thread scheduled task (as per above bullet) AND it are
SIGINT shielded.
- always call `.shield_sigint()` BEFORE any `greenback._await()` case
don't entirely grok why yet, but it works)?
- for `greenback._await()` case always set `bg_task` to the current one..
- tweaks to the `SIGINT` handler, now renamed `sigint_shield()` so as
not to name-collide with the methods when editor-searching:
- always try to `repr()` the REPL thread/task "owner" as well as the
active `PdbREPL` instance.
- add `.devx()` notes around the prompt flushing deats and comments
for any root-actor-bg-thread edge cases.
Related/supporting refinements:
- add `get_lock()`/`get_debug_req()` factory funcs since the plan is to
eventually implement both as `@singleton` instances per actor.
- fix `acquire_debug_lock()`'s call-sig-bug for scheduling
`request_root_stdio_lock()`..
- in `._pause()` only call `mk_pdb()` when `debug_func != None`.
- add some todo/warning notes around the `cls.repl = None` in
`DebugStatus.release()`
`test_pause_from_sync()` tweaks:
- don't use a `attach_patts.copy()`, since we always `break` on match.
- do `pytest.fail()` on that ^ loop's fallthrough..
- pass `do_ctlc(child, patt=attach_key)` such that we always match the
the current thread's name with the ctl-c triggered `.pdb()` emission.
- oh yeah, return the last `before: str` from `do_ctlc()`.
- in the script, flip `abandon_on_cancel=True` since when `False` it
seems to cause `trio.run()` to hang on exit from the last bg-thread
case?!?
There's a been a todo for soo long for this XD
Since all `Actor`'s store a set of `._peers` we can try a lookup on that
table as a shortcut before pinging the registry Bo
Impl deats:
- add a new `._discovery.get_peer_by_name()` routine which attempts the
`._peers` lookup by combining a copy of that `dict` + an entry added
for `Actor._parent_chan` (since all subs have a parent and often the
desired contact is just that connection).
- change `.find_actor()` (for the `only_first == True` case),
`.query_actor()` and `.wait_for_actor()` to call the new helper and
deliver appropriate outputs if possible.
Other,
- deprecate `get_arbiter()` def and all usage in tests and examples.
- drop lingering use of `arbiter_sockaddr` arg to various routines.
- tweak the `Actor` doc str as well as some code fmting and a tweak to
the `._stream_handler()`'s initial `con_status: str` logging value
since the way it was could never be reached.. oh and `.warning()` on
any new connections which already have a `_pre_chan: Channel` entry in
`._peers` so we can start minimizing IPC duplications.
Originally discovered as while using `tractor.pause_from_sync()`
from the `i3ipc` client running in a bg-thread that uses `asyncio`
inside `modden`.
Turns out we definitely aren't correctly handling `.pause_from_sync()`
from the root actor when called from a `trio.to_thread.run_sync()`
bg thread:
- root-actor bg threads which can't `Lock._debug_lock.acquire()` since
they aren't in `trio.Task`s.
- even if scheduled via `.to_thread.run_sync(_debug._pause)` the
acquirer won't be the task/thread which calls `Lock.release()` from
`PdbREPL` hooks; this results in a RTE raised by `trio`..
- multiple threads will step on each other's stdio since cpython's GIL
seems to ctx switch threads on every input from the user to the REPL
loop..
Reproduce via reworking our example and test so that they catch and fail
for all edge cases:
- rework the `/examples/debugging/sync_bp.py` example to demonstrate the
above issues, namely the stdio clobbering in the REPL when multiple
threads and/or a subactor try to debug simultaneously.
|_ run one thread using a task nursery to ensure it runs conc with the
nursery's parent task.
|_ ensure the bg threads run conc a subactor usage of
`.pause_from_sync()`.
|_ gravely detail all the special cases inside a TODO comment.
|_ add some control flags to `sync_pause()` helper and don't use
`breakpoint()` by default.
- extend and adjust `test_debugger.test_pause_from_sync` to match (and
thus currently fail) by ensuring exclusive `PdbREPL` attachment when
the 2 bg root-actor threads are concurrently interacting alongside the
subactor:
|_ should only see one of the `_pause_msg` logs at a time for either
one of the threads or the subactor.
|_ ensure each attaches (in no particular order) before expecting the
script to exit.
Impl adjustments to `.devx._debug`:
- drop `Lock.repl`, no longer used.
- add `Lock._owned_by_root: bool` for the `.ctx_in_debug == None`
root-actor-task active case.
- always `log.exception()` for any `._debug_lock.release()` ownership
RTE emitted by `trio`, like we used to..
- add special `Lock.release()` log message for the stale lock but
`._owned_by_root == True` case; oh yeah and actually
`log.devx(message)`..
- rename `Lock.acquire()` -> `.acquire_for_ctx()` since it's only ever
used from subactor IPC usage; well that and for local root-task
usage we should prolly add a `.acquire_from_root_task()`?
- buncha `._pause()` impl improvements:
|_ type `._pause()`'s `debug_func` as a `partial` as well.
|_ offer `called_from_sync: bool` and `called_from_bg_thread: bool`
for the special case handling when called from `.pause_from_sync()`
|_ only set `DebugStatus.repl/repl_task` when `debug_func != None`
(OW ensure the `.repl_task` is not the current one).
|_ handle error logging even when `debug_func is None`..
|_ lotsa detailed commentary around root-actor-bg-thread special cases.
- when `._set_trace(hide_tb=False)` do `pdbp.set_trace(frame=currentframe())`
so the `._debug` internal frames are always included.
- by default always hide tracebacks for `.pause[_from_sync]()` internals.
- improve `.pause_from_sync()` to avoid root-bg-thread crashes:
|_ pass new `called_from_xxx_` flags and ensure `DebugStatus.repl_task`
is actually set to the `threading.current_thread()` when needed.
|_ manually call `Lock._debug_lock.acquire_nowait()` for the non-bg
thread case.
|_ TODO: still need to implement the bg-thread case using a bg
`trio.Task`-in-thread with an `trio.Event` set by thread REPL exit.
It's been a long time prepped and now finally implemented!
Offer a `shield: bool` argument from our async `._debug` APIs:
- `await tractor.pause(shield=True)`,
- `await tractor.post_mortem(shield=True)`
^-These-^ can now be used inside cancelled `trio.CancelScope`s,
something very handy when introspecting complex (distributed) system
tear/shut-downs particularly under remote error or (inter-peer)
cancellation conditions B)
Thanks to previous prepping in a prior attempt and various patches from
the rigorous rework of `.devx._debug` internals around typed msg specs,
there ain't much that was needed!
Impl deats
- obvi passthrough `shield` from the public API endpoints (was already
done from a prior attempt).
- put ad-hoc internal `with trio.CancelScope(shield=shield):` around all
checkpoints inside `._pause()` for both the root-process and subactor
case branches.
Add a fairly rigorous example, `examples/debugging/shielded_pause.py`
with a wrapping `pexpect` test, `test_debugger.test_shield_pause()` and
ensure it covers as many cases as i can think of offhand:
- multiple `.pause()` entries in a loop despite parent scope
cancellation in a subactor RPC task which itself spawns a sub-task.
- a `trio.Nursery.parent_task` which raises, is handled and
tries to enter and unshielded `.post_mortem()`, which of course
internally raises `Cancelled` in a `._pause()` checkpoint, so we catch
the `Cancelled` again and then debug the debugger's internal
cancellation with specific checks for the particular raising
checkpoint-LOC.
- do ^- the latter -^ for both subactor and root cases to ensure we
can debug `._pause()` itself when it tries to REPL engage from
a cancelled task scope Bo
Since turns out we didn't have a single example using that API Bo
The test granular-ly checks all use cases:
- `.post_mortem()` manual calls in both subactor and root.
- ensuring built-in RPC crash handling activates after each manual one
from ^.
- drafted some call-stack frame checking that i commented out for now
since we need to first do ANSI escape code removal due to the
colorization that `pdbp` does by default.
|_ added a TODO with SO link on `assert_before()`.
Also todo-staged a shielded-pause test to match with the already
existing-but-needs-refinement example B)
Since I needed the `break_ipc()` helper from the
`examples/advanced_faults/ipc_failure_during_stream.py` used in the
`test_advanced_faults` suite, might as well move it into a pkg-wide
importable module. Also changed the default break method to be
`socket_close` which just calls `Stream.socket.close()` underneath in
`trio`.
Also tweak that example to not keep sending after the stream has been
broken since with new `trio` that will raise `ClosedResourceError` and
in the wrapping test we generally speaking want to see a hang and then
cancel via simulated user sent SIGINT/ctl-c.
It's **almost** there, we're just missing the final translation code to
get from an `asyncio` side task to be able to call
`.devx._debug..wait_for_parent_stdin_hijack()` to do root actor TTY
locking. Then we just need to ensure internals also do the right thing
with `greenback()` for equivalent sync `breakpoint()` style pause
points.
Since i'm deferring this until later, tossing in some xfail tests to
`test_infected_asyncio` with TODOs for the needed implementation as well
as eventual test org.
By "provision" it means we add:
- `greenback` init block to `_run_asyncio_task()` when debug mode is
enabled (but which will currently rte when `asyncio` is detected)
using `.bestow_portal()` around the `asyncio.Task`.
- a call to `_debug.maybe_init_greenback()` in the `run_as_asyncio_guest()`
guest-mode entry point.
- as part of `._debug.Lock.is_main_trio_thread()` whenever the async-lib
is not 'trio' error lock the backend name (which is obvi `'asyncio'`
in this use case).
Now supports use from any `trio` task, any sync thread started with
`trio.to_thread.run_sync()` AND also via `breakpoint()` builtin API!
The only bit missing now is support for `asyncio` tasks when in infected
mode.. Bo
`greenback` setup/API adjustments:
- move `._rpc.maybe_import_gb()` to -> `devx._debug` and factor out the cached
import checking into a sync func whilst placing the async `.ensure_portal()`
bootstrapping into a new async `maybe_init_greenback()`.
- use the new init-er func inside `open_root_actor()` with the output
predicating whether we override the `breakpoint()` hook.
core `devx._debug` implementation deatz:
- make `mk_mpdb()` only return the `pdp.Pdb` subtype instance since
the sigint unshielding func is now accessible from the `Lock`
singleton from anywhere.
- add non-main thread support (at least for `trio.to_thread` use cases)
to our `Lock` with a new `.is_trio_thread()` predicate that delegates
directly to `trio`'s internal version.
- do `Lock.is_trio_thread()` checks inside any methods which require
special provisions when invoked from a non-main `trio` thread:
- `.[un]shield_sigint()` methods since `signal.signal` usage is only
allowed from cpython's main thread.
- `.release()` since `trio.StrictFIFOLock` can only be called from
a `trio` task.
- rework `.pause_from_sync()` itself to directly call `._set_trace()`
and don't bother with `greenback._await()` when we're already calling
it from a `.to_thread.run_sync()` thread, oh and try to use the
thread/task name when setting `Lock.local_task_in_debug`.
- make it an RTE for now if you try to use `.pause_from_sync()` from any
infected-`asyncio` task, but support is (hopefully) coming soon!
For testing we add a new `test_debugger.py::test_pause_from_sync()`
which includes a ctrl-c parametrization around the
`examples/debugging/sync_bp.py` script which includes all currently
supported/working usages:
- `tractor.pause_from_sync()`.
- via `breakpoint()` overload.
- from a `trio.to_thread.run_sync()` spawn.
More or less just simplifies to not seeing the stream closure errors and
instead expecting KBIs from the simulated user who 'ctl-cs after hang'.
Toss in a little `stuff_hangin_ctlc()` to the script to wrap all that
and always check stream closure before sending the final KBI.
- `trio_typing` is nearly obsolete since `trio >= 0.23`
- `exceptiongroup` is built-in to python 3.11
- `async_generator` primitives have lived in `contextlib` for quite
a while!
Since importing from our top level `conftest.py` is not scaleable
or as "future forward thinking" in terms of:
- LoC-wise (it's only one file),
- prevents "external" (aka non-test) example scripts from importing
content easily,
- seemingly(?) can't be used via abs-import if using
a `[tool.pytest.ini_options]` in a `pyproject.toml` vs.
a `pytest.ini`, see:
https://docs.pytest.org/en/8.0.x/reference/customize.html#pyproject-toml)
=> Go back to having an internal "testing" pkg like `trio` (kinda) does.
Deats:
- move generic top level helpers into pkg-mod including the new
`expect_ctxc()` (which i needed in the advanced faults testing script.
- move `@tractor_test` into `._testing.pytest` sub-mod.
- adjust all the helper imports to be a `from tractor._testing import <..>`
Rework `test_ipc_channel_break_during_stream()` and backing script:
- make test(s) pull `debug_mode` from new fixture (which is now
controlled manually from `--tpdb` flag) and drop the previous
parametrized input.
- update logic in ^ test for "which-side-fails" cases to better match
recently updated/stricter cancel/failure semantics in terms of
`ClosedResouruceError` vs. `EndOfChannel` expectations.
- handle `ExceptionGroup`s with expected embedded errors in test.
- better pendantics around whether to expect a user simulated KBI.
- for `examples/advanced_faults/ipc_failure_during_stream.py` script:
- generalize ipc breakage in new `break_ipc()` with support for diff
internal `trio` methods and a #TODO for future disti frameworks
- only make one sub-actor task break and the other just stream.
- use new `._testing.expect_ctxc()` around ctx block.
- add a bit of exception handling with `print()`s around ctxc (unused
except if 'msg' break method is set) and eoc cases.
- don't break parent side ipc in loop any more then once
after first break, checked via flag var.
- add a `pre_close: bool` flag to control whether
`MsgStreama.aclose()` is called *before* any ipc breakage method.
Still TODO:
- drop `pytest.ini` and add the alt section to `pyproject.py`.
-> currently can't get `--rootdir=` opt to work.. not showing in
console header.
-> ^ also breaks on 'tests' `enable_modules` imports in subactors
during discovery tests?
Only found this by luck more or less (while working on something in
a client project) and it turns out we can actually get to (yet another)
hang state where SIGINT will be ignored by the root actor on teardown..
I've added all the necessary logic flags to reproduce. We obviously need
a follow up bug issue and a test suite to replicate!
It appears as though the following are required based on very light
tinkering:
- infected asyncio mode active
- debug mode active
- the `trio` context must breakpoint *before* `.started()`-ing
- the `asyncio` must **not** error
Previously we were leaking our (pdb++) override into the Python runtime
which would always result in a runtime error whenever `breakpoint()` is
called outside our runtime; after exit of the root actor . This
explicitly restores any previous hook override (detected during startup)
or deletes the hook and restores the environment if none existed prior.
Also adds a new WIP debugging example script to ensure breakpointing
works as normal after runtime close; this will be added to the test
suite.
With the new fancy `_pytest.pathlib.import_path()` we can do real
parametrization of the example-script-module code and thus configure
whether the child, parent, or both silently break the IPC connection.
Parametrize the test for all the above mentioned cases as well as the
case where the IPC never breaks but we still simulate the user hammering
ctl-c / SIGINT to terminate the actor tree. Adjust expected errors based
on each case and heavily document each of these.
Use a task nursery in the subactor to spawn tasks which cancel the IPC
channel mid stream to simulate the most concurrent case we're likely to
see. Make `main()` accept a `debug_mode: bool` for parametrization. Fill
out detailed comments/docs on this example.
Finally! I think this may be the root issue we've been seeing in
production in a client project.
No idea yet why this is happening but the fault-causing sequence seems
to be:
- `.open_context()` in a child actor
- enter the debugger via `tractor.breakpoint()`
- continue from that entry via `c` command in REPL
- raise an error just after inside the context task's body
Looking at logging it appears as though the child thinks it has the tty
but no input is accepted on the REPL and a further `ctrl-c` results in
some teardown but also a further hang where both parent and child become
unresponsive..