The method now returns a `bool` which flags whether the transport died
to the caller and allows for reporting a disconnect in the
channel-transport handler task. This is something a user will normally
want to know about on the caller side especially after seeing
a traceback from the peer (if in tree) on console.
There's no point in sending a cancel message to the remote linked task
and especially no reason to block waiting on a result from that task if
the transport layer is detected to be disconnected. We expect that the
transport shouldn't go down at the layer of the message loop
(reconnection logic should be handled in the transport layer itself) so
if we detect the channel is not connected we don't bother requesting
cancels nor waiting on a final result message.
Why?
- if the connection goes down in error the caller side won't have a way
to know "how long" it should block to wait for a cancel ack or result
and causes a potential hang that may require an additional ctrl-c from
the user especially if using the debugger or if the traceback is not
seen on console.
- obviously there's no point in waiting for messages when there's no
transport to deliver them XD
Further, add some more detailed cancel logging detailing the task and
actor ids.
There's a bug that's triggered in the stdlib without latest `pdb++`
installed; add a note for that.
Further inside `wait_for_parent_stdin_hijack()` don't `.started()` until
the interactor stream has been opened to avoid races when debugging this
`._debug.py` module (at the least) since we usually don't want the
spawning (parent) task to resume until we know for sure the tty lock has
been acquired. Also, drop the random checkpoint we had inside
`_breakpoint()`, not sure it was actually adding anything useful since
we're (mostly) carefully shielded throughout this func.
Finally! I think this may be the root issue we've been seeing in
production in a client project.
No idea yet why this is happening but the fault-causing sequence seems
to be:
- `.open_context()` in a child actor
- enter the debugger via `tractor.breakpoint()`
- continue from that entry via `c` command in REPL
- raise an error just after inside the context task's body
Looking at logging it appears as though the child thinks it has the tty
but no input is accepted on the REPL and a further `ctrl-c` results in
some teardown but also a further hang where both parent and child become
unresponsive..
None of it worked (you still will see `.__exit__()` frames on debugger
entry - you'd think this would have been solved by now but, shrug) so
instead wrap the debugger entry-point in a `try:` and put the SIGINT
handler restoration inside `MultiActorPdb` teardown hooks.
This seems to restore the UX as it was prior but with also giving the
desired SIGINT override handler behaviour.
Using either of `@pdb.hideframe` or `__tracebackhide__` on stdlib
methods doesn't seem to work either.. This all seems to have something
to do with async generator usage I think ?
This gets very close to avoiding any possible hangs to do with tty
locking and SIGINT handling minus a special case that will be detailed
below.
Summary of implementation changes:
- convert `_mk_pdb()` -> `with _open_pdb() as pdb:` which implicitly
handles the `bdb.BdbQuit` case such that debugger teardown hooks are
always called.
- rename the handler to `shield_sigint()` and handle a variety of new
cases:
* the root is in debug but hasn't been cancelled -> call
`Actor.cancel_soon()`
* the root is in debug but *has* been called (`Actor.cancel_soon()`
already called) -> raise KBI
* a child is in debug *and* has a task locking the debugger -> ignore
SIGINT in child *and* the root actor.
- if the debugger instance is provided to the handler at acquire time,
on SIGINT handling completion re-print the last pdb++ REPL output so
that the user realizes they are still actively in debug.
- ignore the unlock case where a race condition of "no task" holding the
lock causes the `RuntimeError` normally associated with the "wrong
task" doing so (not sure if this is a `trio` bug?).
- change debug logs to runtime level.
Unhandled case(s):
- a child is maybe in debug mode but does not itself have any task using
the debugger.
* ToDo: we need a way to decide what to do with
"intermediate" child actors who themselves either are not in
`debug_mode=True` but have children who *are* such that a SIGINT
won't cause cancellation of that child-as-parent-of-another-child
**iff** any of their children are in in debug mode.
Python 3.9's new object resolver + a `str` is much simpler then mucking
with tuples (and easier to serialize). Include a `.to_tuple()` formatter
since we still are passing the module namespace and function name
separately inside the runtime's message format but in theory we might be
able to simplify this depending on how we would change the support for
`enable_modules:list[str]` in the spawn API.
Thanks to @Fuyukai for pointing `resolve_name()` which I didn't know
about before!
Adjust the `soft_wait()` strategy to avoid sending needless cancel
requests if it is known that a child process is already terminated or
does so before the cancel request times out. This should be no slower
and should avoid needless waits on either closure-in-progress or already
closed channels.
Basic strategy is,
- request child actor to cancel
- if process termination is detected, cancel the cancel
- if the process is still alive after a cancel request timeout warn the
user and yield back to the hard reap handling