A context stream overrun should normally never take place: if
a stream is opened (via ``Context.open_stream()``) backpressure is
applied to the message buffer (unless explicitly disabled via the
``backpressure=False`` flag) such that an overrun on the receiving task
results in blocking the (remote) sender task (eventually, depending
on the underlying ``MsgStream`` transport).
Here we add a special error message that reports if one side never
opened a stream and lets the user know in the overrun error message
that they may be trying to push messages to a task that isn't ready to
receive them.
Further fixes / details:
- pop any `Context` at the end of any `_invoke()` task that creates
one and registers with the runtime.
- ignore but warn about messages received for a context that either
no longer exists or is unknown (guarding against crashes by malicious
packets in the latter case)
Keeping backpressure disabled on context open helps with detecting any
stream connection which was never opened on one side of the task pair.
In that case we can report that there was an overrun **and** a stream
wasn't opened, versus the case where the stream is explicitly configured
not to use backpressure, where we just raise the standard overrun error.
Use `trio.Nursery._closed` to detect "closure" XD since it seems to be
the most reliable way to determine if a spawn call will trigger
a runtime error.
Half of portal API usage requires a single message response (`.run()`,
`.run_in_actor()`) and the streaming APIs should probably be explicitly
enabled for backpressure if desired by the user. This makes more sense
in (pseudo) realtime systems where it's better to notify on a block than
to freeze without notice. Make this the default behaviour with a new
error to be raised: `tractor._exceptions.StreamOverrun` when a sender
overruns a stream by the default buffer size (2**6 for now). The old
behavior can be enabled with `Context.open_stream(backpressure=True)`
but now with warning log messages when there are overruns.
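
Consumer-side opt-in then looks something like this sketch (the
consumer func is hypothetical; only the `backpressure` flag and the
`StreamOverrun` name come from the changes above):

    import trio
    import tractor

    async def slow_consumer(ctx: tractor.Context) -> None:
        # opt back in to the old block-the-sender behaviour; without
        # this flag a sender overrunning the default buffer (2**6
        # msgs) gets a relayed `tractor._exceptions.StreamOverrun`.
        async with ctx.open_stream(backpressure=True) as stream:
            async for msg in stream:
                await trio.sleep(1)  # deliberately slow consumer
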
Add task-linked-context error propagation using a "nursery raising"
technique such that if either end of context linked pair of tasks
errors, that error can be relayed to other side and raised as a form of
interrupt at the receiving task's next `trio` checkpoint. This enables
reliable error relay without expecting the (error) receiving task to
call an API which would raise the remote exception (which it might never
currently if using `tractor.MsgStream` APIs).
Further internal implementation details:
- define the default msg buffer size as `Actor.msg_buffer_size`
- expose a `msg_buffer_size: int` kwarg from `Actor.get_context()`
- maybe raise aforementioned context errors using
`Context._maybe_error_from_remote_msg()` inside `Actor._push_result()`
- support optional backpressure on a stream when pushing messages
in `Actor._push_result()`
- in `_invoke()` handle multierrors raised from a `@tractor.context`
entrypoint as being potentially caused by a relayed error from the
remote caller task; if `Context._error` has been set then raise that
error inside the `RemoteActorError` that will be relayed back to that
caller, more or less proxying the source side error back to its
origin.
In preparation for supporting both backpressure detection (through an
optional error) as well as control over the msg channel buffer size, add
internal configuration flags for both to contexts. Also adjust
`Context._err_on_from_remote_msg()` -> `._maybe..` such that it can be
called and will only raise if a scope nursery has been set. Add
a `Context._error` for stashing the remote task's error that may be
delivered in an `'error'` message.
This more formally declares the runtime's remote task starting API
and uses it throughout all the dependent `Portal` API methods.
Allows dropping `Portal._submit()` and simplifying `.run_in_actor()`
style result waiting to be delegated to the context APIs at remote
task `return` response time. We now also track the remote entrypoint
"type" as `Context._remote_func_type`.
Instead of tracking feeder mem chans per RPC dialog, store `Context`
instances which (now) hold refs to the underlying RPC-task feeder chans
and track them inside an `Actor._contexts` map. This begins a transition
to making the "context" idea the primitive abstraction for representing
messaging dialogs between tasks in different memory domains (i.e.
usually separate processes).
A slew of changes made this possible:
- change `Actor.get_memchans()` -> `.get_context()`.
- Add new `Context._send_chan` and `._recv_chan` vars.
- implicitly create a new context on every `Actor.send_cmd()` call.
- use the context created by `.send_cmd()` in `Portal.open_context()`
instead of manually creating one.
- call `Actor.get_context()` inside tasks run from `._invoke()`
such that feeder chans are implicitly created for callee tasks
thus fixing bug #265.
NB: We might change some of the internal semantics to do with *when* the
feeder chans are actually created to denote whether or not a far end
task is actually *ready to receive* messages. For example, in the cases
where it **never** will be ready to receive messages (one-way streaming,
a context that never opens a stream, etc.) we will likely want some kind
of error or at least warning to the caller that messages can't be sent
(yet).
Previously we were ignoring a race where the callee task in an opened
context could enter `Context.open_stream()` before calling `.started()`.
Disallow this as well as calling `.started()` more than once.
We don't need to any more, presuming you get ideal remote cancellation
conditions where the remote actor should tear down and kill the streams
from its end.
On msg loop termination we now check and see if a channel is associated
with a child-actor registered in some local task's nursery. If so, we
attempt to wait on channel closure initiated from the child side (by
draining the underlying msg stream) so as to avoid closing it too early
resulting in the child not relaying its termination status response. This
means we now support the ideal case of the 2-generals problem where we get back the
ack to the closure request instead of just ignoring it and timing out XD
The main implementation detail is that when `Portal.cancel_actor()`
remotely calls `Actor.cancel()` we actually wait for the RPC response
from that request before allowing the channel shutdown sequence to
engage. The new msg stream draining support enables this.
Also, factor child-to-parent error propagation logic into a helper func
and improve some docs (yeah yeah y'all don't like the ''', i don't
care - it makes my eyes not hurt).
Use a `trio.Event` to enable nursery closure detection such that core
runtime tasks can be notified when a local nursery exits and allow
shutdown protocols to operate without close-before-terminate issues
(such as IPC channel closure during remote peer cancellation).
Enables "draining" the last set of messages after a channel/stream has
been terminated mostly for the purposes of receiving a final ACK to
a remote cancel command. Also, add an internal `Channel._cancel_called`
flag which can be set by `Portal.cancel_actor()`.
It's definitely possible to have a nursery spawn task be cancelled
before a `trio.Process` handle is ever returned; we now handle this
case as a cancelled-during-spawn scenario. Zombie collection logic
is also bypassed in this case.
Thanks to @richardsheridan for pointing out the limitations of using
*any* kind of value as the result-cached-flag and how it might cause
problems for anyone returning pickled blob-data. This changes the
`Portal` internal result value tracking to stash the full message from
which the value can be retrieved by any `Portal.result()` caller.
The internal change is that `Portal._return_once()` now returns a tuple
of the message *and* its value.
Fixes the issue where if the main remote task returns `None`,
`Portal.result()` would erroneously wait again on the underlying feeder
mem chan since `None` was being used as the cache flag. Instead set the
flag as the channel uid and consider the result collected when set to
anything else (since it would be odd to return that value from a remote
task when you already can read it as part of portal/channel apis).
The api we've made here is actually closer to `asyncio.gather()` but
with opening async context managers instead of funcs. Use another event
to allow for graceful teardown of children on non-cancellation exits
and add a doc string.
Since it seems we're building out more and more higher level primitives
in order to support certain parallel style actor trees and messaging
patterns (eg. task broadcast channels), we might as well start a new
sub-package for purely `trio` constructions. We hereby dub this
the realm of `trionics` (like electronics but for trios instead of
electrons).
To kick things off, add an `async_enter_all()` concurrent
exit-stack-like context manager API which will concurrently spawn
a sequence of provided async context managers and deliver their ordered
results but with proper support for `trio` cancellation semantics.
The stdlib's `AsyncExitStack` is not compatible with nurseries nor
`trio` tasks (which are cancelled) since a task will be suspended on
the stack after push and does not ever hit a checkpoint until the stack
is closed.
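
As a sketch of intended usage (the exact signature of
`async_enter_all()` is assumed here, as is the import path):

    from contextlib import asynccontextmanager as acm

    import trio
    from tractor import trionics  # new sub-package

    @acm
    async def worker(n: int):
        # stand-in for any async context manager, eg. a
        # `Portal.open_context()` call
        await trio.sleep(0.1)
        yield n

    async def main():
        # concurrently enter all managers; results are delivered in
        # the order the managers were passed in.
        async with trionics.async_enter_all(
            worker(0), worker(1), worker(2),
        ) as results:
            assert tuple(results) == (0, 1, 2)

    trio.run(main)
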
This is actually surprisingly easy to grok having gone through a lot of
pain understanding edge cases in the zombie lord dev branch. Basically
we just need to make sure actors are managed in a 2 step reap sequence.
In the "soft" reap phase we wait for the process to terminate on its own
concurrently with (maybe) waiting for its portal's final result (if it's
a `.run_in_actor()`). If this path is cancelled or errors, then we do
a "hard" reap where we timeout and send a signal to the proc to
terminate immediately. The only last remaining trick is to tie in the
root-is-debugger-aware logic to yet again avoid tty clobbers.
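
Roughly, the reap sequence reads like this sketch (simplified; the
timeout value and helper shape are illustrative, not the exact
implementation):

    import trio

    async def reap_proc(proc: trio.Process, portal=None, timeout: float = 3):
        try:
            # "soft" reap: wait for the proc to exit on its own while
            # (maybe) also waiting on the portal's final result.
            async with trio.open_nursery() as nursery:
                nursery.start_soon(proc.wait)
                if portal is not None:
                    nursery.start_soon(portal.result)

        except BaseException:
            # "hard" reap: cancelled or errored, so give the proc
            # a grace period then signal it to terminate immediately.
            with trio.move_on_after(timeout) as cs:
                cs.shield = True
                await proc.wait()
            if cs.cancelled_caught:
                proc.kill()
            raise
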
As for `Actor.cancel()` requests, do the same for
`Actor._cancel_task()` but use `_invoke()` to ensure
correct msg transactions with caller. Don't cancel task
cancels on a cancel-all-tasks operation in an attempt at
more determinism.
Now that we're on our way to a (somewhat) serious beta release I think
it's about time to start de-noising the logging emissions. Since we're
trying out this approach of "stack layer oriented" log levels, I figured
this is a good time to move most of the "warnings" to what they should
be: cancellation monitoring status messages. The level is set to 16
which is just above our "runtime" level but just below the traditional
"info" level. I think this will be a decent approach since usually if
you're confused about why your `tractor` app is behaving unlike you
expect, it's 90% of the time going to be to do with cancellation or
error propagation. With this setup a user can specify the 'cancel' level
and see all the msgs pertaining to both actor and task-in-actor
cancellation mechanics.
The stdlib's `logging.LoggingAdapter` doesn't currently pass through
`stacklevel: int` down to its wrapped logger instance. Hack it here
and get our msgs looking like they would if using a built-in level.
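
For reference, wiring in such an intermediate level with the stdlib
looks roughly like the following sketch (the exact stacklevel offset
and adapter shape here are illustrative):

    import logging

    CANCEL = 16  # above "runtime"-ish levels, just below INFO (20)
    logging.addLevelName(CANCEL, 'CANCEL')

    class StackLevelAdapter(logging.LoggerAdapter):
        def cancel(self, msg: str, *args, **kwargs) -> None:
            # pass `stacklevel` through so records point at the
            # caller's frame like the built-in level methods do.
            return self.log(CANCEL, msg, *args, stacklevel=3, **kwargs)

    logging.basicConfig(level=CANCEL)
    log = StackLevelAdapter(logging.getLogger('tractor'), {})
    log.cancel('remote task cancel request sent')
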
In an effort to have some kind of more formal interface around the
transport layer, add a `MsgTransport` protocol type and use with
the channel composition of message streams. Start a little "key map"
of `(<codec>, <protocol>)` to `MsgTransport` types which can be
dynamically loaded. Add a `Channel.from_stream()` constructor thus
cleaning up the mangled logic that was in the constructor based on
inputs. Drop all the "auto reconnect" channel logic for now since
nothing is using it (internally) and it's likely it will need rework
once we bring in a protocol besides TCP.
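
Condensed, the shape is something like this sketch (member names are
indicative of the description above, not exact):

    from typing import Any, Protocol, runtime_checkable

    import trio

    @runtime_checkable
    class MsgTransport(Protocol):
        stream: trio.SocketStream

        async def send(self, msg: Any) -> None:
            ...

        async def recv(self) -> Any:
            ...

    # the "key map": (codec, protocol) -> transport type, which can be
    # dynamically loaded when constructing a `Channel`
    _msg_transports: dict[tuple[str, str], type] = {
        # eg. ('msgpack', 'tcp'): MsgpackTCPStream,
    }
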
`msgspec` sends python lists over the wire
(https://github.com/jcrist/msgspec/issues/30) which is fine and dandy
but we use them as lookup keys so we need to be sure we tuple-cast
first.
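
I.e. along these lines (shown with the modern `msgspec.msgpack` API):

    import msgspec

    wire = msgspec.msgpack.encode({'uid': ('actor_name', 'e4a0...')})
    msg = msgspec.msgpack.decode(wire)

    assert isinstance(msg['uid'], list)  # tuples come back as lists
    key = tuple(msg['uid'])              # so cast before dict-keying
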
This changes some super old (and bad) code from the project's very early
days. For some redic reason i must have thought masking `trio`'s
internal stream / transport errors and a TCP EOF as `StopAsyncIteration`
was somehow a good idea. The reality is you probably
want to know the difference between an unexpected transport error
and a simple EOF lol. This begins to resolve that by adding our own
special `TransportClosed` error to signal the "graceful" termination of
a channel's underlying transport. Oh, and this builds on the `msgspec`
integration which helped shed light on the core issues here B)
Add a `tractor._ipc.MsgspecStream` type which can be swapped in for
`msgspec` serialization transparently. A small msg-length-prefix
framing is implemented as part of the type and we use
`tricycle.BufferedReceiveStream` to handle buffering logic for the
underlying transport (a sketch of the framing idea follows the notes
below).
Notes:
- had to force cast a few more list -> tuple spots due to no native
`tuple`-decode-by-default in `msgspec`: https://github.com/jcrist/msgspec/issues/30
- the framing can be understood by this protobuf walkthrough:
https://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
- `tricycle` becomes a new dependency
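
The framing idea itself boils down to something like this (sketch
only; the exact header size/endianness used internally may differ):

    import struct

    import msgspec

    encode = msgspec.msgpack.Encoder().encode

    def frame(msg) -> bytes:
        # length header followed by the msgpack payload
        payload = encode(msg)
        return struct.pack('<I', len(payload)) + payload

    # receive side: read the header, then exactly that many payload
    # bytes, eg. via `tricycle.BufferedReceiveStream.receive_exactly()`:
    #
    #   header = await recv_stream.receive_exactly(4)
    #   size, = struct.unpack('<I', header)
    #   payload = await recv_stream.receive_exactly(size)
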
Can only really use an encoder currently since there is no streaming
api in `msgspec` as of yet. See jcrist/msgspec#27.
Not sure if any encoding speedups are currently noticeable especially
without any validation going on yet XD.
First experiments toward #196
- drop `shield` input to `MsgStream`
- check for cancel called prior to loading the feeder mem chan
in `Context.open_stream()`
- warn on a timeout when trying to cancel a remote task from
`Context.cancel()`
- drop noop endofchannel handler block
The `collections.deque` takes care of array length truncation of values
for us implicitly but in the future we'll likely want this value exposed
to alternate array implementations. This patch is to provide for that as
well as make `mypy` happy since the `deque.maxlen` can also be `None`.
Get rid of all the (requirements for) clones of the underlying
receivable. We can just use a uuid generated key for each instance
(thinking now this can probably just be `id(self)`). I'm fully convinced
now that channel cloning is only a source of confusion and anti-patterns
when we already have nurseries to define resource lifetimes. There is no
benefit in particular when you allocate subscriptions using a context
manager (not sure why `trio.open_memory_channel()` doesn't enforce
this).
Further refinements:
- add a `._closed` state that will error the receiver on reuse
- drop module script section; it's been moved to a real test
- call the "receiver" duck-type stub a new name
This allows for wrapping an existing stream by re-assigning its receive
method to the allocated broadcaster's `.receive()` so as to avoid
expecting any original consumer(s) of the stream to now know about the
broadcaster; this instead mutates the stream to delegate to the new
receive call behind the scenes any time `.subscribe()` is called.
Add a `typing.Protocol` for so called "cloneable channels" until we
decide/figure out a better keying system for each subscription and
mask all undesired typing failures.
Add `ReceiveMsgStream.subscribe()` which allows allocating a broadcast
receiver around the stream for use by multiple actor-local consumer
tasks. Entering this context manager idempotently mutates the stream's
receive machinery which for now can not be undone. Move `.clone()` to
the receive stream type.
Resolves #204
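
Usage then looks like this sketch, given an already opened
`MsgStream`:

    import trio

    async def fan_out(stream, n_consumers: int = 4):

        async def consume(name: str):
            # each task gets its own broadcast receiver; entering
            # `.subscribe()` (idempotently) swaps in the broadcasting
            # receive machinery on first use.
            async with stream.subscribe() as brx:
                async for value in brx:
                    print(name, value)

        async with trio.open_nursery() as tn:
            for i in range(n_consumers):
                tn.start_soon(consume, f'task_{i}')
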
For every set of broadcast receivers which pull from the same producer,
we need a singleton state for all of,
- subscriptions
- the sender ready event
- the queue
Add a `BroadcastState` dataclass for this and pass it to all
subscriptions. This makes the design much more like the built-in memory
channels which do something very similar with `MemoryChannelState`.
Use a `filter()` on the subs list in the sequence update step, plus some
other commented approaches we can try for speed.
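
I.e. something shaped like the following (field names follow the list
above; exact shapes are indicative only):

    from collections import deque
    from dataclasses import dataclass, field

    import trio

    @dataclass
    class BroadcastState:
        '''Shared state for all receivers pulling from one producer.'''
        queue: deque
        # map of subscription key -> sequence index into the queue
        subs: dict = field(default_factory=dict)
        # the sender-ready wakeup event
        sender_ready: trio.Event = field(default_factory=trio.Event)
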
Using the current task as a subscription key fails horribly as soon as
you hand off a new subscription receiver to another task you've spawned..
Instead use the underlying ``trio.abc.ReceiveChannel.clone()`` as a key
(so i guess we're assuming cloning is supported by the underlying?)
which makes this all work just like default mem chans. As a bonus, now
we can just close the underlying rx (which may be a clone) on
`.aclose()` and everything should just work in terms of the underlying
channels lifetime (i think?).
Change `.subscribe()` to be async since the receive channel type
interface only expects `.aclose()` and it actually ends up being
nicer for 3.9+ style `async with` parentheses style anyway.
Buncha improvements:
- pass in the queue via constructor
- track overall underlying memory channel closure using cloning
- do it like `tokio` and set lagged consumers to the last sequence
before raising
- copy the subs on first receiver wakeup for iteration instead of
iterating the table directly (and being forced to skip the current
tasks sequence increment)
- implement `.aclose()` to close the underlying clone for this task
- make `broadcast_receiver()` just take the recv chan since it doesn't
need anything on the send side.
We're not actually using this but it's for reference if we do end up
needing it.
The std lib's `pdb` internals override SIGINT handling whenever one
enters the debugger repl. Force a handler that kills the tree if SIGINT
is triggered from the root actor, otherwise ignore it since supervised
children should be managed already. This resolves an issue with guest
mode where `pdb` causes SIGINTs to be swallowed resulting in the host
loop never terminating the process tree.
The whole origin was not having an explicit open/close semantic for
streams. We have that now so this internal mechanic isn't needed and
further our streams become more correct by having `.aclose()` be
independent of cancellation.
Finally this makes a cancelled root actor nursery not clobber child
tasks which request and lock the root's tty for the debugger repl.
Using an edge triggered event which is set after all fifo-lock-queued
tasks are complete, we can be sure that no lingering child tasks are
going to get interrupted during pdb use and tty lock acquisition.
Further, even if new tasks do queue up to get the lock, the root will
incrementally send cancel msgs to each sub-actor only once the tty is
not locked by a (set of) child request task(s). Add shielding around all
the critical sections where the child attempts to allocate the lock from
the root such that it won't be disrupted by cancel messages from the
root after the acquire lock transaction has started.
If the root calls `trio.Process.kill()` on immediate child proc teardown
when the child is using pdb, we can get stdstreams clobbering that
results in a pdb++ repl where the user can't see what's been typed. Not
killing such children on cancellation / error seems to resolve this
issue whilst still giving reliable termination. For now, keep that
special-cased path until it becomes a problem for ensuring zombie
reaps.
A context is the natural fit (vs. a receive stream) for locking the root
proc's tty usage via its `.started()` sync point. Simplify the
`_breakpoint()` routine to be a simple async func instead of all this
"returning a coroutine" stuff from before we decided that
`tractor.breakpoint()` must be async. Use `runtime` level for locking
logging making it easier to trace.
Another face palm that was causing serious issues for code that is using
the `.shielded` feature..
Add a bunch more detailed comments for all this subtlety and hopefully
get it right once and for all. Also aggregated the `trio` errors that
should trigger closure inside `.aclose()`, hopefully that's right too.
Revert this change since it really is poking at internals and doesn't
make a lot of sense. If the context is going to be cancelled then the
msg loop will tear down the feeder memory channel when ready, we don't
need to be clobbering it and confusing the runtime machinery lol.
Add clear teardown semantics for `Context` such that the remote side
cancellation propagation happens only on error or if client code
explicitly requests it (either by exit flag to `Portal.open_context()`
or by manually calling `Context.cancel()`). Add `Context.result()`
to wait on and capture the final result from a remote context function;
any lingering msg sequence will be consumed/discarded.
Changes in order to make this possible:
- pass the runtime msg loop's feeder receive channel in to the context
on the calling (portal opening) side such that a final 'return' msg
can be waited upon using `Context.result()` which delivers the final
return value from the callee side `@tractor.context` async function.
- always await a final result from the target context function in
`Portal.open_context()`'s `__aexit__()` if the context has not
been (requested to be) cancelled by client code on block exit.
- add an internal `Context._cancel_called` for context "cancel
requested" tracking (much like `trio`'s cancel scope).
- allow flagging a stream as terminated using an internal
`._eoc` flag which will mark the stream as stopped for iteration.
- drop `StopAsyncIteration` catching in `.receive()`; it does
nothing.
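
In caller code the new semantics then read roughly like this sketch
(the remote `@tractor.context` function is passed in as a parameter
here to keep it generic):

    async def caller(portal, remote_task):
        async with portal.open_context(remote_task) as (ctx, first):

            async with ctx.open_stream() as stream:
                async for msg in stream:
                    ...

            # explicitly wait on the callee's final `return` value;
            # any lingering stream msgs are consumed/discarded.
            result = await ctx.result()

            # alternatively, explicitly request remote cancellation:
            # await ctx.cancel()

        # with no error and no cancel request, `__aexit__()` blocks
        # until that final result arrives anyway.
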
This mostly adds the api described in
https://github.com/goodboy/tractor/issues/53#issuecomment-806258798
The first draft summary:
- formalize bidir streaming using the `trio.Channel` style interface
which we derive as a `MsgStream` type.
- add `Portal.open_context()` which provides a `trio.Nursery.start()`
remote task invocation style for setting up and tearing down tasks
contexts in remote actors.
- add a distinct `'started'` message to the ipc protocol to facilitate
`Context.start()` with a first return value.
- for our `ReceiveMsgStream` type, don't cancel the remote task in
`.aclose()`; this is now done explicitly by the surrounding `Context`
usage: `Context.cancel()`.
- streams in either direction still use a `'yield'` message keeping the
proto mostly symmetric without having to worry about which side is the
caller / portal opener.
- subtlety: only allow sending a `'stop'` message during a 2-way
streaming context from `ReceiveStream.aclose()`, detailed comment
with explanation is included.
Relates to #53
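
Putting the summary above together, a callee-side endpoint then looks
roughly like this sketch:

    import tractor

    @tractor.context
    async def echo(
        ctx: tractor.Context,
        greeting: str,
    ) -> str:
        # the new `'started'` msg delivers a first value back to the
        # caller's `Portal.open_context()` entry, `.start()`-style.
        await ctx.started(greeting)

        # bidirectional `trio.Channel`-style stream; both directions
        # still use `'yield'` msgs under the hood.
        async with ctx.open_stream() as stream:
            async for msg in stream:
                await stream.send(msg)

        # the plain `return` value is what the caller's
        # `Context.result()` eventually delivers.
        return 'done'
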
Since we currently have no real "discovery protocol" between process
trees, the current naive approach is to check via a connect and drop to
see if a TCP server is bound to a particular address during root actor
startup. This was a historical decision and had no real grounding beyond
taking a simple approach to get something working when the project
was first started.
This is obviously problematic from an error handling perspective since
we need to be able to avoid such quick connect-and-drops from cancelling
an "arbiter"'s (registry actor's) channel-msg loop machinery (which
would propagate and cancel the actor).
For now we map this particular TCP error, which gets remapped by `trio`
as a `trio.BrokenResourceError` to our own internal `TransportClosed`
which is swallowed by channel message loop processing and indicates
a graceful teardown of the far end actor.
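
In the transport's receive path this amounts to a remap along these
lines (sketch; the import location for `TransportClosed` is assumed):

    import trio
    from tractor._exceptions import TransportClosed

    async def receive_some(
        stream: trio.SocketStream,
        max_bytes: int = 2**10,
    ) -> bytes:
        try:
            return await stream.receive_some(max_bytes)
        except trio.BrokenResourceError as err:
            # a connect-then-drop (eg. the port-in-use probe at root
            # actor startup) surfaces as this; remap it so the msg
            # loop treats it as a graceful far-end teardown instead
            # of crashing the actor.
            raise TransportClosed(
                'transport closed prior to read',
            ) from err
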
It's clear now that special attention is needed to handle the case where
a spawned `multiprocessing` proc is started but then the parent is
cancelled before the child can connect back; in this case we need to be
sure to kill the near-zombie child asap. This may end up being the
solution to other resiliency issues seen around mp with nested process
trees too. More testing is needed to be sure.
Relates to #84 #89 #134 #146
NB: this is a breaking change removing support for `Portal.run()` being
able to invoke remote streaming functions and instead replacing the
method call with an async context manager api `Portal.open_stream_from()`
This style explicitly defines stream teardown at the call site instead
of expecting the user to handle tricky things correctly themselves: eg.
`async_generator.aclosing()`. Going forward `Portal.run()` can be used
only for invoking async functions.
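
The new call-site style reads like this sketch (the async generator
func is illustrative):

    import trio
    import tractor

    async def counter(limit: int):
        for i in range(limit):
            yield i

    async def main():
        async with tractor.open_nursery() as an:
            portal = await an.start_actor(
                'streamer', enable_modules=[__name__],
            )

            # stream teardown is now explicit at the call site
            async with portal.open_stream_from(counter, limit=10) as stream:
                async for value in stream:
                    print(value)

            # while `Portal.run()` is reserved for plain async funcs
            await portal.cancel_actor()

    if __name__ == '__main__':
        trio.run(main)
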
Move receive stream into streaming modules and rebrand as a "message
stream". Factor out cancellation mechanics in `.aclose()` into the
`Context` type which will soon provide the api for cancelling portal
invocations. Comment-stage a few methods on both types in anticipation
of a new bi-directional streaming api. Add a `MsgStream` bidirectional
channel type which will be the eventual type yielded from
`Context.open_stream()`. Adjust the response/dialog types to be the set
`{'asyncfun', 'asyncgen', 'context'}`. OH, and add async func checking
in `Portal.run()` to catch and error on sync funcs early.
You can always wrap a sync function in an async one and there seems to
be no good reason to support invoking them directly especially since
cancellation won't work without some thread hackery. If it's requested
we'll point users to `trio-parallel`.
Resolves #77
Add a sync method that can be used to cancel the current actor from
a synchronous context. This is useful in debugging situations where
sync debugger code may need to kill the process tree.
Also, make the internal "lifetime stack" a global var; easier to manage
from client code that may want to add callbacks prior to the actor
runtime being fully setup.
Using `None` as the default key for a `@msg.pub` can cause conflicts if
there is more then one "taskless" (no tasks={,} passed) pub offered on
an actor... So instead use the first trio "task name" (usually just the
function name) instead thus avoiding this very hard to debug and
understand problem.
Probably should throw in a test but I'm super lazy today.
This begins the move to dropping support for `tractor.run()` which we
don't really need since the runtime is started (as it always has been)
from a new sub-task / nursery. Instead this introduces starting the
actor tree through a `open_root_actor()` async context manager which
we'll likely implicitly call (from the root) on the first use of an
actor nursery.
Drop `_actor._start_actor()` and factor its contents into this new api.
Make `run()` and `run_daemon()` use `open_root_actor()` until we decide
to remove them.
Relates to #168 and #177
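
Startup then reads like this sketch:

    import trio
    import tractor

    async def main():
        # explicitly start the runtime (instead of `tractor.run()`)
        async with tractor.open_root_actor(
            # registry addr, debug flags etc. can also be passed here
            loglevel='cancel',
        ):
            async with tractor.open_nursery() as an:
                portal = await an.start_actor('worker')
                await portal.cancel_actor()

    if __name__ == '__main__':
        trio.run(main)
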
It turns out in order to maintain our sneaky little "call an `Actor`
method in this remote process" we still need the ability to invoke
functions from a namespace. We're currently using a "self" namespace as
a way to do this for internal inter-process method calling. Either way,
I see no reason not to keep a public method for this invoke style (we
just won't market it) since it is still how the machinery works
underneath.
This resolves and completes #69 allowing all RPC invocation APIs to pass
function references directly instead of explicit `str` names for the
target namespace and function (this is still done implicitly
underneath). This brings us closer to `trio`'s task running API as well
as acknowledges that any inter-host RPC system (and API) will likely
need to be implemented on top of local RPC primitives anyway. Even if
this ends up **not** being true we can always go to "function stubs" as
part of our IAC protocol or, add a new method to do explicit namespace
calls: `.run_from_module()` or whatever everyone votes on.
Resolves #69
Further, this commit drops `Actor.statespace` from the entire system
since a user can easily get this same functionality using module
level variables. Fix docs to match all these changes (luckily mostly
already done since the docs reference the example scripts).
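
Call-sites thus go from `str` names to plain function refs, eg. this
sketch:

    import trio
    import tractor

    async def add(a: int, b: int) -> int:
        return a + b

    async def main():
        async with tractor.open_nursery() as an:
            portal = await an.start_actor(
                'calc', enable_modules=[__name__],
            )

            # the namespace/function-name resolution still happens
            # implicitly underneath.
            assert await portal.run(add, a=1, b=2) == 3

            await portal.cancel_actor()

    if __name__ == '__main__':
        trio.run(main)
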
Add a ``tractor._portal.StreamReceiveChannel.shield_channel()`` context
manager which allows for avoiding the closing of an IPC stream's
underlying channel for the purposes of task re-spawning. Sometimes you
might want to cancel a task consuming a stream but not tear down the IPC
between actors (the default). A common use can might be where the task's
"setup" work might need to be redone but you want to keep the
established portal / channel in tact despite the task restart.
Includes a test.
Turns out this is a lower level issue in terms of the stdlib's default
`pdb.Pdb` settings and how they conflict with `trio`'s cancellation and
KBI handling. The details are hashed out more thoroughly in
python-trio/trio#1155. Maybe we can get a fix in trio so things are
solved under our feet :)
The channel server should be torn down *before* the rpc
task/service nursery. Do this explicitly even in the root's main task
to avoid a strange hang I found in the pubsub tests. Start dropping
the `warnings.warn()` usage.
Add `Actor._cancel_called` and `._cancel_complete` making it possible to
determine whether the actor has started the cancellation sequence and
whether that sequence has fully completed. This allows for blocking in
internal machinery tasks as necessary. Also, always trigger the end of
ongoing rpc tasks even if the last task errors; there's no guarantee
trio's cancellation semantics will leave us with a nice internal "state"
without this.
For reliable remote cancellation we need to "report" `trio.Cancelled`s
(just like any other error) when exhausting a portal such that the
caller can make decisions about cancelling the respective actor if need
be.
Resolves #156
Every subactor in the tree now receives the socket (or whatever the
mailbox type ends up being) during startup and can call the new
`tractor._discovery.get_root()` function to get a portal to the current
root actor in their tree. The main reason for adding this atm is to
support nested child actors gaining access to the root's tty lock for
debugging.
Also, when a channel disconnects from a message loop, might as well kill
all its rpc tasks.
It's not like any of this code is really being used anyway since we
aren't indefinitely blocking for cancelled subactors to terminate (yet).
Drop the `do_hard_kill()` bit for now and just rely on the underlying
process api. Oh, and mark the nursery as cancelled asap.
Seems like the request task cancel scope is actually solving all the
deadlock issues and masking SIGINT isn't changing much behaviour at all.
I think let's keep it unmasked for now in case it does turn out useful
in cancelling from unrecoverable states while in debug.
This is needed in order to avoid the deadlock condition where
a child actor is waiting on the root actor's tty lock but its parent
(possibly the root) is waiting on it to terminate after sending a cancel
request. The solution is simple: create a cancel scope around the
request in the child and always cancel it when a cancel request from the
parent arrives.
There seems to be no good reason not to since our cancellation
machinery/protocol should do this work when the root receives the
signal. This also (hopefully) helps with some debugging race condition
stuff.
This seems to prevent a certain class of bugs to do with the root actor
cancelling local tasks and getting into deadlock while children are
trying to acquire the tty lock. I'm not sure it's the best idea yet
since you're pretty much guaranteed to get "stuck" if a child activates
the debugger after the root has been cancelled (at least "stuck" in
terms of SIGINT being ignored). That kinda race condition seems to still
exist somehow: a child can "beat" the root to activating the tty lock
and the parent is stuck waiting on the child to terminate via its
nursery.
This aids with tearing down resources **after** the crash handling and
debugger have completed. Leaving this internal for now but should
eventually get a public convenience function like
`tractor.context_stack()`.
Keep an actor local (bool) flag which determines if there is already
a running debugger instance for the current process. If another task
tries to enter in this case, simply ignore it since allowing entry may
result in a deadlock where the new task will be sync waiting on the
parent stdio lock (a case that will never arrive due to the current
debugger's active use of it).
In the future we may want to allow FIFO queueing of local tasks where
instead of ignoring re-entrant breakpoints we allow tasks to async wait
for debugger release, though not sure the implications of that since
you'd likely want to support switching the debugger to the new task and
that could cause deadlocks where tasks are inter-dependent. It may be
more sane to just error on multiple breakpoint requests within an actor.
This is the first step in addressing #113 and the initial support
of #130. Basically this allows (sub)processes to engage the `pdbpp`
debug machinery which read/writes the root actor's tty but only in
a FIFO semaphored way such that no two processes are using it
simultaneously. That means you can have multiple actors enter a trace or
crash and run the debugger in a sensible way without clobbering each
other's access to stdio. It required adding some "tear down hooks" to
a custom `pdbpp.Pdb` type such that we release a child's lock on the
parent on debugger exit (in this case when either of the "continue" or
"quit" commands are issued to the debugger console).
There's some code left commented in anticipation of full support for
issue #130 where we'll need to actually capture and feed stdin to the
target (remote) actor which won't necessarily be running on the same
host.
Allow entering and attaching to a `pdb` instance in a child process.
The current hackery is to have the child make an rpc to the parent and
ask it to hijack stdin, once complete the child enters a `pdb` blocking
method. The parent then relays all stdin input to the child thus
controlling the "remote" debugger.
A few things were added to accomplish this:
- tracking the mapping of subactors to their parent nurseries
- in the root actor, cancelling all nurseries under the root `trio` task
on cancellation (i.e. `Actor.cancel()`)
- pass a "runtime vars" map down the actor tree for propagating global state
In an effort to acquire more deterministic actor cancellation,
this adds a clearer and more resilient (whilst possibly a bit
slower) internal nursery structure with explicit semantics for
clarifying the task-scope shutdown sequence.
Namely, on cancellation, the explicit steps are now:
- cancel all currently running rpc tasks and wait
for them to complete
- cancel the channel server and wait for it to complete
- cancel the msg loop for the channel with the immediate parent
- de-register with arbiter if possible
- wait on remaining connections to release
- exit process
To accomplish this add a new nursery called the "service nursery" which
spawns all rpc tasks **instead of using** the "root nursery". The root
is now used solely for async launching the msg loop for the primary
channel with the parent such that it is (nearly) the last thing torn
down on cancellation.
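
Structurally this boils down to roughly the following (heavily
simplified) sketch; the two callables are stand-ins for the actual
runtime machinery:

    import trio

    async def actor_runtime(serve_rpc, run_parent_msg_loop):
        # `serve_rpc`: stand-in for the rpc/channel-server machinery,
        # `run_parent_msg_loop`: stand-in for the parent-channel msg loop.
        async with trio.open_nursery() as root_nursery:
            async with trio.open_nursery() as service_nursery:

                # all rpc tasks (and the channel server) are spawned
                # here so they can be cancelled and awaited *before*
                # the msg loop with the parent.
                service_nursery.start_soon(serve_rpc)

                # launched in the root nursery so it is (nearly) the
                # last thing torn down on cancellation.
                root_nursery.start_soon(run_parent_msg_loop)
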
In the future it should also be possible to have `self.cancel()` return
a result to the parent once the runtime is sure that the rest of the
shutdown is atomic; this would allow for a true unbounded shield in
`Portal.cancel_actor()`. This will likely require that the error
handling blocks in `Actor._async_main()` are moved "inside" the root
nursery block such that the msg loop with the parent truly is the last
thing to terminate.