It's not like any of this code is really being used anyway since we
aren't indefinitely blocking for cancelled subactors to terminate (yet).
Drop the `do_hard_kill()` bit for now and just rely on the underlying
process api. Oh, and mark the nursery as cancelled asap.
Seems like the request task cancel scope is actually solving all the
deadlock issues and masking SIGINT isn't changing much behaviour at all.
Let's keep it unmasked for now in case it turns out to be useful
for cancelling from unrecoverable states while in debug.
This is needed in order to avoid the deadlock condition where
a child actor is waiting on the root actor's tty lock but its parent
(possibly the root) is waiting on it to terminate after sending a cancel
request. The solution is simple: create a cancel scope around the
request in the child and always cancel it when a cancel request from the
parent arrives.
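In sketch form (the names here are illustrative, not the actual
internals):

```python
from typing import Optional
import trio

# actor-local reference to any in-flight tty lock request scope
_debug_request_cs: Optional[trio.CancelScope] = None

async def request_root_tty_lock(portal):
    # child-side request task: block until the root grants the tty
    # lock, but do so inside a scope we can cancel at any time
    global _debug_request_cs
    with trio.CancelScope() as cs:
        _debug_request_cs = cs
        # illustrative rpc to the root/parent actor
        await portal.run('tractor._debug', 'hijack_stdin')

def cancel_requested_by_parent():
    # on a cancel request from the parent, always tear down any
    # in-flight tty lock request first to break the deadlock
    if _debug_request_cs is not None:
        _debug_request_cs.cancel()
```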
There seems to be no good reason not to since our cancellation
machinery/protocol should do this work when the root receives the
signal. This also (hopefully) helps with some debugging race condition
stuff.
This seems to prevent a certain class of bugs to do with the root actor
cancelling local tasks and getting into deadlock while children are
trying to acquire the tty lock. I'm not sure it's the best idea yet
since you're pretty much guaranteed to get "stuck" if a child activates
the debugger after the root has been cancelled (at least "stuck" in
terms of SIGINT being ignored). That kinda race condition seems to still
exist somehow: a child can "beat" the root to acquiring the tty lock
and the parent is stuck waiting on the child to terminate via its
nursery.
This aids with tearing down resources **after** the crash handling and
debugger have completed. Leaving this internal for now but should
eventually get a public convenience function like
`tractor.context_stack()`.
Keep an actor local (bool) flag which determines if there is already
a running debugger instance for the current process. If another task
tries to enter in this case, simply ignore it since allowing entry may
result in a deadlock where the new task will be sync waiting on the
parent stdio lock (an acquisition that will never complete due to the
current debugger's active use of it).
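Roughly, the guard looks like this (a sketch; the flag and helper
names are made up):

```python
_debugger_active: bool = False  # actor-local (per process) flag

async def maybe_enter_debugger():
    global _debugger_active
    if _debugger_active:
        # another local task already owns the debugger; entering
        # would sync wait forever on the parent stdio lock
        return
    _debugger_active = True
    try:
        await acquire_root_tty_lock()  # hypothetical helper
        enter_pdb()                    # hypothetical helper
    finally:
        _debugger_active = False
```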
In the future we may want to allow FIFO queueing of local tasks where
instead of ignoring re-entrant breakpoints we allow tasks to async wait
for debugger release, though I'm not sure of the implications of that since
you'd likely want to support switching the debugger to the new task and
that could cause deadlocks where tasks are inter-dependent. It may be
more sane to just error on multiple breakpoint requests within an actor.
This is the first step in addressing #113 and the initial support
of #130. Basically this allows (sub)processes to engage the `pdbpp`
debug machinery which reads/writes the root actor's tty but only in
a FIFO semaphored way such that no two processes are using it
simultaneously. That means you can have multiple actors enter a trace or
crash and run the debugger in a sensible way without clobbering each
other's access to stdio. It required adding some "tear down hooks" to
a custom `pdbpp.Pdb` type such that we release a child's lock on the
parent on debugger exit (in this case when either of the "continue" or
"quit" commands are issued to the debugger console).
There's some code left commented in anticipation of full support for
issue #130 where we'll need to actually capture and feed stdin to the
target (remote) actor which won't necessarily be running on the same
host.
Allow entering and attaching to a `pdb` instance in a child process.
The current hackery is to have the child make an rpc to the parent and
ask it to hijack stdin; once complete, the child enters a `pdb` blocking
method. The parent then relays all stdin input to the child thus
controlling the "remote" debugger.
A few things were added to accomplish this:
- tracking the mapping of subactors to their parent nurseries
- in the root actor, cancelling all nurseries under the root `trio` task
on cancellation (i.e. `Actor.cancel()`)
- pass a "runtime vars" map down the actor tree for propagating global state
In an effort to acquire more deterministic actor cancellation,
this adds a clearer and more resilient (whilst possibly a bit
slower) internal nursery structure with explicit semantics for
clarifying the task-scope shutdown sequence.
Namely, on cancellation, the explicit steps are now:
- cancel all currently running rpc tasks and wait
for them to complete
- cancel the channel server and wait for it to complete
- cancel the msg loop for the channel with the immediate parent
- de-register with arbiter if possible
- wait on remaining connections to release
- exit process
To accomplish this add a new nursery called the "service nursery" which
spawns all rpc tasks **instead of using** the "root nursery". The root
is now used solely for async launching the msg loop for the primary
channel with the parent such that it is (nearly) the last thing torn
down on cancellation.
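Structurally it looks something like this (illustrative only; the
task names are stand-ins):

```python
import trio

async def process_parent_messages():
    await trio.sleep_forever()  # stand-in for the parent msg loop

async def some_rpc_task():
    await trio.sleep(1)  # stand-in for a spawned rpc task

async def _async_main():
    async with trio.open_nursery() as root_nursery:
        # all rpc tasks go in the inner "service" nursery so they
        # are cancelled and awaited *before* the parent msg loop
        async with trio.open_nursery() as service_nursery:
            service_nursery.start_soon(some_rpc_task)
            # the msg loop with the parent lives in the *root*
            # nursery making it (nearly) the last thing torn down
            root_nursery.start_soon(process_parent_messages)
```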
In the future it should also be possible to have `self.cancel()` return
a result to the parent once the runtime is sure that the rest of the
shutdown is atomic; this would allow for a true unbounded shield in
`Portal.cancel_actor()`. This will likely require that the error
handling blocks in `Actor._async_main()` are moved "inside" the root
nursery block such that the msg loop with the parent truly is the last
thing to terminate.
Always shield waiting for the process and always run
``trio.Process.__aexit__()`` on teardown. This enforces
that shutdown happens due to cancellation triggered inside
the sub-actor instead of the process being killed externally
by the parent.
Trio will kill subprocesses via `Process.__aexit__()` using a `finally:`
block (which, yes, will get triggered on cancellation) so we avoid that
until true process "tear down" since subactors do many things during
graceful shutdown (such as de-registering from the name discovery
system). Oddly this only seems to be an issue during cancellation of
infinite stream consumption.
Resolves #141
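The gist of the shielding, as a sketch (the trio of this era supports
using `Process` as an async context manager):

```python
import trio

async def wait_then_teardown(proc: trio.Process):
    # shield the wait: cancellation from the parent must not trip
    # ``Process.__aexit__()``'s kill-on-cancel ``finally:`` logic
    # before the subactor finishes its graceful shutdown
    with trio.CancelScope(shield=True):
        await proc.wait()
    # only now run trio's normal teardown; the process has already
    # exited so no hard kill occurs
    await proc.__aexit__(None, None, None)
```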
In order to have reliable subactor startup we need the following
sequence to take place:
- connect to the parent actor, handshake and receive runtime state
- load exposed modules into memory
- start the channel server up fully using the provided bind address
- finally, start processing new messages from the parent
Add a bunch more comments to clarify all this.
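In sketch form (all names hypothetical):

```python
async def _startup(actor, parent_addr, bind_addr):
    # 1. connect to the parent, handshake, receive runtime state
    chan = await connect_and_handshake(parent_addr)
    # 2. load exposed modules into memory
    actor.load_modules()
    # 3. fully start the channel server on the provided bind address
    await actor.start_server(bind_addr)
    # 4. only then begin processing new messages from the parent
    await process_messages(chan)
```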
Using the context manager interface does some extra teardown beyond simply
calling `.wait()`. Pass the subactor's "uid" on the exec line for
debugging purposes when monitoring the process tree from the OS.
Hard code the child script module path to avoid a double import warning.
This is an edit to factor out changes needed for the `asyncio` in guest mode
integration (which currently isn't tested well) so that later more pertinent
changes (which are tested well) can be rebased off of this branch and
merged into mainline sooner. The *infect_asyncio* branch will need to be
rebased onto this branch as well before merge to mainline.
This is an initial solution for #120.
Allow spawning `asyncio` based actors which run `trio` in guest
mode. This enables spawning `tractor` actors on top of the `asyncio`
event loop whilst still leveraging the SC focused internal actor
supervision machinery. Add a `tractor.to_asyncio.run()` api to allow
spawning tasks on the `asyncio` loop from an embedded (remote) `trio`
task and return or stream results all the way back through the `tractor`
IPC system using a very similar api to portals.
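The guest mode entry point is `trio.lowlevel.start_guest_run()`;
running a `trio` program on top of an `asyncio` loop looks roughly
like this (a sketch, not tractor's actual spawn code):

```python
import asyncio
import trio

async def trio_main():
    await trio.sleep(0.1)
    return 'hello from trio-in-guest-mode'

def run_in_guest_mode():
    loop = asyncio.new_event_loop()
    done = loop.create_future()

    trio.lowlevel.start_guest_run(
        trio_main,
        run_sync_soon_threadsafe=loop.call_soon_threadsafe,
        # invoked with an `outcome.Outcome` when the trio run ends
        done_callback=done.set_result,
    )
    # drive the asyncio loop until the trio program completes
    return loop.run_until_complete(done).unwrap()

print(run_in_guest_mode())
```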
One outstanding problem is getting SC around calls to
`asyncio.create_task()`. Currently a task that crashes isn't able to
easily relay the error to the embedded `trio` task without us fully
enforcing the portals based message protocol (which seems superfluous
given the error ref is in process). Further experiments using `anyio`
task groups may alleviate this.
The logic in the `ActorNursery` block is critical to cancellation
semantics and in particular, understanding how supervisor strategies are
invoked. Stick in a bunch of explanatory comments to clear up these
details and also prepare to introduce more supervisor strats besides
the current one-cancels-all approach.
Instead of hackery trying to map modules manually from the filesystem
let Python do all the work by simply copying what ``multiprocessing``
does to "fixup the __main__ module" in spawned subprocesses. The new
private module ``_mp_fixup_main.py`` is simply cherry picked code from
``multiprocessing.spawn`` which does just that. We only need these
"fixups" when using a backend other then ``multiprocessing``; for
now just when using ``trio_run_in_process``.
Thanks to @salotz for pointing out that the first example in the docs
was broken. Though it's somewhat embarrassing this might also explain
the problem in #79 and certain issues in #59...
The solution here is to import the target RPC module using its
unique basename and absolute filepath in the sub-actor that requires it.
Special handling for `__main__` and `__mp_main__` is needed since the
spawned subprocess will have no knowledge about these
parent-state-specific module variables. Solution: map the module's name to the
respective module file basename in the child process since the module
variables will of course have different values in children.
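A sketch of the child-side mapping (the helper name is made up):

```python
import importlib.util
import os

def load_target_module(mod_name: str, filepath: str):
    # `__main__`/`__mp_main__` refer to parent-process state, so
    # re-import under the module file's basename instead
    if mod_name in ('__main__', '__mp_main__'):
        mod_name = os.path.splitext(os.path.basename(filepath))[0]

    spec = importlib.util.spec_from_file_location(mod_name, filepath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```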
Add a `--spawn-backend` option which can be set to one of {'mp',
'trio_run_in_process'} which will either run the test suite using the
`multiprocessing` or `trio-run-in-process` backend respectively.
Currently trying to run both in the same session can result in hangs
seemingly due to a lack of cleanup of forkservers / resource trackers
from `multiprocessing` which cause broken pipe errors on occasion (no
idea on the details).
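The option itself is a standard `conftest.py` hook; something like
the following (the session fixture is an assumption):

```python
# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--spawn-backend",
        action="store",
        default="trio_run_in_process",
        choices=["mp", "trio_run_in_process"],
        help="actor spawning backend to use for this test session",
    )

@pytest.fixture(scope="session")
def spawn_backend(request) -> str:
    return request.config.getoption("--spawn-backend")
```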
For `test_cancellation.py::test_nested_multierrors`, use less nesting
when mp is used since it breaks if we push it too hard with the
whole recursive subprocess spawning thing...
Set `trio-run-in-process` as the default on *nix systems and
`multiprocessing`'s spawn method on Windows. Enable overriding the
default choice using `tractor._spawn.try_set_start_method()`. Allows
for easy runs of the test suite using a user chosen backend.
This took a ton of tinkering and a rework of the actor nursery tear down
logic. The main changes include:
- each subprocess is now spawned from inside a trio task
from one of two containing nurseries created in the body of
`tractor.open_nursery()`: one for `run_in_actor()` processes and one for
`start_actor()` "daemons". This is to address the need for
`trio_run_in_process.open_in_process()` opening a nursery which must
be closed from the same task that opened it. Using this same approach
for `multiprocessing` seems to work well. The nurseries are waited on
in order (`run_in_actor()` processes then daemon actors) during tear
down, which avoids the recursive re-entry of `ActorNursery.wait()`
that had to be handled previously.
- pull out all the nested functions / closures that were in
`ActorNursery.wait()` and move them into the `_spawn` module such
that the process shutdown logic takes place in each containing task's
code path. This allows for vastly simplifying `.wait()` to just contain an
event trigger which initiates process waiting / result collection.
Likely `.wait()` should just be removed since it can no longer be used
to synchronously wait on the actor nursery.
- drop `ActorNursery.__aenter__()` / `.__aexit__()` and move this
"supervisor" tear down logic into the closing block of `open_nursery()`.
This not only makes the code more comprehensible, it also makes our
nursery implementation look more like the one in `trio` (a structural
sketch follows below).
Resolves #93
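Heavily simplified, the new structure is something like the
following (the `ActorNursery` constructor signature here is
hypothetical):

```python
from contextlib import asynccontextmanager
import trio

@asynccontextmanager
async def open_nursery():
    # one nursery per process "kind"; the inner (run_in_actor) one
    # is exited and waited on first during tear down
    async with trio.open_nursery() as daemon_nursery:
        async with trio.open_nursery() as ria_nursery:
            anursery = ActorNursery(ria_nursery, daemon_nursery)
            try:
                yield anursery
            finally:
                # the "supervisor" tear down logic now lives here
                # instead of in `ActorNursery.__aexit__()`
                daemon_nursery.cancel_scope.cancel()
```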
Get a few more things working:
- fail reliably when remote module loading goes awry
- do a real hacky job of module loading using `sys.path` stuffsies
- we're still totally borked when trying to spin up and quickly cancel
a bunch of subactors...
It's a small move forward I guess.
Prepend the actor and task names in each log emission. This makes
debugging much more sane since you can see which process and running
task each log message originates from!
Resolves #13
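One way to get this effect with stdlib `logging` (a sketch using a
record filter; the exact tractor mechanism may differ):

```python
import logging
import trio
import tractor

class ActorContextFilter(logging.Filter):
    """Stamp each record with the current actor and trio task names."""

    def filter(self, record):
        try:
            record.actor_name = tractor.current_actor().name
        except RuntimeError:  # no actor runtime in this process
            record.actor_name = '<no-actor>'
        try:
            record.task_name = trio.lowlevel.current_task().name
        except RuntimeError:  # not inside a trio run
            record.task_name = '<no-task>'
        return True

log = logging.getLogger('tractor')
log.addFilter(ActorContextFilter())
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(actor_name)s %(task_name)s %(levelname)s %(message)s'
))
log.addHandler(handler)
```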
If a nursery fails to cancel (some sub-actors presumably) then hard kill
the whole process tree to avoid hangs during a catastrophic failure.
This logic may get factored out (and changed) as we introduce custom
supervisor strategies.
`trio.MultiError` isn't an `Exception` (derived instead from
`BaseException`) so we have to specially catch it in the task
invocation machinery and ship it upwards (like regular errors)
since nurseries running in sub-actors can raise them.
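Concretely the invocation machinery's catch must name it explicitly
since a plain ``except Exception`` won't match; a sketch (`chan.send()`
and `pack_error()` are stand-ins for the real IPC internals):

```python
import trio

async def _invoke(chan, cid, func, kwargs):
    # `trio.MultiError` derives from `BaseException` so a bare
    # `except Exception` would let it escape and crash the msg loop
    try:
        result = await func(**kwargs)
    except (trio.MultiError, Exception) as err:
        # ship upwards like any regular error
        await chan.send({'error': pack_error(err), 'cid': cid})
    else:
        await chan.send({'return': result, 'cid': cid})
```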