In an effort acquire more deterministic actor cancellation,
this adds a clearer and more resilient (whilst possibly a bit
slower) internal nursery structure with explicit semantics for
clarifying the task-scope shutdown sequence.
Namely, on cancellation, the explicit steps are now:
- cancel all currently running rpc tasks and wait
for them to complete
- cancel the channel server and wait for it to complete
- cancel the msg loop for the channel with the immediate parent
- de-register with arbiter if possible
- wait on remaining connections to release
- exit process
To accomplish this add a new nursery called the "service nursery" which
spawns all rpc tasks **instead of using** the "root nursery". The root
is now used solely for async launching the msg loop for the primary
channel with the parent such that it is (nearly) the last thing torn
down on cancellation.
In the future it should also be possible to have `self.cancel()` return
a result to the parent once the runtime is sure that the rest of the
shutdown is atomic; this would allow for a true unbounded shield in
`Portal.cancel_actor()`. This will likely require that the error
handling blocks in `Actor._async_main()` are moved "inside" the root
nursery block such that the msg loop with the parent truly is the last
thing to terminate.
Always shield waiting for he process and always run
``trio.Process.__aexit__()`` on teardown. This enforces
that shutdown happens to due cancellation triggered inside
the sub-actor instead of the process being killed externally
by the parent.
The real issue is if the root nursery gets cancelled prior to
de-registration with the arbiter. This doesn't seem easy to
reproduce by side effect of a KBI however that is how it was
discovered in practise.
There was code from the last de-registration fix PR that I had commented
(to do with shielding arbiter dereg steps in `Actor._async_main()`) because
the block didn't seem to make a difference under infinite streaming
tests. Turns out it **for sure** is needed under certain conditions (likely
if the actor's root nursery is cancelled prior to actor nursery exit).
This was an attempt to simulate the failure mode if you manually close the
stream **before** cancelling the containing **actor**.
More tests to come I guess.
Trio will kill subprocesses via `Process.__aexit__()` using a `finally:`
block (which, yes, will get triggered on cancellation) so we avoid that
until true process "tear down" since subactors do many things during
graceful shutdown (such as de-registering from the name discovery
system). Oddly this only seems to be an issue during cancellation of
infinite stream consumption.
Resolves#141
This truly reproduces #141. It turns out the problem only occurs when
we're cancelled in the middle of consuming "infinite streams".
Good news is this tests a lot of edge cases :)
In order to have reliable subactor startup we need the following
sequence to take place:
- connect to the parent actor, handshake and receive runtime state
- load exposed modules into memory
- start the channel server up fully using the provided bind address
- finally, start processing new messages from the parent
Add a bunch more comments to clarify all this.
- ease up on first stream test run deadline
- skip streaming tests in CI for mp backend, period
- give up on > 1 depth nested spawning with mp
- completely give up on slow spawning on windows