Refine `subint_forkserver` nested-cancel hang diagnosis
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:
1. **5-zombie leak holding `:1616`** — turned out to
be a self-inflicted cleanup bug: `pkill`-ing a bg
pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
the SC graceful cancel cascade entirely. Codified
the real fix — SIGINT-first ladder w/ bounded
wait before SIGKILL — in e5e2afb5 (`run-tests`
SKILL) and
`feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
hangs indefinitely** — the actual backend bug,
and it's a deadlock not a leak.
Deats,
- new diagnosis: all 5 procs are kernel-`S` in
`do_epoll_wait`; pytest-main's trio-cache workers
are in `os.waitpid` waiting for children that are
themselves waiting on IPC that never arrives —
graceful `Portal.cancel_actor` cascade never
reaches its targets
- tree-structure evidence: asymmetric depth across
two identical `run_in_actor` calls — child 1
(3 threads) spawns both its grandchildren; child 2
(1 thread) never completes its first nursery
`run_in_actor`. Smells like a race on fork-
inherited state landing differently per spawn
ordering
- new hypothesis: `os.fork()` from a subactor
inherits the ROOT parent's IPC listener FDs
transitively. Grandchildren end up with three
overlapping FD sets (own + direct-parent + root),
so IPC routing becomes ambiguous. Predicts bug
scales with fork depth — matches reality: single-
level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
reaches hard-kill path), `:1616` contention (fixed
by `reg_addr` fixture wiring), GIL starvation
(each subactor has its own OS process+GIL),
child-side KBI absorption (`_trio_main` only
catches KBI at `trio.run()` callsite, reached
only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
`closerange()`, (2) `FD_CLOEXEC` + audit,
(3) targeted FD cleanup via `actor.ipc_server`
handle, (4) `os.posix_spawn` w/ `file_actions`.
Vote: (3) — surgical, doesn't break the "no exec"
design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
level-spawn tests under the backend via
`@pytest.mark.skipon_spawn_backend(...)` until
fix lands
Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
parent
e5e2afb5f4
commit
b71705bdcd
|
|
@ -1,165 +1,333 @@
|
||||||
# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
|
# `subint_forkserver` backend: `test_cancellation.py` multi-level cancel cascade hang
|
||||||
|
|
||||||
Follow-up tracker: surfaced while wiring the new
|
Follow-up tracker: surfaced while wiring the new
|
||||||
`subint_forkserver` spawn backend into the full tractor
|
`subint_forkserver` spawn backend into the full tractor
|
||||||
test matrix (step 2 of the post-backend-lands plan;
|
test matrix (step 2 of the post-backend-lands plan).
|
||||||
see also
|
See also
|
||||||
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
|
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
||||||
|
— sibling tracker for a different forkserver-teardown
|
||||||
|
class which probably shares the same fundamental root
|
||||||
|
cause (fork-FD-inheritance across nested spawns).
|
||||||
|
|
||||||
## TL;DR
|
## TL;DR
|
||||||
|
|
||||||
Running `tests/test_cancellation.py` under
|
`tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]`
|
||||||
`--spawn-backend=subint_forkserver` reproducibly leaks
|
hangs indefinitely under our new backend. The hang is
|
||||||
**exactly 5 `subint-forkserv` comm-named child processes**
|
**inside the graceful IPC cancel cascade** — every actor
|
||||||
after the pytest session exits. Both previously-run
|
in the multi-level tree parks in `epoll_wait` waiting
|
||||||
sessions produced the same 5-process signature — not a
|
for IPC messages that never arrive. Not a hard-kill /
|
||||||
flake. Each leaked process holds a `LISTEN` on the
|
tree-reap issue (we don't reach the hard-kill fallback
|
||||||
default registry TCP addr (`127.0.0.1:1616`), which
|
path at all).
|
||||||
poisons any subsequent tractor test session that
|
|
||||||
defaults to that addr.
|
|
||||||
|
|
||||||
## Stopgap (not the real fix)
|
Working hypothesis (unverified): **`os.fork()` from a
|
||||||
|
subactor inherits the root parent's IPC listener socket
|
||||||
|
FDs**. When a first-level subactor forkserver-spawns a
|
||||||
|
grandchild, that grandchild inherits both its direct
|
||||||
|
spawner's FDs AND the root's FDs — IPC message routing
|
||||||
|
becomes ambiguous (or silently sends to the wrong
|
||||||
|
channel), so the cancel cascade can't reach its target.
|
||||||
|
|
||||||
Multiple tests in `test_cancellation.py` were calling
|
## Corrected diagnosis vs. earlier draft
|
||||||
`tractor.open_nursery()` **without** passing
|
|
||||||
`registry_addrs=[reg_addr]`, i.e. falling back on the
|
|
||||||
default `:1616`. The commit accompanying this doc wires
|
|
||||||
the `reg_addr` fixture through those tests so each run
|
|
||||||
gets a session-unique port — leaked zombies can no
|
|
||||||
longer poison **other** tests (they hold their own
|
|
||||||
unique port instead).
|
|
||||||
|
|
||||||
Tests touched (in `tests/test_cancellation.py`):
|
An earlier version of this doc claimed the root cause
|
||||||
|
was **"forkserver teardown doesn't tree-kill
|
||||||
|
descendants"** (SIGKILL only reaches the direct child,
|
||||||
|
grandchildren survive and hold TCP `:1616`). That
|
||||||
|
diagnosis was **wrong**, caused by conflating two
|
||||||
|
observations:
|
||||||
|
|
||||||
- `test_cancel_infinite_streamer`
|
1. *5-zombie leak holding :1616* — happened in my own
|
||||||
- `test_some_cancels_all`
|
workflow when I aborted a bg pytest task with
|
||||||
- `test_nested_multierrors`
|
`pkill` (SIGTERM/SIGKILL, not SIGINT). The abrupt
|
||||||
- `test_cancel_via_SIGINT`
|
kill skipped the graceful `ActorNursery.__aexit__`
|
||||||
- `test_cancel_via_SIGINT_other_task`
|
cancel cascade entirely, orphaning descendants to
|
||||||
|
init. **This was my cleanup bug, not a forkserver
|
||||||
|
teardown bug.** Codified the fix (SIGINT-first +
|
||||||
|
bounded wait before SIGKILL) in
|
||||||
|
`feedback_sc_graceful_cancel_first.md` +
|
||||||
|
`.claude/skills/run-tests/SKILL.md`.
|
||||||
|
2. *`test_nested_multierrors` hangs indefinitely* —
|
||||||
|
the real, separate, forkserver-specific bug
|
||||||
|
captured by this doc.
|
||||||
|
|
||||||
This is a **suite-hygiene fix** — it doesn't close the
|
The two symptoms are unrelated. The tree-kill / setpgrp
|
||||||
actual leak; it just stops the leak from blast-radiusing.
|
fix direction proposed earlier would not help (1) (SC-
|
||||||
Zombie descendants still accumulate per run.
|
graceful-cleanup is the right answer there) and would
|
||||||
|
not help (2) (the hang is in the cancel cascade, not
|
||||||
|
in the hard-kill fallback).
|
||||||
|
|
||||||
## The real bug (unfixed)
|
## Symptom
|
||||||
|
|
||||||
`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
|
Reproducer (py3.14, clean env):
|
||||||
(plain `os.kill(SIGKILL)` to the direct child pid) +
|
|
||||||
`proc.wait()` — does **not** reap grandchildren or
|
|
||||||
deeper descendants. When a cancellation test causes a
|
|
||||||
multi-level actor tree to tear down, the direct child
|
|
||||||
dies but its own children survive and get reparented to
|
|
||||||
init (PID 1), where they stay running with their
|
|
||||||
inherited FDs (including the registry listen socket).
|
|
||||||
|
|
||||||
**Symptom on repro:**
|
```sh
|
||||||
|
# preflight: ensure clean env
|
||||||
|
ss -tlnp 2>/dev/null | grep ':1616' && echo 'FOUL — cleanup first!' || echo 'clean'
|
||||||
|
|
||||||
```
|
|
||||||
$ ss -tlnp 2>/dev/null | grep ':1616'
|
|
||||||
LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
|
|
||||||
users:(("subint-forkserv",pid=211595,fd=17),
|
|
||||||
("subint-forkserv",pid=211585,fd=17),
|
|
||||||
("subint-forkserv",pid=211583,fd=17),
|
|
||||||
("subint-forkserv",pid=211576,fd=17),
|
|
||||||
("subint-forkserv",pid=211572,fd=17))
|
|
||||||
|
|
||||||
$ for p in 211572 211576 211583 211585 211595; do
|
|
||||||
cat /proc/$p/cmdline | tr '\0' ' '; echo; done
|
|
||||||
./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
|
./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
|
||||||
tests/test_cancellation.py --timeout=30 --timeout-method=signal \
|
'tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]' \
|
||||||
--tb=no -q --no-header
|
--timeout=30 --timeout-method=thread --tb=short -v
|
||||||
... (x5, all same cmdline — inherited from fork)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
All 5 share the pytest cmdline because `os.fork()`
|
Expected: `pytest-timeout` fires at 30s with a thread-
|
||||||
without `exec()` preserves the parent's argv. Their
|
dump banner, but the process itself **remains alive
|
||||||
comm-name (`subint-forkserv`) is the `thread_name` we
|
after timeout** and doesn't unwedge on subsequent
|
||||||
pass to the fork-worker thread in
|
SIGINT. Requires SIGKILL to reap.
|
||||||
`tractor.spawn._subint_forkserver.fork_from_worker_thread`.
|
|
||||||
|
|
||||||
## Why 5?
|
## Evidence (tree structure at hang point)
|
||||||
|
|
||||||
Not confirmed; guess is 5 = the parametrize cardinality
|
All 5 processes are kernel-level `S` (sleeping) in
|
||||||
of one of the leaky tests (e.g. `test_some_cancels_all`
|
`do_epoll_wait` (trio's event loop waiting on I/O):
|
||||||
has 5 parametrize cases). Each param-case spawns a
|
|
||||||
nested tree; each leaks exactly one descendant. Worth
|
|
||||||
verifying by running each parametrize-case individually
|
|
||||||
and counting leaked procs per case.
|
|
||||||
|
|
||||||
## Ruled out
|
```
|
||||||
|
PID PPID THREADS NAME ROLE
|
||||||
- **`:1616` collision from a different repo** (e.g.
|
333986 1 2 subint-forkserv pytest main (the test body)
|
||||||
piker): `/proc/$pid/cmdline` + `cwd` both resolve to
|
333993 333986 3 subint-forkserv "child 1" spawner subactor
|
||||||
the tractor repo's `py314/` venv for all 5. These are
|
334003 333993 1 subint-forkserv grandchild errorer under child-1
|
||||||
definitively spawned by our test run.
|
334014 333993 1 subint-forkserv grandchild errorer under child-1
|
||||||
- **Parent-side `_ForkedProc.wait()` regressed**: the
|
333999 333986 1 subint-forkserv "child 2" spawner subactor (NO grandchildren!)
|
||||||
direct child's teardown completes cleanly (exit-code
|
|
||||||
captured, `waitpid` returns); the 5 survivors are
|
|
||||||
deeper-descendants whose parent-side shim has no
|
|
||||||
handle on them. So the bug isn't in
|
|
||||||
`_ForkedProc.wait()` — it's in the lack of tree-
|
|
||||||
level descendant enumeration + reaping during nursery
|
|
||||||
teardown.
|
|
||||||
|
|
||||||
## Likely fix directions
|
|
||||||
|
|
||||||
1. **Process-group-scoped spawn + tree kill.** Put each
|
|
||||||
forkserver-spawned subactor into its own process
|
|
||||||
group (`os.setpgrp()` in the fork child), then on
|
|
||||||
teardown `os.killpg(pgid, SIGKILL)` to reap the
|
|
||||||
whole tree atomically. Simplest, most surgical.
|
|
||||||
2. **Subreaper registration.** Use
|
|
||||||
`PR_SET_CHILD_SUBREAPER` on the tractor root so
|
|
||||||
orphaned grandchildren reparent to the root rather
|
|
||||||
than init — then we can `waitpid` them from the
|
|
||||||
parent-side nursery teardown. More invasive.
|
|
||||||
3. **Explicit descendant enumeration at teardown.**
|
|
||||||
In `subint_forkserver_proc`'s finally block, walk
|
|
||||||
`/proc/<pid>/task/*/children` before issuing SIGKILL
|
|
||||||
to build a descendant-pid set; then kill + reap all
|
|
||||||
of them. Fragile (Linux-only, proc-fs-scan race).
|
|
||||||
|
|
||||||
Vote: **(1)** — clean, POSIX-standard, aligns with how
|
|
||||||
`subprocess.Popen` (and by extension `trio.lowlevel.
|
|
||||||
open_process`) handle tree-kill semantics on
|
|
||||||
kwargs-supplied `start_new_session=True`.
|
|
||||||
|
|
||||||
## Reproducer
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# before: ensure clean env
|
|
||||||
ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
|
|
||||||
|
|
||||||
# run the leaky tests
|
|
||||||
./py314/bin/python -m pytest \
|
|
||||||
--spawn-backend=subint_forkserver \
|
|
||||||
tests/test_cancellation.py \
|
|
||||||
--timeout=30 --timeout-method=signal --tb=no -q --no-header
|
|
||||||
|
|
||||||
# observe: 5 leaked children now holding :1616
|
|
||||||
ss -tlnp 2>/dev/null | grep ':1616'
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output: `subint-forkserv` processes listed as
|
### Asymmetric tree depth
|
||||||
listeners on `:1616`. Cleanup:
|
|
||||||
|
|
||||||
```sh
|
The test's `spawn_and_error(breadth=2, depth=3)` should
|
||||||
pkill -9 -f \
|
have BOTH direct children spawning 2 grandchildren
|
||||||
"$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
|
each, going 3 levels deep. Reality:
|
||||||
|
|
||||||
|
- Child 1 (333993, 3 threads) DID spawn its two
|
||||||
|
grandchildren as expected — fully booted trio
|
||||||
|
runtime.
|
||||||
|
- Child 2 (333999, 1 thread) did NOT spawn any
|
||||||
|
grandchildren — clearly never completed its
|
||||||
|
nursery's first `run_in_actor`. Its 1-thread state
|
||||||
|
suggests the runtime never fully booted (no trio
|
||||||
|
worker threads for `waitpid`/IPC).
|
||||||
|
|
||||||
|
This asymmetry is the key clue: the two direct
|
||||||
|
children started identically but diverged. Probably a
|
||||||
|
race around fork-inherited state (listener FDs,
|
||||||
|
subactor-nursery channel state) that happens to land
|
||||||
|
differently depending on spawn ordering.
|
||||||
|
|
||||||
|
### Parent-side state
|
||||||
|
|
||||||
|
Thread-dump of pytest main (333986) at the hang:
|
||||||
|
|
||||||
|
- Main trio thread — parked in
|
||||||
|
`trio._core._io_epoll.get_events` (epoll_wait on
|
||||||
|
its event loop). Waiting for IPC from children.
|
||||||
|
- Two trio-cache worker threads — each parked in
|
||||||
|
`outcome.capture(sync_fn)` calling
|
||||||
|
`os.waitpid(child_pid, 0)`. These are our
|
||||||
|
`_ForkedProc.wait()` off-loads. They're waiting for
|
||||||
|
the direct children to exit — but children are
|
||||||
|
stuck in their own epoll_wait waiting for IPC from
|
||||||
|
the parent.
|
||||||
|
|
||||||
|
**It's a deadlock, not a leak:** the parent is
|
||||||
|
correctly running `soft_kill(proc, _ForkedProc.wait,
|
||||||
|
portal)` (graceful IPC cancel via
|
||||||
|
`Portal.cancel_actor()`), but the children never
|
||||||
|
acknowledge the cancel message (or the message never
|
||||||
|
reaches them through the tangled post-fork IPC).
|
||||||
|
|
||||||
|
## What's NOT the cause (ruled out)
|
||||||
|
|
||||||
|
- **`_ForkedProc.kill()` only SIGKILLs direct pid /
|
||||||
|
missing tree-kill**: doesn't apply — we never reach
|
||||||
|
the hard-kill path. The deadlock is in the graceful
|
||||||
|
cancel cascade.
|
||||||
|
- **Port `:1616` contention**: ruled out after the
|
||||||
|
`reg_addr` fixture-wiring fix; each test session
|
||||||
|
gets a unique port now.
|
||||||
|
- **GIL starvation / SIGINT pipe filling** (class-A,
|
||||||
|
`subint_sigint_starvation_issue.md`): doesn't apply
|
||||||
|
— each subactor is its own OS process with its own
|
||||||
|
GIL (not legacy-config subint).
|
||||||
|
- **Child-side `_trio_main` absorbing KBI**: grep
|
||||||
|
confirmed; `_trio_main` only catches KBI at the
|
||||||
|
`trio.run()` callsite, which is reached only if the
|
||||||
|
trio loop exits normally. The children here never
|
||||||
|
exit trio.run() — they're wedged inside.
|
||||||
|
|
||||||
|
## Hypothesis: FD inheritance across nested forks
|
||||||
|
|
||||||
|
`subint_forkserver_proc` calls
|
||||||
|
`fork_from_worker_thread()` which ultimately does
|
||||||
|
`os.fork()` from a dedicated worker thread. Standard
|
||||||
|
Linux/POSIX fork semantics: **the child inherits ALL
|
||||||
|
open FDs from the parent**, including listener
|
||||||
|
sockets, epoll fds, trio wakeup pipes, and the
|
||||||
|
parent's IPC channel sockets.
|
||||||
|
|
||||||
|
At root-actor fork-spawn time, the root's IPC server
|
||||||
|
listener FDs are open in the parent. Those get
|
||||||
|
inherited by child 1. Child 1 then forkserver-spawns
|
||||||
|
its OWN subactor (grandchild). The grandchild
|
||||||
|
inherits FDs from child 1 — but child 1's address
|
||||||
|
space still contains **the root's IPC listener FDs
|
||||||
|
too** (inherited at first fork). So the grandchild
|
||||||
|
has THREE sets of FDs:
|
||||||
|
|
||||||
|
1. Its own (created after becoming a subactor).
|
||||||
|
2. Its direct parent child-1's.
|
||||||
|
3. The ROOT's (grandparent's) — inherited transitively.
|
||||||
|
|
||||||
|
IPC message routing may be ambiguous in this tangled
|
||||||
|
state. Or a listener socket that the root thinks it
|
||||||
|
owns is actually open in multiple processes, and
|
||||||
|
messages sent to it go to an arbitrary one. That
|
||||||
|
would exactly match the observed "graceful cancel
|
||||||
|
never propagates".
|
||||||
|
|
||||||
|
This hypothesis predicts the bug **scales with fork
|
||||||
|
depth**: single-level forkserver spawn
|
||||||
|
(`test_subint_forkserver_spawn_basic`) works
|
||||||
|
perfectly, but any test that spawns a second level
|
||||||
|
deadlocks. Matches observations so far.
|
||||||
|
|
||||||
|
## Fix directions (to validate)
|
||||||
|
|
||||||
|
### 1. `close_fds=True` equivalent in `fork_from_worker_thread()`
|
||||||
|
|
||||||
|
`subprocess.Popen` / `trio.lowlevel.open_process` have
|
||||||
|
`close_fds=True` by default on POSIX — they
|
||||||
|
enumerate open FDs in the child post-fork and close
|
||||||
|
everything except stdio + any explicitly-passed FDs.
|
||||||
|
Our raw `os.fork()` doesn't. Adding the equivalent to
|
||||||
|
our `_worker` prelude would isolate each fork
|
||||||
|
generation's FD set.
|
||||||
|
|
||||||
|
Implementation sketch in
|
||||||
|
`tractor.spawn._subint_forkserver.fork_from_worker_thread._worker`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def _worker() -> None:
|
||||||
|
pid: int = os.fork()
|
||||||
|
if pid == 0:
|
||||||
|
# CHILD: close inherited FDs except stdio + the
|
||||||
|
# pid-pipe we just opened.
|
||||||
|
keep: set[int] = {0, 1, 2, rfd, wfd}
|
||||||
|
import resource
|
||||||
|
soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
|
||||||
|
os.closerange(3, soft) # blunt; or enumerate /proc/self/fd
|
||||||
|
# ... then child_target() as before
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Problem: overly aggressive — closes FDs the
|
||||||
|
grandchild might legitimately need (e.g. its parent's
|
||||||
|
IPC channel for the spawn-spec handshake, if we rely
|
||||||
|
on that). Needs thought about which FDs are
|
||||||
|
"inheritable and safe" vs. "inherited by accident".
|
||||||
|
|
||||||
|
### 2. Cloexec on tractor's own FDs
|
||||||
|
|
||||||
|
Set `FD_CLOEXEC` on tractor-created sockets (listener
|
||||||
|
sockets, IPC channel sockets, pipes). This flag
|
||||||
|
causes automatic close on `execve`, but since we
|
||||||
|
`fork()` without `exec()`, this alone doesn't help.
|
||||||
|
BUT — combined with a child-side explicit close-
|
||||||
|
non-cloexec loop, it gives us a way to mark "my
|
||||||
|
private FDs" vs. "safe to inherit". Most robust, but
|
||||||
|
requires tractor-wide audit.
|
||||||
|
|
||||||
|
### 3. Explicit FD cleanup in `_ForkedProc`/`_child_target`
|
||||||
|
|
||||||
|
Have `subint_forkserver_proc`'s `_child_target`
|
||||||
|
closure explicitly close the parent-side IPC listener
|
||||||
|
FDs before calling `_actor_child_main`. Requires
|
||||||
|
being able to enumerate "the parent's listener FDs
|
||||||
|
that the child shouldn't keep" — plausible via
|
||||||
|
`Actor.ipc_server`'s socket objects.
|
||||||
|
|
||||||
|
### 4. Use `os.posix_spawn` with explicit `file_actions`
|
||||||
|
|
||||||
|
Instead of raw `os.fork()`, use `os.posix_spawn()`
|
||||||
|
which supports explicit file-action specifications
|
||||||
|
(close this FD, dup2 that FD). Cleaner semantics, but
|
||||||
|
probably incompatible with our "no exec" requirement
|
||||||
|
(subint_forkserver is a fork-without-exec design).
|
||||||
|
|
||||||
|
**Likely correct answer: (3) — targeted FD cleanup
|
||||||
|
via `actor.ipc_server` handle.** (1) is too blunt,
|
||||||
|
(2) is too wide-ranging, (4) changes the spawn
|
||||||
|
mechanism.
|
||||||
|
|
||||||
|
## Reproducer (standalone, no pytest)
|
||||||
|
|
||||||
|
```python
|
||||||
|
# save as /tmp/forkserver_nested_hang_repro.py (py3.14+)
|
||||||
|
import trio, tractor
|
||||||
|
|
||||||
|
async def assert_err():
|
||||||
|
assert 0
|
||||||
|
|
||||||
|
async def spawn_and_error(breadth: int = 2, depth: int = 1):
|
||||||
|
async with tractor.open_nursery() as n:
|
||||||
|
for i in range(breadth):
|
||||||
|
if depth > 0:
|
||||||
|
await n.run_in_actor(
|
||||||
|
spawn_and_error,
|
||||||
|
breadth=breadth,
|
||||||
|
depth=depth - 1,
|
||||||
|
name=f'spawner_{i}_{depth}',
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
await n.run_in_actor(
|
||||||
|
assert_err,
|
||||||
|
name=f'errorer_{i}',
|
||||||
|
)
|
||||||
|
|
||||||
|
async def _main():
|
||||||
|
async with tractor.open_nursery() as n:
|
||||||
|
for i in range(2):
|
||||||
|
await n.run_in_actor(
|
||||||
|
spawn_and_error,
|
||||||
|
name=f'top_{i}',
|
||||||
|
breadth=2,
|
||||||
|
depth=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
from tractor.spawn._spawn import try_set_start_method
|
||||||
|
try_set_start_method('subint_forkserver')
|
||||||
|
with trio.fail_after(20):
|
||||||
|
trio.run(_main)
|
||||||
|
```
|
||||||
|
|
||||||
|
Expected (current): hangs on `trio.fail_after(20)`
|
||||||
|
— children never ack the error-propagation cancel
|
||||||
|
cascade. Pattern: top 2 direct children, 4
|
||||||
|
grandchildren, 1 errorer deadlocks while trying to
|
||||||
|
unwind through its parent chain.
|
||||||
|
|
||||||
|
After fix: `trio.TooSlowError`-free completion; the
|
||||||
|
root's `open_nursery` receives the
|
||||||
|
`BaseExceptionGroup` containing the `AssertionError`
|
||||||
|
from the errorer and unwinds cleanly.
|
||||||
|
|
||||||
|
## Stopgap (landed)
|
||||||
|
|
||||||
|
Until the fix lands, `test_nested_multierrors` +
|
||||||
|
related multi-level-spawn tests can be skip-marked
|
||||||
|
under `subint_forkserver` via
|
||||||
|
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
|
||||||
|
reason='...')`. Cross-ref this doc.
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
|
- `tractor/spawn/_subint_forkserver.py::fork_from_worker_thread`
|
||||||
— the current teardown shim; PID-scoped, not tree-
|
— the primitive whose post-fork FD hygiene is
|
||||||
scoped.
|
probably the culprit.
|
||||||
- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
|
- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
|
||||||
— the spawn backend whose `finally` block needs the
|
— the backend function that orchestrates the
|
||||||
tree-kill fix.
|
graceful cancel path hitting this bug.
|
||||||
- `tests/test_cancellation.py` — the surface where the
|
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
|
||||||
leak surfaces.
|
— the `trio.Process`-compatible shim; NOT the
|
||||||
|
failing component (confirmed via thread-dump).
|
||||||
|
- `tests/test_cancellation.py::test_nested_multierrors`
|
||||||
|
— the test that surfaced the hang.
|
||||||
- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
||||||
— sibling tracker for a different forkserver-teardown
|
— sibling hang class; probably same underlying
|
||||||
class (orphaned child doesn't respond to SIGINT); may
|
fork-FD-inheritance root cause.
|
||||||
share root cause with this one once the fix lands.
|
|
||||||
- tractor issue #379 — subint backend tracking.
|
- tractor issue #379 — subint backend tracking.
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue