Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:
1. **5-zombie leak holding `:1616`** — turned out to
be a self-inflicted cleanup bug: `pkill`-ing a bg
pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
the SC graceful cancel cascade entirely. Codified
the real fix — SIGINT-first ladder w/ bounded
wait before SIGKILL — in e5e2afb5 (`run-tests`
SKILL) and
`feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
hangs indefinitely** — the actual backend bug,
and it's a deadlock not a leak.
Deats,
- new diagnosis: all 5 procs are kernel-`S` in
`do_epoll_wait`; pytest-main's trio-cache workers
are in `os.waitpid` waiting for children that are
themselves waiting on IPC that never arrives —
graceful `Portal.cancel_actor` cascade never
reaches its targets
- tree-structure evidence: asymmetric depth across
two identical `run_in_actor` calls — child 1
(3 threads) spawns both its grandchildren; child 2
(1 thread) never completes its first nursery
`run_in_actor`. Smells like a race on fork-
inherited state landing differently per spawn
ordering
- new hypothesis: `os.fork()` from a subactor
inherits the ROOT parent's IPC listener FDs
transitively. Grandchildren end up with three
overlapping FD sets (own + direct-parent + root),
so IPC routing becomes ambiguous. Predicts bug
scales with fork depth — matches reality: single-
level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
reaches hard-kill path), `:1616` contention (fixed
by `reg_addr` fixture wiring), GIL starvation
(each subactor has its own OS process+GIL),
child-side KBI absorption (`_trio_main` only
catches KBI at `trio.run()` callsite, reached
only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
`closerange()`, (2) `FD_CLOEXEC` + audit,
(3) targeted FD cleanup via `actor.ipc_server`
handle, (4) `os.posix_spawn` w/ `file_actions`.
Vote: (3) — surgical, doesn't break the "no exec"
design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
level-spawn tests under the backend via
`@pytest.mark.skipon_spawn_backend(...)` until
fix lands
Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code