Add `subint_forkserver` test-cancellation leak doc

New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Gud Boi 2026-04-23 13:58:42 -04:00
parent aa7450974c
commit d0121960b9
1 changed files with 165 additions and 0 deletions

View File

@ -0,0 +1,165 @@
# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
Follow-up tracker: surfaced while wiring the new
`subint_forkserver` spawn backend into the full tractor
test matrix (step 2 of the post-backend-lands plan;
see also
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
## TL;DR
Running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly leaks
**exactly 5 `subint-forkserv` comm-named child processes**
after the pytest session exits. Both previously-run
sessions produced the same 5-process signature — not a
flake. Each leaked process holds a `LISTEN` on the
default registry TCP addr (`127.0.0.1:1616`), which
poisons any subsequent tractor test session that
defaults to that addr.
## Stopgap (not the real fix)
Multiple tests in `test_cancellation.py` were calling
`tractor.open_nursery()` **without** passing
`registry_addrs=[reg_addr]`, i.e. falling back on the
default `:1616`. The commit accompanying this doc wires
the `reg_addr` fixture through those tests so each run
gets a session-unique port — leaked zombies can no
longer poison **other** tests (they hold their own
unique port instead).
Tests touched (in `tests/test_cancellation.py`):
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
This is a **suite-hygiene fix** — it doesn't close the
actual leak; it just stops the leak from blast-radiusing.
Zombie descendants still accumulate per run.
## The real bug (unfixed)
`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
(plain `os.kill(SIGKILL)` to the direct child pid) +
`proc.wait()` — does **not** reap grandchildren or
deeper descendants. When a cancellation test causes a
multi-level actor tree to tear down, the direct child
dies but its own children survive and get reparented to
init (PID 1), where they stay running with their
inherited FDs (including the registry listen socket).
**Symptom on repro:**
```
$ ss -tlnp 2>/dev/null | grep ':1616'
LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
users:(("subint-forkserv",pid=211595,fd=17),
("subint-forkserv",pid=211585,fd=17),
("subint-forkserv",pid=211583,fd=17),
("subint-forkserv",pid=211576,fd=17),
("subint-forkserv",pid=211572,fd=17))
$ for p in 211572 211576 211583 211585 211595; do
cat /proc/$p/cmdline | tr '\0' ' '; echo; done
./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
tests/test_cancellation.py --timeout=30 --timeout-method=signal \
--tb=no -q --no-header
... (x5, all same cmdline — inherited from fork)
```
All 5 share the pytest cmdline because `os.fork()`
without `exec()` preserves the parent's argv. Their
comm-name (`subint-forkserv`) is the `thread_name` we
pass to the fork-worker thread in
`tractor.spawn._subint_forkserver.fork_from_worker_thread`.
## Why 5?
Not confirmed; guess is 5 = the parametrize cardinality
of one of the leaky tests (e.g. `test_some_cancels_all`
has 5 parametrize cases). Each param-case spawns a
nested tree; each leaks exactly one descendant. Worth
verifying by running each parametrize-case individually
and counting leaked procs per case.
## Ruled out
- **`:1616` collision from a different repo** (e.g.
piker): `/proc/$pid/cmdline` + `cwd` both resolve to
the tractor repo's `py314/` venv for all 5. These are
definitively spawned by our test run.
- **Parent-side `_ForkedProc.wait()` regressed**: the
direct child's teardown completes cleanly (exit-code
captured, `waitpid` returns); the 5 survivors are
deeper-descendants whose parent-side shim has no
handle on them. So the bug isn't in
`_ForkedProc.wait()` — it's in the lack of tree-
level descendant enumeration + reaping during nursery
teardown.
## Likely fix directions
1. **Process-group-scoped spawn + tree kill.** Put each
forkserver-spawned subactor into its own process
group (`os.setpgrp()` in the fork child), then on
teardown `os.killpg(pgid, SIGKILL)` to reap the
whole tree atomically. Simplest, most surgical.
2. **Subreaper registration.** Use
`PR_SET_CHILD_SUBREAPER` on the tractor root so
orphaned grandchildren reparent to the root rather
than init — then we can `waitpid` them from the
parent-side nursery teardown. More invasive.
3. **Explicit descendant enumeration at teardown.**
In `subint_forkserver_proc`'s finally block, walk
`/proc/<pid>/task/*/children` before issuing SIGKILL
to build a descendant-pid set; then kill + reap all
of them. Fragile (Linux-only, proc-fs-scan race).
Vote: **(1)** — clean, POSIX-standard, aligns with how
`subprocess.Popen` (and by extension `trio.lowlevel.
open_process`) handle tree-kill semantics on
kwargs-supplied `start_new_session=True`.
## Reproducer
```sh
# before: ensure clean env
ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
# run the leaky tests
./py314/bin/python -m pytest \
--spawn-backend=subint_forkserver \
tests/test_cancellation.py \
--timeout=30 --timeout-method=signal --tb=no -q --no-header
# observe: 5 leaked children now holding :1616
ss -tlnp 2>/dev/null | grep ':1616'
```
Expected output: `subint-forkserv` processes listed as
listeners on `:1616`. Cleanup:
```sh
pkill -9 -f \
"$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
```
## References
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
— the current teardown shim; PID-scoped, not tree-
scoped.
- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
— the spawn backend whose `finally` block needs the
tree-kill fix.
- `tests/test_cancellation.py` — the surface where the
leak surfaces.
- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
— sibling tracker for a different forkserver-teardown
class (orphaned child doesn't respond to SIGINT); may
share root cause with this one once the fix lands.
- tractor issue #379 — subint backend tracking.