New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.
Deats,
- TL;DR + ruled-out checks confirming the procs are
ours (not piker / other tractor-embedding apps) —
`/proc/$pid/cmdline` + cwd both resolve to this
repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
(plain `os.kill(SIGKILL)` to the direct child),
not tree-scoped — grandchildren spawned during a
multi-level cancel test get reparented to init and
inherit the registry listen socket
- proposed fix directions ranked: (1) put each
forkserver-spawned subactor in its own process-
group (`os.setpgrp()` in fork-child) + tree-kill
via `os.killpg(pgid, SIGKILL)` on teardown,
(2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
`/proc/<pid>/task/*/children` walk. Vote: (1) —
POSIX-standard, aligns w/ `start_new_session=True`
semantics in `subprocess.Popen` / trio's
`open_process`
- inline reproducer + cleanup recipe scoped to
`$(pwd)/py314/bin/python.*pytest.*spawn-backend=
subint_forkserver` so cleanup doesn't false-flag
unrelated tractor procs (consistent w/
`run-tests` skill's zombie-check guidance)
Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code