Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.
Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.
Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).
`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.
Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
hint in the rendered dump header so users know where
to find the artifact even when stdio is captured.
- swap the inline `f' |_{actor}'` line for a
`_pformat.nest_from_op` rendering of `actor_repr`
(matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
branches now note `(trio_token captured: <bool>)`
so it's obvious from the log whether the full-tree
path is wired.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two cleanup tweaks in `_main_thread_forkserver`:
Doc, "what survives the fork?" section — expand the
"non-calling threads are gone in the child" claim with
the precise execution-vs-memory split that reconciles
this module's prior framing with trio's (canonical
[python-trio/trio#1614][trio-1614]) "leaked stacks"
framing:
- execution-side: only the calling thread runs
post-fork; all others never execute another
instruction.
- memory-side: those non-running threads' stacks +
per-thread heap structures are still COW-inherited
as orphaned bytes — what trio means by "leaked".
Same POSIX reality, opposite sides; the table is
extended to a 4-col `parent | child (executing) |
child (memory)` layout to make both views explicit.
Also blank-line-padded the bulleted hazard classes
for cleaner markdown rendering.
[trio-1614]: https://github.com/python-trio/trio/issues/1614
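The execution-side claim is directly observable in a stdlib sketch (Linux/POSIX only; relies on CPython's post-fork hooks resetting `threading`'s bookkeeping in the child so only the calling thread shows as alive):

```python
import os
import threading
import time

# parent: spin up a second thread so there is something to "lose" at fork
side = threading.Thread(target=time.sleep, args=(30,), daemon=True)
side.start()
assert threading.active_count() >= 2

pid = os.fork()
if pid == 0:
    # child: the side thread's stack bytes are still COW-mapped in our
    # address space ("leaked" in trio's framing), but the thread itself
    # never executes another instruction — post-fork hooks leave only
    # the calling thread registered
    os._exit(0 if threading.active_count() == 1 else 1)

# parent: reap and record the child's verdict
_, status = os.waitpid(pid, 0)
child_saw_one_thread = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```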
Code, `_close_inherited_fds()` log noise — split the
catch-all `except OSError` into:
- `EBADF` — benign race where the dirfd that
`os.listdir('/proc/self/fd')` itself opened ends up
in `candidates`, then auto-closes before the loop
reaches it. Demote to `log.debug()` + `continue`;
prior `log.exception` drowned the post-fork log
channel with stack traces every spawn.
- other errnos (EIO / EPERM / EINTR / ...) keep the
loud `log.exception` surface — those ARE genuinely
unexpected.
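A minimal sketch of the errno split (hypothetical standalone helper; the real loop lives inside `_close_inherited_fds()`):

```python
import errno
import logging
import os

log = logging.getLogger(__name__)

def close_candidate_fds(candidates: list[int]) -> tuple[list[int], list[int]]:
    # returns (closed, benign_ebadf) for caller-side reporting
    closed: list[int] = []
    benign: list[int] = []
    for fd in candidates:
        try:
            os.close(fd)
            closed.append(fd)
        except OSError as e:
            if e.errno == errno.EBADF:
                # benign race: fd (eg. the listdir dirfd) already closed
                log.debug('fd %d already closed', fd)
                benign.append(fd)
                continue
            # EIO / EPERM / EINTR / ... stay loud — genuinely unexpected
            log.exception('unexpected errno closing fd %d', fd)
    return closed, benign
```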
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.
Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
* exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
inherit it via environ,
* installs the handler in pytest itself via
`enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
existing `_debug_mode` gate to ALSO fire when
`TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
install the same handler at runtime startup.
Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:
- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
always written) — guaranteed-readable artifact even
under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
tty; ignored if device is missing.
Other,
- squelch the benign `RuntimeWarning` ("coroutine method
'asend'/'athrow' was never awaited") from
`stackscope._glue`'s import-time async-gen type
introspection so `--enable-stackscope` setup stays
quiet.
- log msg in the `_runtime` ImportError branch now
mentions `--enable-stackscope` alongside debug-mode.
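The squelch is a standard `warnings` filter; a stdlib sketch of the suppression (`_probe` is a stand-in for the async-gen methods `stackscope._glue` introspects at import time):

```python
import gc
import warnings

async def _probe():
    # stand-in for the coroutine objects created during introspection
    pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # drop only the benign never-awaited class; everything else surfaces
    warnings.filterwarnings(
        'ignore',
        message=r'.*was never awaited',
        category=RuntimeWarning,
    )
    coro = _probe()  # created, deliberately never awaited
    del coro         # dealloc triggers the warning (now filtered)
    gc.collect()

leaked = [w for w in caught if issubclass(w.category, RuntimeWarning)]
```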
Usage,
pytest --enable-stackscope -k <hang-test>
# in another shell, find the pid + signal:
kill -USR1 <pytest-or-subactor-pid>
# tail the artifact:
tail -f /tmp/tractor-stackscope-<pid>.log
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Mirror `060f7d24`'s pattern (backend-aware timeout in
`maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard
`trio.fail_after` cap. Fork-based backends pay per-spawn
fork+IPC-handshake cost which stacks over `cpus - 1`
sequential `n.run_in_actor()` calls; empirically 12s
flakes on `main_thread_forkserver` under UDS
cross-pytest contention (#451 / #452).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 12s (unchanged)
Hoist the timeout-pick out of the `main()` closure so the
dispatch happens once in the trio task rather than
re-evaluating per spawn.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 3s (unchanged)
Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):
- 3s → all-valid variant flaked w/ `TooSlowError`
- 8s → `invalid-return` variant flaked w/ `Cancelled`
(surfaced instead of `MsgTypeError` bc the
outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention
30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.
Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. Recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.
Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.
For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.
For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.
`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.
ALSO, bump `xonsh` to at least the `0.23.0` release.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.
Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:
- `method='signal'` (SIGALRM) synchronously raises `Failed`
in trio's main thread mid-`epoll.poll()`, leaving
`GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
abandoned") so EVERY subsequent `trio.run()` in the same
pytest process bails with
`RuntimeError: Attempted to call run() from inside a run()`
— full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
can let the KBI escape trio's `KIManager` under fork-
cascade teardown races and bubble out of pytest entirely
— kills the whole session.
`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
global state corruption either way.
Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.
Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.
Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...
Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.
Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".
Switch to per-call `random.randint()` salted with `os.getpid()`
so:
- within one session: two calls return distinct ports — e.g.
`test_tpt_bind_addrs::bind-subset-reg` now actually gets two
different reg addrs on the TCP backend (it was silently
duplicating before),
- across parallel sessions: pid salt biases each process's
port choices apart, making cross-run collisions
vanishingly rare.
Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.
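A sketch of the per-call, pid-salted pick (the exact salt formula here is illustrative, not tractor's):

```python
import os
import random

def rando_reg_port() -> int:
    # evaluated per *call* (not at import time!), so two calls in one
    # session can differ, and the pid salt biases parallel pytest
    # sessions into mostly-disjoint ranges
    salt = os.getpid() % 1000
    return 2000 + salt + random.randint(0, 6999)
```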
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier
regression guard that pins down the variant-2 reservation
contract: the `'subint_forkserver'` key in
`_spawn._methods` MUST raise `NotImplementedError` today,
not silently dispatch to `main_thread_forkserver_proc`.
The transient alias-state existed briefly during the rename
(commit `57dae0e4`'s "Split forkserver backend into variant
1/2 mods" landed the alias; `5e83881f` flipped it to the
stub). Without a guard, a future refactor could easily
re-collapse the two keys back to a single coro and silently
break the variant-1 / variant-2 contract.
Also asserts the stub's error msg surfaces the two pointers
an operator hitting it actually needs:
- `'main_thread_forkserver'` — the working backend they
prolly meant,
- `'msgspec#1026'` — the upstream blocker that has to land
before variant-2 can ship.
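The guard's shape, with a hypothetical stand-in stub (the real key lives in `_spawn._methods` and the real test uses `pytest.raises`):

```python
import asyncio

async def subint_forkserver_proc(*args, **kwargs):
    # stand-in for the reserved variant-2 stub
    raise NotImplementedError(
        "'subint_forkserver' is reserved; use 'main_thread_forkserver' "
        "(variant-2 blocked on msgspec#1026)"
    )

def check_reservation_contract() -> str:
    # the key MUST error loudly, never silently dispatch to variant-1
    try:
        asyncio.run(subint_forkserver_proc())
    except NotImplementedError as exc:
        msg = str(exc)
    else:
        raise AssertionError('stub silently succeeded!')
    # the two operator pointers the test pins down
    assert 'main_thread_forkserver' in msg
    assert 'msgspec#1026' in msg
    return msg
```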
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:
- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
`main_thread_forkserver`, drop the stub-only `subint_forkserver`
entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
reason-text since the `mp.SharedMemory(track=False)`
+ resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
`_testing.pytest`, `_subint.py`, `pyproject.toml`,
`test_cancellation.py`, `test_registrar.py` — refs to variant-1
backend updated.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rename `tests/spawn/test_subint_forkserver.py` →
`test_main_thread_forkserver.py` and migrate its imports +
internal refs to the new canonical names:
- `fork_from_worker_thread`, `wait_child` → from
`tractor.spawn._main_thread_forkserver`.
- `run_subint_in_worker_thread` → still from `_subint_forkserver`
(variant-2 primitive).
- Module docstring + tier-3 fixture + the `*_spawn_basic` test fn
renamed for variant-1-honesty.
- Orphan-harness subprocess argv flipped from `'subint_forkserver'`
→ `'main_thread_forkserver'`.
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split
the same way.
`tractor/spawn/_subint_forkserver.py` drops the backward-compat
re-exports of the fork primitives — the only consumers (test file
+ smoketest) now import from `_main_thread_forkserver` directly.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reduce `_subint_forkserver.py` to its variant-2 placeholder shape:
- Add `subint_forkserver_proc` async stub raising `NotImplementedError`
with a redirect msg pointing at the working variant-1 backend
(`main_thread_forkserver`), jcrist/msgspec#1026 (upstream PEP 684
blocker), and #379 (subint umbrella).
- `tractor.spawn._spawn._methods['subint_forkserver']` now dispatches to
the stub instead of aliasing the variant-1 coroutine
— `--spawn-backend=subint_forkserver` errors cleanly.
- Drop now-dead module-scope: `ChildSigintMode`
/ `_DEFAULT_CHILD_SIGINT` defs, `_has_subints` try/except (replaced
with import from `._subint`), unused imports (`partial`, `Literal`,
`sys`, msgtypes/pretty_struct, `current_actor`,
`cancel_on_completion`/`soft_kill`, `_server` TYPE_CHECKING).
- Backward-compat re-exports of fork primitives kept until the follow-up
commit migrates external test imports.
- `tests/spawn/test_subint_forkserver.py::forkserver_spawn_method`
fixture: flip hardcoded `'subint_forkserver'`
→ `'main_thread_forkserver'` so the test still exercises the working
backend (full file rename comes in the test-import migration commit).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `subint_forkserver` name was always aspirational —
today's impl forks from a regular main-interp worker
thread and the child runs trio on its own main interp;
NO subinterp anywhere in parent or child. Splitting the
backend into two clearly-named variants drops the lie:
- **variant 1** — `main_thread_forkserver` (the working
impl). New `SpawnMethodKey` literal + `_methods`
dispatch entry + `_runtime.Actor._from_parent()`
match-arm. The spawn-coro `subint_forkserver_proc`
moves to `_main_thread_forkserver` and is renamed
`main_thread_forkserver_proc()`.
- **variant 2** — `subint_forkserver` (future, reserved).
Module shrinks to a placeholder describing the
variant-2 design (subint-isolated child runtime, gated
on jcrist/msgspec#1026 + PEP 684). Today the legacy
`'subint_forkserver'` key aliases to
`main_thread_forkserver_proc` so existing
`--spawn-backend=subint_forkserver` invocations keep
working; flipped to a `NotImplementedError` stub in a
follow-up.
Deats,
- `Actor._from_parent()` spawn-method gate now accepts
both `'main_thread_forkserver'` and
`'subint_forkserver'` (both go through the
IPC-`SpawnSpec` path).
- the variant-1 spawn-coro stamps its own `SpawnSpec` /
log lines with `spawn_method='main_thread_forkserver'`
so subactor renders reflect the actual mechanism.
- docstring reorg: trio×fork hazard breakdown, POSIX
fork-survival semantics, in-process-vs-stdlib
forkserver design notes, and the TODO/cleanup section
all move from `_subint_forkserver` to
`_main_thread_forkserver` (lives with the working
code). `_subint_forkserver` keeps a tight forward-
looking doc that motivates the reserved key.
- `run_subint_in_worker_thread()` stays in
`_subint_forkserver` as the companion primitive — it's
the subint counterpart to `fork_from_worker_thread()`
and will plug into the future variant-2 spawn-coro.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Move the truly-generic main-interp-worker-thread fork primitives
(`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`,
`wait_child`, `_format_child_exit`) out of `_subint_forkserver.py` into
a sibling `_main_thread_forkserver.py` module so the primitive layer is
honestly named — none of these helpers touch a subint, they just fork
from a main-interp worker thread.
`_subint_forkserver.py` keeps its public surface intact via re-export so
any existing `from tractor.spawn._subint_forkserver import ...` callsite
still resolves.
Net: zero behavior change, preps the way for the upcoming spawn-method
key split where `main_thread_forkserver` ships as the working backend
and `subint_forkserver` becomes reserved for the future
subint-isolated-child variant (gated on jcrist/msgspec#1026).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adds a "Future arch — what subints would buy us" section to
the module docstring, complementing the prior commit's
current-state rationale. Code is unchanged.
Frames the `subint` prefix as family-naming today (no actual
subinterp is created yet), then lays out the three concrete
wins that land once jcrist/msgspec#1026 unblocks PEP 684
isolated-mode subints:
- Cheaper forks — moving the parent's `trio.run()` into a
subint shrinks the main-interp COW image the child inherits.
The main interp becomes the literal forkserver: an
intentionally-empty execution ctx whose only job is to call
`os.fork()` cleanly.
- True parallelism — per-interp GIL means the forkserver
thread on main and the trio thread on subint actually run in
parallel. Spawn latency stops stalling the trio loop.
- Multi-actor-per-process — the architectural payoff. With
per-interp-GIL subints, one process can host main + N
subint-resident actor `trio.run()`s, and `os.fork()` reverts
to the last-resort spawn (only when OS-level isolation is
actually needed). Joins the story with the in-thread
`_subint.py` backend: `subint` → in-process spawn,
`subint_forkserver` → cross-process when a real OS boundary
is required.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Major expansion of the module docstring. Code is
unchanged; this lands the architectural reasoning that
was previously implicit, plus the POSIX/trio fork
mechanics the design relies on.
New sections:
- "Design rationale" — answers two implicit questions:
(1) why a forkserver pattern at all (vs. forking
directly from a trio task), (2) why in-process (vs.
stdlib `mp.forkserver`'s sidecar process). Documents
the three costs the in-process design avoids
(sidecar lifecycle, per-spawn IPC, cold-start child)
and the tradeoffs we accept in exchange (3.14-only,
heavier than `to_thread.run_sync`).
- "Implementation status" — clarifies what's actually
landed today vs. the envisioned arch: parent's
`trio.run()` still lives on main interp (subint-
hosted root gated on jcrist/msgspec#1026). Names
why the "subint" prefix is correct anyway — same PR
series as `_subint.py` / `_subint_fork.py`.
- "What survives the fork? — POSIX semantics" — POSIX
preserves only the calling thread, so the
`trio.run()` thread is gone in the child. Includes
a small parent/child thread-survival table and
covers the four artifact classes that DO cross the
fork boundary (inherited fds, COW memory, Python
thread state, user-level locks) and how each is
handled.
- "FYI: how this dodges the `trio.run()` × `fork()`
hazards" — itemizes each class of trio process-
global state (wakeup-fd, `epoll`/`kqueue`,
threadpool, cancel scopes / nurseries, `atexit`,
foreign-language I/O) and explains how the
forkserver-thread design avoids each.
Also,
- bump the gated msgspec issue link from
`jcrist/msgspec#563` to `jcrist/msgspec#1026` (the
PEP 684 isolated-mode tracker).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two diagnostic gaps in `tractor.spawn._subint.subint_proc()` that hid
otherwise-silent failures, plus tracking-issue links on the two open
`subint_forkserver` follow-ups.
Deats,
- bootstrap-exc visibility: wrap the call to
`_interpreters.exec(interp_id, bootstrap)` with
`try/except BaseException` + `log.exception(...)`.
* Without it, an `ImportError` / `SyntaxError` raised inside the
dedicated driver thread goes only to Python's default thread
excepthook — invisible to the parent, which then waits forever on
`subint_exited.wait()`.
* `?TODO` notes `anyio`'s `to_interpreter._interp_call` +
`(retval, is_exception)` pattern as the next step for re-raising;
skipped now bc it must coordinate with the `trio.Cancelled` paths
around the existing `.wait()` calls.
- cancel-leak disambiguation: when the driver thread doesn't exit within
`_HARD_KILL_TIMEOUT`, also log `_interpreters.is_running(interp_id)`
as `subint_still_running=...` so the operator can tell "thread leaked,
subint already done" apart from "thread alive bc subint is wedged".
* pattern borrowed from `trio-parallel`'s `_sint.SintWorker.is_alive()`.
- `?TODO` near the `bootstrap` literal: future switch to
`_interpreters.set___main___attrs()` — same API `anyio`
uses in `to_interpreter._Worker.call()` — for passing
non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables,
etc).
* add cross-refs tracking issue `#379`.
Also,
- `Tracked at: [#449]` link on
`subint_forkserver_test_cancellation_leak_issue.md`.
- `Tracked at: [#450]` link on
`subint_forkserver_thread_constraints_on_pep684_issue.md`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.
Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:
- pass `registry_addrs=[reg_addr]` + `debug_mode` into
`tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
`test_log.exception('Timed out AS EXPECTED')` so the
expected timeout is logged explicitly instead of
swallowed.
Also,
- drop whitespace-only blank lines around the
`subs` param of `consumer()` and `ctx` param of
`one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
docstring to multi-line form.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Seems that when run in-suite it delays more than the so-measured "happy
path" timing; better to have no suite-global interruption than to assert
a fast single test's run.
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.
Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
+ `reap_shm()` helpers. Match criteria: regular file
under `/dev/shm`, owned by current uid, AND no live
proc has it open (mmap'd or fd-held). In-use
enumeration via `psutil.Process.memory_maps()` +
`.open_files()` — xplatform, kernel-canonical (same
answer `lsof` would give), no reliance on
tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
`NotImplementedError` outside Linux/FreeBSD bc macOS
POSIX shm has no fs-visible path (`shm_open` only)
and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
process reap) and `--shm-only` (skip process phase)
flags. `-n` dry-runs both phases. Exit code is `1`
if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
the `testing` dep group; lazy-imported in `_reap.py`
so the process-reap path stays import-clean without
it.
Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
(new section 10c) — covers match criteria + the
preservation guarantee for unrelated apps.
- flip mitigation status in
`subint_forkserver_mp_shared_memory_issue.md` from
"could extend `tractor-reap`" to "implemented", with
a note that callers should still UUID-pin shm keys to
avoid cross-session collisions.
Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.
Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
`platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
now run unconditionally:
* monkey-patch `mp.resource_tracker._resource_tracker` to a no-op
`ManTracker` subclass (empty `register` / `unregister`
/ `ensure_running`).
* return `partial(SharedMemory, track=False)` for the per-allocation
opt-out.
* belt + suspenders: even if something dodges the wrapper, the
singleton can't talk to the inherited (broken) parent fd.
- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
skip of the unlink-callback; install a `try_unlink()` wrapper that
swallows `FileNotFoundError` (sibling-already-cleaned race in
shared-key setups). Without `mp.resource_tracker` doing it for us, we
own the unlink — `actor.lifetime_stack` is the right place since
tractor already controls actor lifecycle.
- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
module-level skip-list (tests pass now). Inline comment cross-refs
the two `_mp_bs` / `_shm` workarounds.
- `ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`: heavy
rewrite — flips status from "open / unresolvable in tractor" to
"resolved, kept as decision record". Adds Resolution section, "Why
this is the right call" rationale (mp tracker is widely criticized;
tractor already owns lifecycle), trade-offs (crash-leaked segments,
lost mp leak warning), verification (7 passed under both
`subint_forkserver` and `trio` backends), and upstream issue links.
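The unlink wrapper is tiny (sketch; the real one gets pushed onto `actor.lifetime_stack`):

```python
import os

def try_unlink(path: str) -> None:
    # we own the unlink now that mp.resource_tracker is bypassed;
    # a sibling in a shared-key setup may have beaten us to it
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # benign already-cleaned race
```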
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).
Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
descriptions + link to the analysis doc
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:
- orphan-mode (default): finds PPid==1 procs
with cwd matching repo root + `python` in
cmdline.
- descendant-mode (`--parent <pid>`): scoped
sweep under a still-live supervisor.
SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).
Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.
Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
`/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
- `find_descendants(parent_pid)` for the in-session case
(PPid-direct-match while pytest is still alive).
- `find_orphans(repo_root)` for the CLI / post-mortem case (`PPid==1`
reparented to init + `cwd` filter to repo root + `python` cmdline
filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
all, poll up to `grace` for exit, SIGKILL any survivors. Returns
`(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
`tractor/_testing/pytest.py` — after `yield`, runs
`find_descendants(os.getpid())` + `reap(...)` so each pytest session
leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
commit) for the pytest-died-mid-session case where the in-session
fixture didn't get to run.
Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
module-top in `pytest.py` (was inline-imported inside
`pytest_generate_tests`), and reuse it in
`pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
mark arg is a valid spawn-method literal — catches typos at collection
time.
- inline `# ?TODO` flags running these through the `try_set_backend`
checker for stronger validation.
Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.
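The ladder itself is small enough to sketch standalone — a hypothetical
re-implementation mirroring the signature above (the shipped `_reap.py`
may differ in details like logging):

```python
import os
import signal
import time

def reap(
    pids: list[int],
    *,
    grace: float = 3.0,
    poll: float = 0.25,
) -> tuple[list[int], list[int]]:
    # 1. graceful: SIGINT everything still visible
    signalled: list[int] = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGINT)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # already gone
    # 2. bounded wait: poll for exit up to `grace` seconds.
    # NB: kill(pid, 0) counts zombies as alive — fine for the orphan
    # case since init reaps reparented children promptly.
    alive = set(signalled)
    deadline = time.monotonic() + grace
    while alive and time.monotonic() < deadline:
        for pid in list(alive):
            try:
                os.kill(pid, 0)  # existence probe, delivers no signal
            except ProcessLookupError:
                alive.discard(pid)
        if alive:
            time.sleep(poll)
    # 3. escalate: SIGKILL any survivors
    killed: list[int] = []
    for pid in alive:
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass
    return signalled, killed
```

A caller (fixture or CLI) can then report escalation via `bool(killed)`,
matching the exit-code behavior described for `scripts/tractor-reap`.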
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from
flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in
full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT
hang as the documented cause. The previous framing ("env pollution") let
the test silently pass when ordering happened to favor it; the new
framing forces an XPASS-as-FAIL the moment the underlying gap is
actually closed, so we can drop the mark intentionally instead of
accidentally.
Reason text + leading `# Known-gap test —` comment both point at
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
for the full diagnosis.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.
Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
- add `reg_addr`, `debug_mode`, `start_method`
fixture params
- `delay` now reads the `debug_mode` param directly
instead of calling `tractor.debug_mode()` (fires
slightly earlier in the test lifecycle)
- sanity assert `if debug_mode: assert
tractor.debug_mode()` after nursery open
- new print showing SIGINT target
(`send_sigint_to` + resolved pid)
- catch `trio.TooSlowError` around
`ctx.wait_for_result()` and conditionally
`pytest.xfail` when `send_sigint_to == 'child'
and start_method == 'subint_forkserver'` — the
known orphan-SIGINT limitation tracked in
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).
Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
`os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
`sys.stderr`).
- KEPT: `pytest -s` debugging behavior.
This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.
Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
rationale comment cross-ref'ing the post-mortem doc
- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
  from `test_nested_multierrors` — no longer needed.
* file-level `pytestmark` covers any residual.
- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
mark loosened from `strict=True` to `strict=False` + reason rewritten.
* it passes in isolation but is session-env-pollution sensitive
(leftover subactor PIDs competing for ports / inheriting harness
FDs).
* tolerate both outcomes until suite isolation improves.
- `test_shm`: extend the existing
`skipon_spawn_backend('subint', ...)` to also skip
`'subint_forkserver'`.
* Different root cause from the cancel-cascade class:
`multiprocessing.SharedMemory`'s `resource_tracker` + internals
    assume fresh-process state, don't survive fork-without-exec cleanly
- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
(unrelated to forkserver; just a flaky-under-load bump).
- `tractor.spawn._subint_forkserver`: inline comment-only future-work
marker right before `_actor_child_main()` describing the planned
conditional stdout/stderr-to-`/dev/null` redirect for cases where
`--capture=sys` isn't enough (no code change — the redirect logic
itself is deferred).
EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimally invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).
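The ini change itself is one line (sketched here; the real
`pyproject.toml` rationale comment runs ~20 lines):

```toml
[tool.pytest.ini_options]
addopts = [
    # capture via sys.stdout/sys.stderr ONLY: leave real fds 1,2 alone
    # so fork children never inherit pytest's capture pipes (which can
    # fill and deadlock multi-subproc teardowns).
    '--capture=sys',
]
```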
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.
Deats,
- `.claude/skills/conc-anal/SKILL.md`:
- new "Unbounded waits in cleanup paths"
section — rule: bound every `await X.wait()`
in cleanup paths with `trio.move_on_after()`
unless the setter is unconditionally
reachable. Recent example:
`ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (was unbounded,
deadlocked when any peer handler stuck)
- new "The capture-pipe-fill hang pattern"
section — mechanism, grep-pointers to the
  existing `conftest.py` guards
  (`tests/conftest.py:258`, `:316`), cross-ref to the full
post-mortem doc, and the grep-note: "if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
"Section 9: The pytest-capture hang pattern
(CHECK THIS FIRST)" with symptom / cause /
pre-existing guards to grep / three-step debug
recipe (try `-s`, lower loglevel, redirect
stdout/stderr) / signature of this bug vs. a
real code hang / historical reference
Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.
Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.
Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.
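The pipe-capacity claim is checkable from the stdlib alone (illustrative,
not suite code):

```python
import fcntl
import os

# Fill a pipe without a reader: blocking writes would wedge exactly like
# the subactors; O_NONBLOCK makes the full condition observable instead.
r, w = os.pipe()
fcntl.fcntl(w, fcntl.F_SETFL, os.O_NONBLOCK)
filled = 0
try:
    while True:
        filled += os.write(w, b'\x00' * 4096)
except BlockingIOError:
    pass  # pipe full: this is where a blocking writer hangs
print(filled)  # typically 65536 (64 KiB) on default-config Linux
os.close(r)
os.close(w)
```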
Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.
Deats,
- `subint_forkserver_test_cancellation_leak_issue.md`:
  new section "Update — VERY late: pytest
capture pipe IS the final gate" w/ DIAG timeline
showing `trio.run` fully returns, diagnosis of
pipe-fill mechanism, retrospective on the
earlier wrong ruling-out, and fix direction
(redirect subactor stdout/stderr to `/dev/null`
in fork-child prelude, conditional on
pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
rewritten to describe the capture-pipe gate
specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
orphan-SIGINT test regresses back to xfail.
Previously passed after the FD-hygiene fix,
  but the new 3s `move_on_after` bound around
  `wait_for_no_more_peers()` in `async_main`'s
teardown added up to 3s latency, pushing
orphan-subactor exit past the test's 10s poll
window. Real fix: faster orphan-side teardown
OR extend poll window to 15s
No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_peers()`
never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.
Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.
Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).
Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.
Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
wait in the teardown cascade above `async_main`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` /
`child_target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.
Fresh-run results,
- **9 processes complete the full flow**:
`trio.run RETURNED NORMALLY` → `child_target
RETURNED rc=0` → `os._exit(0)`. These are tree
LEAVES (errorers) plus their direct parents
(depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_main)`**:
  hit "about to trio.run" but never
see "trio.run RETURNED NORMALLY". These are
root + top-level spawners + one intermediate
The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:
trio.run never returns
→ _trio_main finally never runs
→ _worker never reaches os._exit(rc)
→ process never dies
→ parent's _ForkedProc.wait() blocks
→ parent's nursery hangs
→ parent's async_main hangs
→ (recurse up)
The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
in `root_tn` — but we cancel it via
`_parent_chan_cs.cancel()` in `Actor.cancel()`,
which only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken in
this path.
2. `actor_nursery._join_procs.wait()` or similar
inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
exit — but pidfd_open watch didn't fire (race
between `pidfd_open` and the child exiting?).
Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.
Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.
Ruled out,
- **capture-pipe fill** (`-s` flag changes test):
retested explicitly — `test_nested_multierrors`
hangs identically with and without `-s`. The
earlier observation was likely a competing
pytest process I had running in another session
holding registry state
- **stuck peer-chan recv that cancel can't
break**: pivot from the prior pass. With
`handle_stream_from_peer` instrumented at ENTER
/ `except trio.Cancelled:` / finally: 40
ENTERs, ZERO `trio.Cancelled` hits. Cancel never
reaches those tasks at all — the recvs are
fine, nothing is telling them to stop
Actual deadlock shape: multi-level mutual wait.
root blocks on spawner.wait()
spawner blocks on grandchild.wait()
grandchild blocks on errorer.wait()
errorer Actor.cancel() ran, but proc
never exits
`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.
Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
shielded loop's recv is stuck in a way cancel
still can't break
2. `async_main`'s post-cancel unwind has other
tasks in `root_tn` awaiting something that
never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
`_child_target` never returns
Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
confirm whether `child_target()` returns /
`os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
equivalent level to spot divergence
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Three places that previously swallowed exceptions silently now log via
`log.exception()` so they surface in the runtime log when something
weird happens — easier to track down sneaky failures in the
fork-from-worker-thread / subint-bootstrap primitives.
Deats,
- `_close_inherited_fds()`: post-fork child's per-fd `os.close()`
swallow now logs the fd that failed to close. The comment notes the
expected failure modes (already-closed-via-listdir-race,
otherwise-unclosable) — both still fine to ignore semantically, but
worth flagging in the log.
- `fork_from_worker_thread()` parent-side timeout branch: the
`os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close
failure separately before raising the `worker thread didn't return`
RuntimeError.
- `run_subint_in_worker_thread._drive()`: when
`_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`,
log the full call signature (interp_id + bootstrap) along with the
captured exception, before stashing into `err` for the outer caller.
Behavior unchanged — only adds observability.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two new sections in
`subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-
channel-loop hang:
1. **"Attempted fix (DID NOT work) — hypothesis
(3)"**: tried sync-closing peer channels' raw
socket fds from `_serve_ipc_eps`'s finally block
  socket fds from `_serve_ipc_eps`'s finally block
  (iterate `server._peers`, call
  `_chan._transport.stream.socket.close()`). Theory was that sync
close would propagate as `EBADF` /
`ClosedResourceError` into the stuck
`recv_some()` and unblock it. Result: identical
hang. Either trio holds an internal fd
reference that survives external close, or the
stuck recv isn't even the root blocker. Either
way: ruled out, experiment reverted, skip-mark
restored.
2. **"Aside: `-s` flag changes behavior for peer-
intensive tests"**: noticed
`test_context_stream_semantics.py` under
`subint_forkserver` hangs with default
`--capture=fd` but passes with `-s`
(`--capture=no`). Working hypothesis: subactors
inherit pytest's capture pipe (fds 1,2 — which
`_close_inherited_fds` deliberately preserves);
verbose subactor logging fills the buffer,
writes block, deadlock. Fix direction (if
confirmed): redirect subactor stdout/stderr to
`/dev/null` or a file in `_actor_child_main`.
Not a blocker on the main investigation;
deserves its own mini-tracker.
Both sections are diagnosis-only — no code changes
in this commit.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
  empirical state after the three nested-cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.
Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.
Deats,
- `async_main` stashes the cancel-scope returned by
`root_tn.start(...)` for the parent-chan
`process_messages` task onto the actor as
`_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
`ipc_server.cancel()` + `wait_for_shutdown()`)
calls `self._parent_chan_cs.cancel()` to
explicitly break the shield — no more waiting for
EOF delivery, unwinding proceeds deterministically
regardless of backend
- inline comments on both sites explain the mutual-
wait deadlock + why the explicit cancel is
backend-agnostic rather than a forkserver-specific
workaround
With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two coordinated improvements to the `subint_forkserver` backend:
1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
abandon_on_cancel=False)` in `_ForkedProc.wait()`
with `trio.lowlevel.wait_readable(pidfd)`. The
prior version blocked a trio cache thread on a
sync syscall — outer cancel scopes couldn't
unwedge it when something downstream got stuck.
Same pattern `trio.Process.wait()` and
`proc_waiter` (the mp backend) already use.
2. Drop the `@pytest.mark.xfail(strict=True)` from
`test_orphaned_subactor_sigint_cleanup_DRAFT` —
the test now PASSES after 0cd0b633 (fork-child
FD scrub). Same root cause as the nested-cancel
hang: inherited IPC/trio FDs were poisoning the
child's event loop. Closing them lets SIGINT
propagation work as designed.
Deats,
- `_ForkedProc.__init__` opens a pidfd via
`os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
then non-blocking `waitpid(WNOHANG)` to collect
the exit status (correct since the pidfd signal
IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
semantics) + `__del__` belt-and-braces for
unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
`# NOTE` comment explaining the historical
context + cross-ref to the conc-anal doc; test
remains in place as a regression guard
The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Implements fix-direction (1)/blunt-close-all-FDs from
b71705bd (`subint_forkserver` nested-cancel hang
diag), targeting the multi-level cancel-cascade
deadlock in
`test_nested_multierrors[subint_forkserver]`.
The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach,
but going blunt is actually the right call: after
`os.fork()`, the child immediately enters
`_actor_child_main()` which opens its OWN IPC
sockets / wakeup-fd / epoll-fd / etc. — none of the
parent's FDs are needed. Closing everything except
stdio is safe AND defends against future
listener/IPC additions to the parent inheriting
silently into children.
Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int`
helper. Linux fast-path enumerates `/proc/self/fd`;
POSIX fallback uses `RLIMIT_NOFILE` range. Matches
the stdlib `subprocess._posixsubprocess.close_fds`
strategy. Returns close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s
post-fork child prelude — runs immediately after
the pid-pipe `os.close(rfd/wfd)`, before the user
`child_target` callable executes
- docstring cross-refs the diagnosis doc + spells
out the FD-inheritance-cascade mechanism and why
the close-all approach is safe for our spawn shape
Validation pending: re-run `test_nested_multierrors[subint_forkserver]`
to confirm the deadlock is gone.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:
1. **5-zombie leak holding `:1616`** — turned out to
be a self-inflicted cleanup bug: `pkill`-ing a bg
pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
the SC graceful cancel cascade entirely. Codified
the real fix — SIGINT-first ladder w/ bounded
wait before SIGKILL — in e5e2afb5 (`run-tests`
SKILL) and
`feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
hangs indefinitely** — the actual backend bug,
and it's a deadlock not a leak.
Deats,
- new diagnosis: all 5 procs are kernel-`S` in
`do_epoll_wait`; pytest-main's trio-cache workers
are in `os.waitpid` waiting for children that are
themselves waiting on IPC that never arrives —
graceful `Portal.cancel_actor` cascade never
reaches its targets
- tree-structure evidence: asymmetric depth across
two identical `run_in_actor` calls — child 1
(3 threads) spawns both its grandchildren; child 2
(1 thread) never completes its first nursery
`run_in_actor`. Smells like a race on fork-
inherited state landing differently per spawn
ordering
- new hypothesis: `os.fork()` from a subactor
inherits the ROOT parent's IPC listener FDs
transitively. Grandchildren end up with three
overlapping FD sets (own + direct-parent + root),
so IPC routing becomes ambiguous. Predicts bug
scales with fork depth — matches reality: single-
level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
reaches hard-kill path), `:1616` contention (fixed
by `reg_addr` fixture wiring), GIL starvation
(each subactor has its own OS process+GIL),
child-side KBI absorption (`_trio_main` only
catches KBI at `trio.run()` callsite, reached
only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
`closerange()`, (2) `FD_CLOEXEC` + audit,
(3) targeted FD cleanup via `actor.ipc_server`
handle, (4) `os.posix_spawn` w/ `file_actions`.
Vote: (3) — surgical, doesn't break the "no exec"
design of `subint_forkserver`
- standalone repro added
  (`spawn_and_error(breadth=2, depth=1)` under
  `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
level-spawn tests under the backend via
`@pytest.mark.skipon_spawn_backend(...)` until
fix lands
Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).
Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
`pgrep` to poll for exit. Loop breaks early on
clean exit
- step 3: `pkill -9` only if graceful timed out, w/
a logged escalation msg so it's obvious when SC
teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
`:1616`-holding zombie that doesn't match the
cmdline pattern (find PID via `ss -tlnp`, then
`kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
unchanged
Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).
Deats,
- add `reg_addr: tuple` fixture param to:
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
two `open_nursery()` calls that previously had no
kwargs at all (in `test_cancel_via_SIGINT` and
`test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
to `test_nested_multierrors` so a hung run doesn't
wedge the whole session
Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.
Deats,
- TL;DR + ruled-out checks confirming the procs are
ours (not piker / other tractor-embedding apps) —
`/proc/$pid/cmdline` + cwd both resolve to this
repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
(plain `os.kill(SIGKILL)` to the direct child),
not tree-scoped — grandchildren spawned during a
multi-level cancel test get reparented to init and
inherit the registry listen socket
- proposed fix directions ranked: (1) put each
forkserver-spawned subactor in its own process-
group (`os.setpgrp()` in fork-child) + tree-kill
via `os.killpg(pgid, SIGKILL)` on teardown,
(2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
`/proc/<pid>/task/*/children` walk. Vote: (1) —
POSIX-standard, aligns w/ `start_new_session=True`
semantics in `subprocess.Popen` / trio's
`open_process`
- inline reproducer + cleanup recipe scoped to
`$(pwd)/py314/bin/python.*pytest.*spawn-backend=
subint_forkserver` so cleanup doesn't false-flag
unrelated tractor procs (consistent w/
`run-tests` skill's zombie-check guidance)
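Fix direction (1) can be illustrated with stdlib pieces — this is a
sketch of the proposed semantics, not the actual `_ForkedProc` code,
and the helper names are hypothetical: run the child in its own
session/process-group, then signal the whole group on teardown so
reparented grandchildren die with it.

```python
import os
import signal
import subprocess


def spawn_in_own_group(cmd: list[str]) -> subprocess.Popen:
    # `start_new_session=True` does a setsid() in the child, so the
    # child's pid doubles as the pgid for its whole descendant tree.
    return subprocess.Popen(cmd, start_new_session=True)


def tree_kill(proc: subprocess.Popen) -> None:
    '''
    Tree-scoped (not pid-scoped) kill: SIGKILL the child's whole
    process group so grandchildren can't outlive teardown.

    '''
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # group already gone
    proc.wait()
```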
Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Fork-based backends (esp. `subint_forkserver`) can
leak child actor processes on cancelled / SIGINT'd
test runs; the zombies keep the tractor default
registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`)
bound, so no subsequent session can bind, and
50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document
the pre-flight + post-cancel check as a mandatory
step 4.
Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a
bound TCP registry listener — the authoritative
check since :1616 is unique to our runtime
- `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*
_actor_child_main|subint-forkserv` for leftover
actor/forkserver procs — scoped deliberately so we
don't false-flag legit long-running tractor-
embedding apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep
using the same `$(pwd)/py*` pattern, UDS `rm -f`,
re-verify) plus a fallback for when a zombie holds
:1616 but doesn't match the pattern: `ss -tlnp` →
kill by PID
- explicit false-positive warning calling out the
`piker` case (`~/repos/piker/py*/bin/python3 -m
tractor._child ...`) so a bare `pgrep` doesn't lead
to nuking unrelated apps
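The primary `:1616` signal can also be probed without `ss`, e.g. via
a bind-attempt sketch (helper names hypothetical, not part of the
runtime):

```python
import glob
import socket


def registry_port_free(port: int = 1616) -> bool:
    '''
    Pre-flight probe: try to bind the default registry port; an
    `OSError` means some (likely zombie) proc already LISTENs there.

    '''
    try:
        with socket.socket() as s:
            s.bind(('127.0.0.1', port))
        return True
    except OSError:
        return False


def stale_uds_socks() -> list[str]:
    # stale unix-domain registry socket files left by dead runs
    return glob.glob('/tmp/registry@*.sock')
```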
Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID
from a prior session, without collateral damage to
other tractor-embedding projects on the same box.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code