New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).
Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. The session-end reaper still
SIGINTs/SIGKILLs survivors on normal exit.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.
Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()`: enables skipping tests that need capsys
but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture: returns the appropriate capture
fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
wrapper's fixture injection.
Also,
- add `alert_on_finish` session fixture (terminal
bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
our plugin eventually however..
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Under `main_thread_forkserver` the bootstrapping
hook switches to `--capture=sys`, so subactor
fd-level output (tree dumps, zombie-reaper msgs)
isn't captured per-test by pexpect. Gate those
expects behind a `no_capfd` check so the test
passes on both capture modes.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Only override `tractor.log._default_loglevel` when
the flag is explicitly passed — lets per-spawn and
per-example `loglevel` kwargs take effect instead
of being clobbered by the hard-coded `'ERROR'`
default.
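A minimal sketch of the "only override when explicitly passed" idea, assuming the flag is `--ll` and parsed with `argparse`. The function name and the `current_default` parameter are hypothetical stand-ins, not the plugin's real code.

```python
import argparse


def effective_loglevel(
    argv: list[str],
    current_default: str,
) -> str:
    '''
    Return the loglevel that should be in effect after parsing `argv`.

    `current_default` is a hypothetical stand-in for whatever
    `tractor.log._default_loglevel` holds before the flag is seen.

    '''
    parser = argparse.ArgumentParser(add_help=False)
    # `default=None` is what lets us tell "flag omitted" apart
    # from "flag passed"; a hard-coded default like 'ERROR'
    # would make the two cases indistinguishable.
    parser.add_argument('--ll', dest='loglevel', default=None)
    opts, _unknown = parser.parse_known_args(argv)

    if opts.loglevel is None:
        # flag not passed: leave the existing default alone so
        # per-spawn / per-example `loglevel` kwargs still win.
        return current_default

    return opts.loglevel
```

With no flag the existing default is returned untouched; with `--ll debug` the explicit value wins.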
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Pass explicit `loglevel` to `spawn()` calls in
`test_debugger` tests — required for pexpect
pattern matching now that examples no longer
hard-code log levels.
Also,
- make `expect()` return the decoded `before` str.
- add `start_method` param + fork-backend timeout
slack (+4s) in nested-error test.
- clean up debug examples: drop unused loglevels,
rename `n` -> `an`, fix docstrings, add TODO
comments for tpt parametrize via osenv.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.
Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
/ `|_<Thread` / `|_('subactor'` prefixes to match
actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`,
`tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.
Deats,
- add `use_stackscope: bool` to `RuntimeVars`
struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
successful `stackscope` import, asserts unset
on `ImportError`
- nest stackscope init under `_debug_mode` gate
in `Actor.async_main`, check rtvar alongside
env var
- defer `maybe_init_greenback` import to its own
`use_greenback` branch
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.
Also,
- gate file/tty sink writes behind `write_file` +
`write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
new sequential parent -> relay-log -> sub ordering.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.
Also,
- load `tractor._testing.pytest` via `-p` in ini
(bc bootstrapping hooks must register before
conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
and pop env var on disable.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Wire up `find_orphaned_uds()` + `reap_uds()` from
`_reap` as a new phase-3 UDS sweep in the CLI
script. Opt-in via `--uds` (run after proc reap +
shm) or `--uds-only` (skip other phases).
Also,
- consolidate skip-proc-reap logic into a single
`skip_proc_reap` bool covering both `--shm-only`
and `--uds-only`
- extend header docstring + usage examples
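The flag consolidation above could look roughly like the following. This is a sketch only: the parser layout and the `parse_phases()` name are assumptions, and the shm phase's own opt-in flags are left out since the message doesn't spell them out.

```python
import argparse


def parse_phases(argv: list[str]) -> tuple[bool, bool]:
    '''
    Return `(skip_proc_reap, run_uds)` for the given CLI args.

    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--shm-only', action='store_true')
    parser.add_argument('--uds', action='store_true')
    parser.add_argument('--uds-only', action='store_true')
    opts = parser.parse_args(argv)

    # a single bool covers both "only" modes; either one means
    # the process-reap phase is skipped.
    skip_proc_reap: bool = opts.shm_only or opts.uds_only

    # the UDS sweep is opt-in: either as an extra phase (`--uds`)
    # or as the sole phase (`--uds-only`).
    run_uds: bool = opts.uds or opts.uds_only

    return skip_proc_reap, run_uds
```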
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:
- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
`reap_uds()` — detect `<name>@<pid>.sock` files under
`${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
lingering subactors, wait, SIGKILL survivors, then sweep orphaned
UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
snapshot sock-file dir before/after each test, warn + reap new
orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
with known-leaky teardown.
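As an illustration of the orphan-detection step, here is a stdlib-only sketch assuming POSIX `os.kill(pid, 0)` liveness probing. The helper names are simplified stand-ins for the real `_reap` functions, and the special handling of the `1616` registry sentinel is not modelled.

```python
from __future__ import annotations

import os
from pathlib import Path


def parse_uds_name(fname: str) -> tuple[str, int] | None:
    '''
    Split a `<name>@<pid>.sock` filename into `(name, pid)`,
    or return `None` when the name doesn't match that shape.

    '''
    if not fname.endswith('.sock'):
        return None
    stem: str = fname[:-len('.sock')]
    name, sep, pid_str = stem.rpartition('@')
    if not sep or not name or not pid_str.isdigit():
        return None
    pid = int(pid_str)
    if pid <= 0:
        return None
    return name, pid


def pid_is_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        # exists, but owned by another user
        return True
    return True


def find_orphaned(uds_dir: Path) -> list[Path]:
    '''
    List sock files under `uds_dir` whose binder pid is dead.

    '''
    if not uds_dir.is_dir():
        return []
    orphans: list[Path] = []
    for path in sorted(uds_dir.iterdir()):
        parsed = parse_uds_name(path.name)
        if parsed is None:
            continue
        _name, pid = parsed
        if not pid_is_alive(pid):
            orphans.append(path)
    return orphans
```

A reaper would then `unlink()` each returned path; the per-test fixture variant would diff two such snapshots taken before and after the test.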
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
In `tests/devx/conftest.py::spawn`, refactor the
fixture-internal closures so consumer tests can pass
explicit `start_method`/`loglevel` to each `_spawn()`
invocation rather than only inheriting the fixture-
scoped parametrize values.
Deats,
- promote `set_spawn_method()` and `set_loglevel()`
to take their respective values as fn params (vs
closing over the fixture-scope vars).
- give `_spawn()` `start_method=start_method` and
`loglevel: str|None = None` kwargs so callers
override one-off without re-parametrizing the
suite. NOTE: this drops the implicit fixture-
scoped `loglevel` forward — `_spawn()` callers
now must pass `loglevel=...` explicitly.
- TODO: figure out how `--ll <level>` should map to
the default (currently `None` → uses env-var or
tractor default).
- add a docstring to `_spawn()` so its role as the
consumer-facing closure is obvious from `help()`.
Also,
- `assert_before()` now returns the `.before` output
on success (was `None`); add a one-line docstring
describing the new return contract.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`main_thread_forkserver` doesn't actually need py3.14
`concurrent.interpreters` (PEP 734) — it forks from a
non-trio worker thread and runs `_trio_main` in the child,
same shape as `trio_proc`. The previous `_has_subints`
gate + subint-family `case` arm were a copy-paste error.
In `tractor.spawn._main_thread_forkserver`,
- drop the `_has_subints` import + the `RuntimeError`
raise in `main_thread_forkserver_proc()`.
- drop the now-unused `import sys` (only used by the
prior error msg).
In `tractor.spawn._spawn.try_set_start_method()`,
- pull `'main_thread_forkserver'` out of the subint-
family arm (which still gates on `_has_subints`).
- merge it into the `'trio'` arm — both set `_ctx = None`
bc neither needs an `mp.context`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
In `pyproject.toml`,
- include the `sync_pause` group from `dev`, so dev
installs ship `greenback` for `pause_from_sync()`.
Comment out per-test `@pytest.mark.timeout(...)`
markers in,
- `tests/devx/test_debugger.py`
- `tests/discovery/test_registrar.py`
- `tests/spawn/test_main_thread_forkserver.py`
- `tests/spawn/test_subint_cancellation.py`
- `tests/test_advanced_streaming.py`
- `tests/test_cancellation.py`
The global cap was already dropped (3c366cac); these
were the leftover per-test caps which now block
interactive `pdb` flows under the new spawn backends.
In `uv.lock`,
- pull `greenback` into the resolved `dev` deps
(per the `sync_pause` include above).
- catch up the prior `xonsh` editable→PyPI switch
(from the `pyproject.toml` `tool.uv.sources` edit).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.
In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
`loglevel` and stashing an `env_ll_report` to emit once the
console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
`log.get_console_log()` init *up* so the env-var report
routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
`start_method` and warn-logging when the env-var clobbers
an explicit caller value.
Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
`pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
`main_thread_forkserver` and `subint_forkserver` bc
example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
re-route a later in-process tractor test's spawn-backend
or loglevel.
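The override precedence described above can be sketched as a pure function. Only the two env-var names come from the message; the function name, the injectable `environ` mapping and the log wording are hypothetical.

```python
from __future__ import annotations

import logging
import os
from collections.abc import Mapping

log = logging.getLogger(__name__)


def apply_env_overrides(
    loglevel: str | None,
    start_method: str | None,
    environ: Mapping[str, str] = os.environ,
) -> tuple[str | None, str | None]:
    '''
    Let `TRACTOR_LOGLEVEL` / `TRACTOR_SPAWN_METHOD` win over
    any caller-passed values.

    '''
    env_ll: str | None = environ.get('TRACTOR_LOGLEVEL')
    if env_ll:
        loglevel = env_ll

    env_sm: str | None = environ.get('TRACTOR_SPAWN_METHOD')
    if env_sm:
        if (
            start_method is not None
            and start_method != env_sm
        ):
            # be loud when an explicit caller value is clobbered
            log.warning(
                'TRACTOR_SPAWN_METHOD=%r overrides start_method=%r',
                env_sm,
                start_method,
            )
        start_method = env_sm

    return loglevel, start_method
```

On the test side, the fixture's teardown would `os.environ.pop(name, None)` both names so nothing leaks into later in-process tests.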
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.
Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.
Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).
`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.
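A dependency-free sketch of the dispatch + fallback shape: `trio_token` is duck-typed here (anything with a `run_sync_soon(fn)` method, as `trio.lowlevel.TrioToken` has), and `make_safe()` / `dispatch_dump()` are hypothetical names rather than the module's actual helpers.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def make_safe(fn: Callable[[], None]) -> Callable[[], None]:
    '''
    Wrap `fn` so that any exception is logged instead of
    propagating out of a scheduled callback.

    '''
    def _safe() -> None:
        try:
            fn()
        except Exception:
            log.exception('Task-tree dump failed; run kept alive')

    return _safe


def dispatch_dump(
    trio_token,  # `None` or an object with `.run_sync_soon()`
    dump_fn: Callable[[], None],
) -> str:
    '''
    Called from the signal handler: schedule the dump onto the
    event loop when a token was captured, else call it directly.

    '''
    if trio_token is None:
        # no loop token (pre-`trio.run` or after it returned):
        # best-effort direct call from the handler's frame.
        dump_fn()
        return 'direct'

    # the callback runs later, from inside the loop's own
    # context, so it must never raise.
    trio_token.run_sync_soon(make_safe(dump_fn))
    return 'scheduled'
```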
Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
hint in the rendered dump header so users know where
to find the artifact even when stdio is captured.
- swap the inline `f' |_{actor}'` line for a
`_pformat.nest_from_op` rendering of `actor_repr`
(matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
branches now note `(trio_token captured: <bool>)`
so it's obvious from the log whether the full-tree
path is wired.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two cleanup tweaks in `_main_thread_forkserver`:
Doc, "what survives the fork?" section — expand the
"non-calling threads are gone in the child" claim with
the precise execution-vs-memory split that reconciles
this module's prior framing with trio's (canonical
[python-trio/trio#1614][trio-1614]) "leaked stacks"
framing:
- execution-side: only the calling thread runs
post-fork; all others never execute another
instruction.
- memory-side: those non-running threads' stacks +
per-thread heap structures are still COW-inherited
as orphaned bytes — what trio means by "leaked".
Same POSIX reality, opposite sides; the table is
extended to a 4-col `parent | child (executing) |
child (memory)` layout to make both views explicit.
Also blank-line-padded the bulleted hazard classes
for cleaner markdown rendering.
[trio-1614]: https://github.com/python-trio/trio/issues/1614
Code, `_close_inherited_fds()` log noise — split the
catch-all `except OSError` into:
- `EBADF` — benign race where the dirfd that
`os.listdir('/proc/self/fd')` itself opened ends up
in `candidates`, then auto-closes before the loop
reaches it. Demote to `log.debug()` + `continue`;
prior `log.exception` drowned the post-fork log
channel with stack traces every spawn.
- other errnos (EIO / EPERM / EINTR / ...) keep the
loud `log.exception` surface — those ARE genuinely
unexpected.
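The errno split might look like this in isolation. It is a sketch, not the module's code: the function name, the return value and the log wording are assumptions.

```python
import errno
import logging
import os
from collections.abc import Iterable

log = logging.getLogger(__name__)


def close_fds(candidates: Iterable[int]) -> int:
    '''
    Close each fd in `candidates`, returning how many closed
    successfully.

    '''
    closed: int = 0
    for fd in candidates:
        try:
            os.close(fd)
        except OSError as oserr:
            if oserr.errno == errno.EBADF:
                # benign: the fd was already gone by the time
                # we reached it (e.g. a transient dirfd).
                log.debug('fd %d already closed, skipping', fd)
                continue

            # anything else is genuinely unexpected: keep it loud.
            log.exception('Failed to close inherited fd %d', fd)
        else:
            closed += 1

    return closed
```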
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.
Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
  * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
    inherit it via environ,
  * installs the handler in pytest itself via
    `enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
existing `_debug_mode` gate to ALSO fire when
`TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
install the same handler at runtime startup.
Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:
- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
always written) — guaranteed-readable artifact even
under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
tty; ignored if device is missing.
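A rough sketch of the two-sink tee, assuming a POSIX host. The `tee_dump()` name and the `log_dir` parameter are hypothetical (the message hard-codes `/tmp`); only the filename pattern and the sink semantics come from the text above.

```python
import os
from pathlib import Path


def tee_dump(
    full_dump: str,
    log_dir: Path = Path('/tmp'),
) -> Path:
    '''
    Write an already-rendered dump to a per-pid log file and,
    best-effort, to the controlling tty.

    '''
    fpath: Path = log_dir / f'tractor-stackscope-{os.getpid()}.log'

    # sink 1: always written, append-mode so repeated signals
    # accumulate in one `tail -f`-able file.
    with open(fpath, 'a') as f:
        f.write(full_dump)

    # sink 2: the tty bypasses pytest's fd-capture; silently
    # skip when there is no controlling terminal (CI, nohup).
    try:
        with open('/dev/tty', 'w') as tty:
            tty.write(full_dump)
    except OSError:
        pass

    return fpath
```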
Other,
- squelch the benign `RuntimeWarning` ("coroutine method
'asend'/'athrow' was never awaited") from
`stackscope._glue`'s import-time async-gen type
introspection so `--enable-stackscope` setup stays
quiet.
- log msg in the `_runtime` ImportError branch now
mentions `--enable-stackscope` alongside debug-mode.
Usage,
    pytest --enable-stackscope -k <hang-test>
    # in another shell, find the pid + signal:
    kill -USR1 <pytest-or-subactor-pid>
    # tail the artifact:
    tail -f /tmp/tractor-stackscope-<pid>.log
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Mirror `060f7d24`'s pattern (backend-aware timeout in
`maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard
`trio.fail_after` cap. Fork-based backends pay per-spawn
fork+IPC-handshake cost which stacks over `cpus - 1`
sequential `n.run_in_actor()` calls; empirically 12s
flakes on `main_thread_forkserver` under UDS
cross-pytest contention (#451 / #452).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 12s (unchanged)
Hoist the timeout-pick out of the `main()` closure so the
dispatch happens once in the trio task rather than
re-evaluating per spawn.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 3s (unchanged)
Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):
- 3s → all-valid variant flaked w/ `TooSlowError`
- 8s → `invalid-return` variant flaked w/ `Cancelled`
(surfaced instead of `MsgTypeError` bc the
outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention
30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.
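The default-picking logic reduces to something like the following sketch. `pick_timeout()` is a hypothetical name; the 30s / 3s values are the ones listed above.

```python
from __future__ import annotations


def pick_timeout(
    start_method: str,
    timeout: int | None = None,
) -> int:
    '''
    Return the caller's explicit `timeout` when given, else a
    backend-aware default.

    '''
    if timeout is not None:
        # explicit opt-out of the backend-aware default
        return timeout

    if start_method == 'main_thread_forkserver':
        # fork + IPC handshake + error-path teardown needs
        # real headroom under contention.
        return 30

    return 3
```

The returned value would then feed a `trio.fail_after()` scope inside the helper.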
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.
Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. Recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.
Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.
For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.
For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.
`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.
ALSO, bump `xonsh` to at least `0.23.0` release.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.
Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:
- `method='signal'` (SIGALRM) synchronously raises `Failed`
in trio's main thread mid-`epoll.poll()`, leaving
`GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
abandoned") so EVERY subsequent `trio.run()` in the same
pytest process bails with
`RuntimeError: Attempted to call run() from inside a run()`
— full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
can let the KBI escape trio's `KIManager` under fork-
cascade teardown races and bubble out of pytest entirely
— kills the whole session.
`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
global state corruption either way.
Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.
Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.
Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...
Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.
Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".
Switch to per-call `random.randint()` salted with `os.getpid()`
so:
- within one session: two calls return distinct ports — e.g.
`test_tpt_bind_addrs::bind-subset-reg` now actually gets two
different reg addrs on the TCP backend (it was silently
duplicating before),
- across parallel sessions: pid salt biases each process's
port choices apart, making cross-run collisions
vanishingly rare.
Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.
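To make the import-time pitfall concrete, here is a small sketch contrasting the two shapes. The exact salting formula is an assumption (the message only says the pick is "salted with `os.getpid()`"), and both function names are illustrative.

```python
import os
import random


def frozen_port(
    # default-arg expressions are evaluated ONCE, when the
    # `def` statement runs, so every call sees the same value.
    _rando_port: int = random.randint(1000, 9999),
) -> int:
    return _rando_port


def get_rando_port() -> int:
    '''
    Pick a fresh port on every call, offset by this process's
    pid (one possible way to apply a pid salt).

    '''
    base: int = random.randint(1000, 9999)
    # fold the pid in, then map back into the 1000..9999 range
    return (base + os.getpid()) % 9000 + 1000
```

`frozen_port()` always returns the same number within a process, which is exactly the per-process-singleton behavior described above.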
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier
regression guard that pins down the variant-2 reservation
contract: the `'subint_forkserver'` key in
`_spawn._methods` MUST raise `NotImplementedError` today,
not silently dispatch to `main_thread_forkserver_proc`.
The transient alias-state existed briefly during the rename
(commit `57dae0e4`'s "Split forkserver backend into variant
1/2 mods" landed the alias; `5e83881f` flipped it to the
stub). Without a guard, a future refactor could easily
re-collapse the two keys back to a single coro and silently
break the variant-1 / variant-2 contract.
Also asserts the stub's error msg surfaces the two pointers
an operator hitting it actually needs:
- `'main_thread_forkserver'` — the working backend they
prolly meant,
- `'msgspec#1026'` — the upstream blocker that has to land
before variant-2 can ship.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:
- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
`main_thread_forkserver`, drop the stub-only `subint_forkserver`
entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
reason-text since the `mp.SharedMemory(track=False)`
+ resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
`_testing.pytest`, `_subint.py`, `pyproject.toml`,
`test_cancellation.py`, `test_registrar.py` — refs to variant-1
backend updated.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rename `tests/spawn/test_subint_forkserver.py` →
`test_main_thread_forkserver.py` and migrate its imports +
internal refs to the new canonical names:
- `fork_from_worker_thread`, `wait_child` → from
`tractor.spawn._main_thread_forkserver`.
- `run_subint_in_worker_thread` → still from `_subint_forkserver`
(variant-2 primitive).
- Module docstring + tier-3 fixture + the `*_spawn_basic` test fn
renamed for variant-1-honesty.
- Orphan-harness subprocess argv flipped from `'subint_forkserver'`
→ `'main_thread_forkserver'`.
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split
the same way.
`tractor/spawn/_subint_forkserver.py` drops the backward-compat
re-exports of the fork primitives — the only consumers (test file
+ smoketest) now import from `_main_thread_forkserver` directly.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reduce `_subint_forkserver.py` to its variant-2 placeholder shape:
- Add `subint_forkserver_proc` async stub raising `NotImplementedError`
with a redirect msg pointing at the working variant-1 backend
(`main_thread_forkserver`), jcrist/msgspec#1026 (upstream PEP 684
blocker), and #379 (subint umbrella).
- `tractor.spawn._spawn._methods['subint_forkserver']` now dispatches to
the stub instead of aliasing the variant-1 coroutine
— `--spawn-backend=subint_forkserver` errors cleanly.
- Drop now-dead module-scope: `ChildSigintMode`
/ `_DEFAULT_CHILD_SIGINT` defs, `_has_subints` try/except (replaced
with import from `._subint`), unused imports (`partial`, `Literal`,
`sys`, msgtypes/pretty_struct, `current_actor`,
`cancel_on_completion`/`soft_kill`, `_server` TYPE_CHECKING).
- Backward-compat re-exports of fork primitives kept until the follow-up
commit migrates external test imports.
- `tests/spawn/test_subint_forkserver.py::forkserver_spawn_method`
fixture: flip hardcoded `'subint_forkserver'`
→ `'main_thread_forkserver'` so the test still exercises the working
backend (full file rename comes in the test-import migration commit).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `subint_forkserver` name was always aspirational —
today's impl forks from a regular main-interp worker
thread and the child runs trio on its own main interp;
NO subinterp anywhere in parent or child. Splitting the
backend into two clearly-named variants drops the lie:
- **variant 1** — `main_thread_forkserver` (the working
impl). New `SpawnMethodKey` literal + `_methods`
dispatch entry + `_runtime.Actor._from_parent()`
match-arm. The spawn-coro `subint_forkserver_proc`
moves to `_main_thread_forkserver` and is renamed
`main_thread_forkserver_proc()`.
- **variant 2** — `subint_forkserver` (future, reserved).
Module shrinks to a placeholder describing the
variant-2 design (subint-isolated child runtime, gated
on jcrist/msgspec#1026 + PEP 684). Today the legacy
`'subint_forkserver'` key aliases to
`main_thread_forkserver_proc` so existing
`--spawn-backend=subint_forkserver` invocations keep
working; flipped to a `NotImplementedError` stub in a
follow-up.
Deats,
- `Actor._from_parent()` spawn-method gate now accepts
both `'main_thread_forkserver'` and
`'subint_forkserver'` (both go through the
IPC-`SpawnSpec` path).
- the variant-1 spawn-coro stamps its own `SpawnSpec` /
log lines with `spawn_method='main_thread_forkserver'`
so subactor renders reflect the actual mechanism.
- docstring reorg: trio×fork hazard breakdown, POSIX
fork-survival semantics, in-process-vs-stdlib
forkserver design notes, and the TODO/cleanup section
all move from `_subint_forkserver` to
`_main_thread_forkserver` (lives with the working
code). `_subint_forkserver` keeps a tight forward-
looking doc that motivates the reserved key.
- `run_subint_in_worker_thread()` stays in
`_subint_forkserver` as the companion primitive — it's
the subint counterpart to `fork_from_worker_thread()`
and will plug into the future variant-2 spawn-coro.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Move the truly-generic main-interp-worker-thread fork primitives
(`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`,
`wait_child`, `_format_child_exit`) out of `_subint_forkserver.py` into
a sibling `_main_thread_forkserver.py` module so the primitive layer is
honestly named — none of these helpers touch a subint, they just fork
from a main-interp worker thread.
`_subint_forkserver.py` keeps its public surface intact via re-export so
any existing `from tractor.spawn._subint_forkserver import ...` callsite
still resolves.
Net: zero behavior change, preps the way for the upcoming spawn-method
key split where `main_thread_forkserver` ships as the working backend
and `subint_forkserver` becomes reserved for the future
subint-isolated-child variant (gated on jcrist/msgspec#1026).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adds a "Future arch — what subints would buy us" section to
the module docstring, complementing the prior commit's
current-state rationale. Code is unchanged.
Frames the `subint` prefix as family-naming today (no actual
subinterp is created yet), then lays out the three concrete
wins that land once jcrist/msgspec#1026 unblocks PEP 684
isolated-mode subints:
- Cheaper forks — moving the parent's `trio.run()` into a
subint shrinks the main-interp COW image the child inherits.
The main interp becomes the literal forkserver: an
intentionally-empty execution ctx whose only job is to call
`os.fork()` cleanly.
- True parallelism — per-interp GIL means the forkserver
thread on main and the trio thread on subint actually run in
parallel. Spawn latency stops stalling the trio loop.
- Multi-actor-per-process — the architectural payoff. With
per-interp-GIL subints, one process can host main + N
subint-resident actor `trio.run()`s, and `os.fork()` reverts
to the last-resort spawn (only when OS-level isolation is
actually needed). Joins the story with the in-thread
`_subint.py` backend: `subint` → in-process spawn,
`subint_forkserver` → cross-process when a real OS boundary
is required.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Major expansion of the module docstring. Code is
unchanged; this lands the architectural reasoning that
was previously implicit, plus the POSIX/trio fork
mechanics the design relies on.
New sections:
- "Design rationale" — answers two implicit questions:
(1) why a forkserver pattern at all (vs. forking
directly from a trio task), (2) why in-process (vs.
stdlib `mp.forkserver`'s sidecar process). Documents
the three costs the in-process design avoids
(sidecar lifecycle, per-spawn IPC, cold-start child)
and the tradeoffs we accept in exchange (3.14-only,
heavier than `to_thread.run_sync`).
- "Implementation status" — clarifies what's actually
landed today vs. the envisioned arch: parent's
`trio.run()` still lives on main interp (subint-
hosted root gated on jcrist/msgspec#1026). Names
why the "subint" prefix is correct anyway — same PR
series as `_subint.py` / `_subint_fork.py`.
- "What survives the fork? — POSIX semantics" — POSIX
preserves only the calling thread, so the
`trio.run()` thread is gone in the child. Includes
a small parent/child thread-survival table and
covers the four artifact classes that DO cross the
fork boundary (inherited fds, COW memory, Python
thread state, user-level locks) and how each is
handled.
- "FYI: how this dodges the `trio.run()` × `fork()`
hazards" — itemizes each class of trio process-
global state (wakeup-fd, `epoll`/`kqueue`,
threadpool, cancel scopes / nurseries, `atexit`,
foreign-language I/O) and explains how the
forkserver-thread design avoids each.
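
As an illustration of the POSIX thread-survival point above, here is a minimal sketch, assuming a POSIX host where `os.fork()` is available. The helper names (`count_threads_in_forked_child`, `fork_from_worker`) are hypothetical and are not part of tractor; the idle "bystander" thread only stands in for a `trio.run()` thread. Note that recent CPython versions may emit a `DeprecationWarning` when forking a multi-threaded process.

```python
import os
import threading


def count_threads_in_forked_child() -> int:
    """Fork, then report the child's `threading.active_count()` to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called `fork()` exists on this side.
        try:
            os.write(write_fd, str(threading.active_count()).encode())
        finally:
            os._exit(0)  # leave immediately, skipping inherited atexit handlers
    # Parent: read the child's report, then reap the child.
    os.close(write_fd)
    report = os.read(read_fd, 16)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return int(report)


def fork_from_worker() -> tuple[int, int]:
    """Return (parent thread count, child thread count) for a worker-thread fork."""
    stop = threading.Event()
    # Hypothetical stand-in for a long-lived runtime thread in the parent.
    bystander = threading.Thread(target=stop.wait)
    bystander.start()
    counts: list[int] = []

    def worker() -> None:
        counts.append(threading.active_count())
        counts.append(count_threads_in_forked_child())

    forker = threading.Thread(target=worker)
    forker.start()
    forker.join()
    stop.set()
    bystander.join()
    return counts[0], counts[1]
```

The parent sees at least three threads (main, bystander, worker) while the child sees only the one that forked.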
Also,
- bump the gated msgspec issue link from
`jcrist/msgspec#563` to `jcrist/msgspec#1026` (the
PEP 684 isolated-mode tracker).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two diagnostic gaps in `tractor.spawn._subint.subint_proc()` that hid
otherwise-silent failures, plus tracking-issue links on the two open
`subint_forkserver` follow-ups.
Deats,
- bootstrap-exc visibility: wrap the call to
`_interpreters.exec(interp_id, bootstrap)` with
`try/except BaseException` + `log.exception(...)`.
* Without it, an `ImportError` / `SyntaxError` raised inside the
dedicated driver thread goes only to Python's default thread
excepthook — invisible to the parent, which then waits forever on
`subint_exited.wait()`.
* `?TODO` notes `anyio`'s `to_interpreter._interp_call` +
`(retval, is_exception)` pattern as the next step for re-raising;
skipped now bc it must coordinate with the `trio.Cancelled` paths
around the existing `.wait()` calls.
- cancel-leak disambiguation: when the driver thread doesn't exit within
`_HARD_KILL_TIMEOUT`, also log `_interpreters.is_running(interp_id)`
as `subint_still_running=...` so the operator can tell "thread leaked,
subint already done" apart from "thread alive bc subint is wedged".
* pattern borrowed from `trio-parallel`'s `_sint.SintWorker.is_alive()`.
- `?TODO` near the `bootstrap` literal: future switch to
`_interpreters.set___main___attrs()` — same API `anyio`
uses in `to_interpreter._Worker.call()` — for passing
non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables,
etc).
* add a cross-ref to tracking issue `#379`.
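
The bootstrap-exc visibility item can be sketched roughly as below. This is a stdlib-only illustration, not the code in `_subint.py`: `run_in_driver_thread` and the `done` event are hypothetical names, a plain callable stands in for `_interpreters.exec(interp_id, bootstrap)`, and setting the event in `finally` is an assumption of this sketch rather than something the commit states.

```python
import logging
import threading

log = logging.getLogger(__name__)


def run_in_driver_thread(bootstrap, done: threading.Event) -> threading.Thread:
    """Run `bootstrap()` in a dedicated thread, logging any exception it raises."""

    def _driver() -> None:
        try:
            bootstrap()
        except BaseException:
            # Without this handler the error would reach only the default
            # thread excepthook and stay invisible to the waiting parent.
            log.exception("bootstrap failed in driver thread")
        finally:
            # Assumption of this sketch: always wake the waiter.
            done.set()

    thread = threading.Thread(target=_driver, daemon=True)
    thread.start()
    return thread
```

This sketch swallows the exception after logging it; re-raising to the parent is the follow-up the `?TODO` describes.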
Also,
- `Tracked at: [#449]` link on
`subint_forkserver_test_cancellation_leak_issue.md`.
- `Tracked at: [#450]` link on
`subint_forkserver_thread_constraints_on_pep684_issue.md`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.
Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:
- pass `registry_addrs=[reg_addr]` + `debug_mode` into
`tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
`test_log.exception('Timed out AS EXPECTED')` so the
expected timeout is logged explicitly instead of
swallowed.
Also,
- drop whitespace-only blank lines around the
`subs` param of `consumer()` and `ctx` param of
`one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
docstring to multi-line form.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Seems that when run in-suite it runs slower than the so-measured "happy
path" timing; better to have no suite-global interruption than to assert
a fast single-test run.
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.
Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
+ `reap_shm()` helpers. Match criteria: regular file
under `/dev/shm`, owned by current uid, AND no live
proc has it open (mmap'd or fd-held). In-use
enumeration via `psutil.Process.memory_maps()` +
`.open_files()`: cross-platform, kernel-canonical (same
answer `lsof` would give), no reliance on
tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
`NotImplementedError` outside Linux/FreeBSD bc macOS
POSIX shm has no fs-visible path (`shm_open` only)
and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
process reap) and `--shm-only` (skip process phase)
flags. `-n` dry-runs both phases. Exit code is `1`
if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
the `testing` dep group; lazy-imported in `_reap.py`
so the process-reap path stays import-clean without
it.
Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
(new section 10c) — covers match criteria + the
preservation guarantee for unrelated apps.
- flip mitigation status in
`subint_forkserver_mp_shared_memory_issue.md` from
"could extend `tractor-reap`" to "implemented", with
a note that callers should still UUID-pin shm keys to
avoid cross-session collisions.
Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.
Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
`platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
now run unconditionally:
* monkey-patch `mp.resource_tracker._resource_tracker` to a no-op
`ManTracker` subclass (empty `register` / `unregister`
/ `ensure_running`).
* return `partial(SharedMemory, track=False)` for the per-allocation
opt-out.
* belt + suspenders: even if something dodges the wrapper, the
singleton can't talk to the inherited (broken) parent fd.
- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
skip of the unlink-callback; install a `try_unlink()` wrapper that
swallows `FileNotFoundError` (sibling-already-cleaned race in
shared-key setups). Without `mp.resource_tracker` doing it for us, we
own the unlink; `actor.lifetime_stack` is the right place since
tractor already controls actor lifecycle.
- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
module-level skip-list (tests pass now). Inline comment cross-refs
the two `_mp_bs` / `_shm` workarounds.
- `ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`: heavy
rewrite — flips status from "open / unresolvable in tractor" to
"resolved, kept as decision record". Adds Resolution section, "Why
this is the right call" rationale (mp tracker is widely criticized;
tractor already owns lifecycle), trade-offs (crash-leaked segments,
lost mp leak warning), verification (7 passed under both
`subint_forkserver` and `trio` backends), and upstream issue links.
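
A minimal sketch of the unlink-ownership idea from the `_shm.py` item, assuming a `contextlib.ExitStack` as a stand-in for `actor.lifetime_stack` and a plain file path as a stand-in for a shm segment. The helper name `register_unlink` is hypothetical; only the `try_unlink()` name and the swallowed `FileNotFoundError` come from the description above.

```python
import os
from contextlib import ExitStack


def register_unlink(stack: ExitStack, path: str) -> None:
    """Schedule `path` for removal when `stack` closes, tolerating a lost race."""

    def try_unlink() -> None:
        try:
            os.unlink(path)
        except FileNotFoundError:
            # A sibling sharing the same key already cleaned it up.
            pass

    stack.callback(try_unlink)
```

Registering the same path twice is safe with this wrapper: the second callback to run finds the file gone and returns quietly.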
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).
Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
descriptions + link to the analysis doc
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:
- orphan-mode (default): finds PPid==1 procs
with cwd matching repo root + `python` in
cmdline.
- descendant-mode (`--parent <pid>`): scoped
sweep under a still-live supervisor.
SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).
Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.
Deats,
- new `tractor/_testing/_reap.py` (+230 LOC): Linux-only reaper using
`/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
- `find_descendants(parent_pid)` for the in-session case
(PPid-direct-match while pytest is still alive).
- `find_orphans(repo_root)` for the CLI / post-mortem case (`PPid==1`
reparented to init + `cwd` filter to repo root + `python` cmdline
filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
all, poll up to `grace` for exit, SIGKILL any survivors. Returns
`(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
`tractor/_testing/pytest.py` — after `yield`, runs
`find_descendants(os.getpid())` + `reap(...)` so each pytest session
leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
commit) for the pytest-died-mid-session case where the in-session
fixture didn't get to run.
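
The signal ladder described for `reap()` might look roughly like the sketch below. It is an approximation under stated assumptions: the real module inspects `/proc`, whereas this sketch probes liveness with `os.kill(pid, 0)`, and the `is_alive` parameter is a hypothetical addition (not in the described signature) so callers can plug in a different check. Note that `os.kill(pid, 0)` still succeeds for an unreaped zombie child.

```python
import os
import signal
import time


def _pid_exists(pid: int) -> bool:
    """Probe a pid with signal 0; a zombie child still counts as existing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by someone else
    return True


def reap(pids, *, grace=3.0, poll=0.25, is_alive=_pid_exists):
    """SIGINT every pid, wait up to `grace` seconds, then SIGKILL survivors."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGINT)
        except ProcessLookupError:
            continue  # already gone
        signalled.append(pid)

    deadline = time.monotonic() + grace
    survivors = [pid for pid in signalled if is_alive(pid)]
    while survivors and time.monotonic() < deadline:
        time.sleep(poll)
        survivors = [pid for pid in survivors if is_alive(pid)]

    killed = []
    for pid in survivors:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            continue  # exited between the last poll and now
        killed.append(pid)
    return signalled, killed
```

A non-empty `killed` list tells the caller that the graceful phase was not enough, which is the kind of signal the CLI exit code reports.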
Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
module-top in `pytest.py` (was inline-imported inside
`pytest_generate_tests`), and reuse it in
`pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
mark arg is a valid spawn-method literal — catches typos at collection
time.
- add an inline `# ?TODO` about running these through the
`try_set_backend` checker for stronger validation.
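
The collection-time typo check can be sketched with `typing.get_args()`. The `SpawnMethodKey` literal below is a hypothetical stand-in built from backend names mentioned in these commit messages; the real definition in `tractor.spawn._spawn` may list different keys, and `invalid_backend_names` is an illustrative helper, not part of the plugin.

```python
from typing import Literal, get_args

# Hypothetical stand-in for `tractor.spawn._spawn.SpawnMethodKey`.
SpawnMethodKey = Literal[
    "trio",
    "mp_forkserver",
    "subint",
    "subint_forkserver",
    "main_thread_forkserver",
]


def invalid_backend_names(names: list[str]) -> list[str]:
    """Return every name that is not a valid spawn-method literal."""
    valid = get_args(SpawnMethodKey)
    return [name for name in names if name not in valid]
```

A caller such as a collection hook could then assert that the returned list is empty for each mark's args, so a misspelled backend fails at collection instead of silently never skipping.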
Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from
flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in
full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT
hang as the documented cause. The previous framing ("env pollution") let
the test silently pass when ordering happened to favor it; the new
framing forces an XPASS-as-FAIL the moment the underlying gap is
actually closed, so we can drop the mark intentionally instead of
accidentally.
Reason text + leading `# Known-gap test —` comment both point at
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
for the full diagnosis.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.
Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
- add `reg_addr`, `debug_mode`, `start_method`
fixture params
- `delay` now reads the `debug_mode` param directly
instead of calling `tractor.debug_mode()` (fires
slightly earlier in the test lifecycle)
- sanity assert `if debug_mode: assert
tractor.debug_mode()` after nursery open
- new print showing SIGINT target
(`send_sigint_to` + resolved pid)
- catch `trio.TooSlowError` around
`ctx.wait_for_result()` and conditionally
`pytest.xfail` when `send_sigint_to == 'child'
and start_method == 'subint_forkserver'` — the
known orphan-SIGINT limitation tracked in
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code