Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.
Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
/ `|_<Thread` / `|_('subactor'` prefixes to match
actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`
`tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit fc2e298a29)
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:
- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
`main_thread_forkserver`, drop the stub-only `subint_forkserver`
entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
reason-text since the `mp.SharedMemory(track=False)`
+ resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
`_testing.pytest`, `_subint.py`, `pyproject.toml`,
`test_cancellation.py`, `test_registrar.py` — refs to variant-1
backend updated.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 205382a39b)
(factored: dropped spawn-backend-only path: tractor/spawn/_subint.py)
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.
Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
listing the spawn backends whose subactor runtime is trio-native
(`'trio'`, `'subint_forkserver'`). Both the enable-site
(`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
key.
off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
reports the full compatible-set + the rejected method instead of the
stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
`'subint_forkserver'` to the existing `('trio', 'subint')` tuple
— fork child-side runtime receives the same SpawnSpec IPC handshake as
the others.
- `subint_forkserver_proc` child-target now passes
`spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
`Actor.pformat()` / log lines reflect the actual parent-side spawn
mechanism rather than masquerading as plain `trio`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8bcbe730bf)
Complete the xontrib-side slimming from "Mv core impl
`tractor_diag.xsh` to `_testing.trace`" (its xontrib hunk
couldn't ride with the testing-harness segment since the
xontrib lands in this one) and sync a couple `_reap.py`
issue/doc-ref comment strings to their final form.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Make the sub-actor proc-title prefix a single
authoritative constant (`_proctitle._def_prefix`) so
the reap-recognition markers and `xontrib` banner pick
it up automatically — one place to flip the prefix
shape going fwd.
Deats,
- `_proctitle._def_prefix: str = '_subactor'`. New
module-level const consumed by everything that needs
to know the prefix.
- `set_actor_proctitle(actor, prefix=_def_prefix)`:
takes an explicit `prefix` arg (default = the const)
so callers can override per-spawn if they want.
- Default proc-title format:
`'tractor[<reprol>]'` → `f'{prefix}[<reprol>]'`
i.e. `_subactor[<reprol>]` by default.
- `_testing/_reap.py`: cmdline + comm markers source
the prefix from `_proctitle._def_prefix` instead of
the hardcoded `'tractor['`. So
`_is_tractor_subactor()` tracks the const
automatically.
- `xontrib/tractor_diag.xsh`: `acli.reap` orphan-mode
banner now interpolates the
`_TRACTOR_PROC_CMDLINE_MARKERS` tuple directly so
the human-readable mode line stays in sync if the
prefix shape changes again.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3a45dbd503)
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.
Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
pkg with `ImportError` guard; optional at runtime
but listed in `pyproject.toml` so default installs
benefit.
- called early in `_child._actor_child_main()` after
`Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
unit test, `/proc/{cmdline,comm}` integration
test, negative detection test.
Resolves#457
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d60245777e)
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.
Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
`tractor._child` markers, falls back to `/proc/<pid>/comm` for
zombie-resilient detection (kernel preserves `comm` past exit
until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
— the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
`/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
`_is_tractor_subactor()` for intrinsic sub-actor ID instead of
cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
zombie-state sub-actors
Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
distinguish `system.slice` services vs `user.slice` desktop apps
from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
`user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
tractor importable from active venv
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 522b57570b)
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).
Previously, procs whose `ppid` points to some OTHER live non-`pytest`
(in the use of `acli.ptree pytest`) process belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon) and were being falsely flagged as cross-test ghosts.
Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
filter rule + its rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit a6d4ac3aac)
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 01ce2857ea)
(factored: also default `find_orphans(repo_root=None)` -> cwd so the new bare call sites work ahead of the later intrinsic-identity rewrite)
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).
Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8de684f5de)
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
populated by `_do_capture_snapshot()` on each successful dump;
add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
prints all captured snapshot dirs at end-of-session so paths
don't get buried in scrollback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit fb87c36263)
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
`tractor._child` procs NOT in the walk's `seen` set — surfaces
ghost subactor trees from prior test runs (cross-test launchpad
contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
into `_classify_walk()` closure, walk stray roots as additional
trees, emit stray-root summary in header. Also: `tractor._child`
procs reparented to init are now always classified as orphans
regardless of cgroup-slice (leaked subactor ≠ desktop-launched
app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
`--capture=sys` redirection so snapshot paths always land on the
real terminal
- `fail_after_w_trace()`: capture diag snapshot on
non-`TooSlowError` exceptions when the `fail_after` scope's
cancel had already fired (e.g. nursery wraps `Cancelled` into a
`BaseExceptionGroup` that escapes before `TooSlowError` can be
raised).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3a243a1fd4)
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.
Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
hung-state dumper, bindspace scanner, `dump_all()` snapshot
archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
(trio `fail_after` + auto-snapshot on `TooSlowError`),
`afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
`SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
`_testing.trace` — drops ~627 lines of inline impl. Add
`acli.dump_all` alias for full snapshot-bundle CLI access.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7509e313ff)
(factored: the xontrib-side move-out hunk rides with the diag-xontrib segment)
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:
- CI (`CI` env-var set): `pytest.exit(rc=2)` on
mismatch — forces every matrix-row to declare
`--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
experiment with `--capture=fd` to validate fixes.
Deats,
- drop `_cap_fd_set` global; add
`_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
spawner-name lookup
- move inline comment wall → proper docstring w/
Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
`request: pytest.FixtureRequest` and reads
`request.config.option.capture` instead of the
`_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
capture deadlocks)
- `set_fork_aware_capture()` returns the actual
capture mode str from config, not a hardcoded
`'sys'`
- lift `import warnings` to module level (was duped
inside `pytest_configure`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 255c9c3a7c)
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.
Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e4953851de)
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).
Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. Session-end reaper still
SIGINT/SIGKILL survivors on normal exit.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5cf0312c78)
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.
Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enalbes skipping tests that need capsys
but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
wrapper's fixture injection.
Also,
- add `alert_on_finish` session fixture (terminal
bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
our plugin eventually however..
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d549c72052)
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.
Also,
- load `tractor._testing.pytest` via `-p` in ini
(bc bootstrapping hooks must register before
conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
and pop env var on disable.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 61d4525137)
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:
- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
`reap_uds()` — detect `<name>@<pid>.sock` files under
`${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
lingering subactors, wait, SIGKILL survivors, then sweep orphaned
UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
snapshot sock-file dir before/after each test, warn + reap new
orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
with known-leaky teardown.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1cdc7fb302)
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.
Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...
Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.
Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b376eb0332)
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".
Switch to per-call `random.randint()` salted with `os.getpid()`
so:
- within one session: two calls return distinct ports — e.g.
`test_tpt_bind_addrs::bind-subset-reg` now actually gets two
different reg addrs on the TCP backend (it was silently
duplicating before),
- across parallel sessions: pid salt biases each process's
port choices apart, making cross-run collisions
vanishingly rare.
Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7c5dd4d033)
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.
Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
+ `reap_shm()` helpers. Match criteria: regular file
under `/dev/shm`, owned by current uid, AND no live
proc has it open (mmap'd or fd-held). In-use
enumeration via `psutil.Process.memory_maps()` +
`.open_files()` — xplatform, kernel-canonical (same
answer `lsof` would give), no reliance on
tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
`NotImplementedError` outside Linux/FreeBSD bc macOS
POSIX shm has no fs-visible path (`shm_open` only)
and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
process reap) and `--shm-only` (skip process phase)
flags. `-n` dry-runs both phases. Exit code is `1`
if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
the `testing` dep group; lazy-imported in `_reap.py`
so the process-reap path stays import-clean without
it.
Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
(new section 10c) — covers match criteria + the
preservation guarantee for unrelated apps.
- flip mitigation status in
`subint_forkserver_mp_shared_memory_issue.md` from
"could extend `tractor-reap`" to "implemented", with
a note that callers should still UUID-pin shm keys to
avoid cross-session collisions.
Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4f12d69b41)
(factored: dropped subint_forkserver conc-anal doc update)
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.
Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
`/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
- `find_descendants(parent_pid)` for the in-session case
(PPid-direct-match while pytest is still alive).
- `find_orphans(repo_root)` for the CLI / post- mortem case (`PPid==1`
reparented to init + `cwd` filter to repo root + `python` cmdline
filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
all, poll up to `grace` for exit, SIGKILL any survivors. Returns
`(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
`tractor/_testing/pytest.py` — after `yield`, runs
`find_descendants(os.getpid())` + `reap(...)` so each pytest session
leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
commit) for the pytest-died-mid-session case where the in-session
fixture didn't get to run.
Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
module-top in `pytest.py` (was inline-imported inside
`pytest_generate_tests`), and reuse it in
`pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
mark arg is a valid spawn-method literal — catches typos at collection
time.
- inline `# ?TODO` flags running these through the `try_set_backend`
checker for stronger validation.
Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit eae478f3d5)
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.
Deats,
- `pytest_configure()` registers the marker via
`addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
each collected item with `item.iter_markers(
name='skipon_spawn_backend')`, checks whether the
active `--spawn-backend` appears in `mark.args`, and
if so injects a concrete `pytest.mark.skip(
reason=...)`. `iter_markers()` makes the decorator
work at function, class, or module (`pytestmark =
[...]`) scope transparently.
- First matching mark wins; default reason is
`f'Borked on --spawn-backend={backend!r}'` if the
caller doesn't supply one.
Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3b26b59dad)
A `trio.Nursery.start()`-style wrapper around
`trio.run_process()` that surfaces rc!=0 errors
deterministically, ALWAYS isolates the parent
controlling-tty, and optionally live-relays the child's
std-streams to `log.<level>` per-line. Suits both
short-lived test-runners + long-lived daemons.
`supervise_run_process()`,
- Deterministic rc!=0: pass `check=False` to `trio`
and do our OWN post-drain rc-check from the
supervisor coro body AFTER `own_tn.__aexit__` — NOT
inside the internal nursery, since that would
race-cancel the still-draining relay reader and lose
stderr lines. (Re)build + raise a BARE
`subprocess.CalledProcessError`: `.stderr=` for
programmatic callers + an `add_note()`'d
`|_.stderr:` block for human teardown logs. No
nursery-eg-wrapped CPE to `collapse_eg` around.
- Parent controlling-tty isolation: `stdin=DEVNULL`
always, `stdout=DEVNULL` unless relayed/overridden
(via `stdout=` kwarg w/ `_UNSET` sentinel so explicit
`None` = inherit still works). Prevents a spawned
program from clobbering the launching tty's scrollback
w/ control-seqs.
- Live per-line relay: `relay_stdout=True`/
`relay_stderr=True` → relayed to `log.<relay_level>`
(default `'io'`, our custom level 21). Picked to sort
just above stdlib `INFO`=20 so it shows at usual
`info`/`devx` levels yet stays separately filterable;
`runtime`=15 was REJECTED as a default since it'd be
silently filtered at usual verbosity — footgun for
daemon supervisors whose whole point is visibility.
STREAMED, not buffered-until-exit.
- Non-blocking `tn.start()` semantics: live
`trio.Process` handed up via
`task_status.started()` immediately (else
`tn.start()` would block till child exit, losing
the long-lived-daemon use case). Supervise/relay bg
tasks run to completion in this coro.
- `**run_process_kwargs` forwarded verbatim (env, shell,
cwd, start_new_session, executable, ...); MANAGED keys
(`stdin`/`stdout`/`stderr`/`check`) win on conflict.
- Crash-handling layer intentionally NOT baked in —
compose `maybe_open_crash_handler()` ON TOP at the
call-site.
`_relay_stream_lines()` helper,
- Concurrent pipe-drain reader. MANDATORY whenever piping
w/o `capture_*` since nothing else drains the OS pipe —
child blocks on `write()` once kernel buf (~64KiB) fills
→ deadlock.
- Modes (combine freely): `emit`-only live relay,
`accum`-only silent drain+capture (for the CPE note),
or both. Per-line splitting handles cross-chunk
residuals + flushes any trailing un-newline-term'd line
at EOF.
`_add_stderr_note()` helper,
- Attaches an indented `|_.stderr:` note to a CPE via
`add_note()` for legible rc!=0 reporting at teardown.
Tests (`tests/trionics/test_subproc.py`),
- Hermetic `trio`-only (no actor-runtime).
- `test_stdout_relayed_per_line`: per-line stdout relay.
- `test_parent_tty_isolated`: child fd1 is OUR pipe (no
`/dev/pts/*`), fd0 pinned to `/dev/null`.
- `test_no_deadlock_on_big_unnewlined_output`: 200KiB
no-newline output completes under `fail_after(2)` —
exercises the concurrent drain (without it, the child
blocks at ~64KiB).
- `test_stderr_relay_and_cpe_rebuild`: rc!=0 w/
`relay_stderr=True` → bare `CalledProcessError` w/ the
`.stderr` note + per-line live relay.
- `test_nonrelay_cpe_note`: rc!=0 w/o relay → same
deterministic post-drain CPE w/ `.stderr` note (silent
drain+capture path).
Re-export `supervise_run_process` from `tractor.trionics`.
Prompt-IO: ai/prompt-io/claude/20260601T231429Z_0e3e008b_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit f595acc76c)
Follow-up to f595acc7 (`supervise_run_process`) which
called `log.io(...)` for std-stream relay assuming an
`IO=21` level existed. Add the registration via a new
factory + tests covering both the factory and the new
level.
`add_log_level()` factory,
- One call wires the four (otherwise hand-synced) pieces:
- `CUSTOM_LEVELS[NAME]` — drives the `stacklevel` bump
in `StackLevelAdapter.log()` + `get_logger()`'s
per-level audit.
- `logging.addLevelName()` — stdlib name registration.
- `STD_PALETTE[NAME]` + `BOLD_PALETTE['bold'][NAME]` —
color entries consumed by `get_console_log()`'s
`ColoredFormatter` build.
- Same-named (lowercase) emit method bound on
`StackLevelAdapter` so `log.<name>('msg')` works +
`get_logger()`'s per-level method audit passes.
- Idempotent: re-registering an existing name is a
no-op-ish refresh that won't clobber an already-bound
method.
- Method binding uses a default-arg `_level=value` so
the level int is captured (not late-bound across
multiple registrations).
`IO=21` level (first user),
- Purple. Used by `tractor.trionics._subproc`'s
std-stream relay (see f595acc7).
- Value 21 picked to sit just ABOVE stdlib `INFO`=20 so
it's SHOWN BY DEFAULT at usual `info`/`devx` console
levels — a `runtime`=15 relay would be silently
filtered (footgun for daemon supervisors whose whole
point is visibility). Still distinctly labeled +
filterable.
Tests (`tests/test_log_sys.py`),
- `test_io_custom_level_registered`: validates the IO
level is fully wired (`CUSTOM_LEVELS`, `addLevelName`,
both palettes, `StackLevelAdapter.io()` callable);
emits a record + sanity-asserts `21 >= INFO(20)`.
- `test_add_log_level_pluggable`: registers a fresh
`XLVL=19` (cyan) via `add_log_level()`, asserts all
four wires + the bound `xlog.xlvl()` emit, then
try/finally cleans up the module-global mutations so
later `get_logger()` audits don't trip on a
half-removed level.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7bd7dd50c7)
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.
- detect a true leaf-mod by comparing the caller's `__name__`
vs `__package__` (a pkg `__init__` has them equal -> its
trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
from a bare `devx` -> `tractor.devx`; the old unconditional
`pkg_path = subpkg_path` collapsed both to `tractor.devx` and
silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
the leaf-mod is in the `{filename}` header field).
Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
addressable at ANY depth; leaf *modules* intentionally aren't
(they're the `{filename}`); top-level mods (eg. `to_asyncio`)
still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
new literal explicit-`name` contract (no leaf-collapse).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9c36363b01)
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:
Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
fixture, and `test_log` fixture out of the test-suite-
local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
`--ll`: `--ll` is the consuming-project's OWN app
loglevel (scoped to its pkg-hierarchy), `--tl` is the
`tractor.*` runtime loglevel. `--tl` falls back to
`--ll` when unset (preserves current `tractor`-suite
behavior).
- add `testing_pkg_name` session fixture (default
`'tractor'`) — downstream projects override to e.g.
`'modden'` so `--ll` scopes to their own hierarchy
instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
tractor-runtime level (passed to
`open_root_actor(loglevel=<.>)` by `@tractor_test`)
AND separately applies `--ll` to the
`testing_pkg_name` hierarchy when that isn't
`tractor`. `test_log` scopes the per-test logger to
`testing_pkg_name`.
`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
- `True` → enable `pkg_name` root at `default_level`
(fallback `'cancel'`).
- `False` → no-op.
- bare level eg. `'info'` → root-logger at that level.
- `'sub:info,x:cancel'` → per-sub-logger filter-spec;
each `<name>` is RELATIVE to `pkg_name` (must NOT
include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
`None` key = root-logger. Mixed bare-level + filters
in one spec is rejected w/ a helpful err msg; so is
embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
parses then enables a `colorlog` stderr handler per
named (sub)logger. Authoritative sub-logger filters
get `propagate=False` so they don't double-emit
through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
sub-pkg granularity, not leaf-module — so `devx.debug`
collapses to the same `tractor.devx` logger as a bare
`devx`, and top-level lib modules (eg.
`tractor.to_asyncio`) emit under the *root* logger
rather than a phantom `to_asyncio` child. Documented
inline on `LogSpec`.
Other,
- `tests/conftest.py` keeps a NOTE pointing to the
plugin for future-debugging clarity (don't remove
silently — the lift is the relevant signal).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 19a77708ba)
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).
Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
conjunction around the `enable_stack_on_sig()` call;
now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
`use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
of the `if rvs['_debug_mode']:` block to peer-level.
The `use_greenback` branch stays inside `_debug_mode`
(pdb-specific).
- Refresh inline comments on both sites to call out the
infected-`asyncio` "default SIGUSR1 = terminate proc"
rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3d9c75b6ed)
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.
Deats,
- add `use_stackscope: bool` to `RuntimeVars`
struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
successful `stackscope` import, asserts unset
on `ImportError`
- nest stackscope init under `_debug_mode` gate
in `Actor.async_main`, check rtvar alongside
env var
- defer `maybe_init_greenback` import to its own
`use_greenback` branch
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 48523358cf)
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.
Also,
- gate file/tty sink writes behind `write_file` +
`write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
new sequential parent -> relay-log -> sub ordering.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e2b790a70d)
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.
Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.
Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).
`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.
Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
hint in the rendered dump header so users know where
to find the artifact even when stdio is captured.
- swap the inline `f' |_{actor}'` line for a
`_pformat.nest_from_op` rendering of `actor_repr`
(matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
branches now note `(trio_token captured: <bool>)`
so it's obvious from the log whether the full-tree
path is wired.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 2d4995e08d)
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.
Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
* exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
inherit it via environ,
* installs the handler in pytest itself via
`enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
existing `_debug_mode` gate to ALSO fire when
`TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
install the same handler at runtime startup.
Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:
- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
always written) — guaranteed-readable artifact even
under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
tty; ignored if device is missing.
Other,
- squelch the benign `RuntimeWarning` ("coroutine method
'asend'/'athrow' was never awaited") from
`stackscope._glue`'s import-time async-gen type
introspection so `--enable-stackscope` setup stays
quiet.
- log msg in the `_runtime` ImportError branch now
mentions `--enable-stackscope` alongside debug-mode.
Usage,
pytest --enable-stackscope -k <hang-test>
# in another shell, find the pid + signal:
kill -USR1 <pytest-or-subactor-pid>
# tail the artifact:
tail -f /tmp/tractor-stackscope-<pid>.log
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5418f2dc3c)
(factored: only the flag + activation hunks; the surrounding skipon-marker/reap-fixture context rides with the testing-harness segment)
Bottle up the diagnostic primitives that actually cracked the
silent mid-suite hangs in the `subint` spawn-backend bringup (issue
there" session has them on the shelf instead of reinventing from
scratch.
Deats,
- `dump_on_hang(seconds, *, path)` — context manager wrapping
`faulthandler.dump_traceback_later()`. Critical gotcha baked in:
dumps go to a *file*, not `sys.stderr`, bc pytest's stderr
capture silently eats the output and you can spend an hour
convinced you're looking at the wrong thing
- `track_resource_deltas(label, *, writer)` — context manager
logging per-block `(threading.active_count(),
len(_interpreters.list_all()))` deltas; quickly rules out
leak-accumulation theories when a suite progressively worsens (if
counts don't grow, it's not a leak, look for a race on shared
cleanup instead)
- `resource_delta_fixture(*, autouse, writer)` — factory returning
a `pytest` fixture wrapping `track_resource_deltas` per-test; opt
in by importing into a `conftest.py`. Kept as a factory (not a
bare fixture) so callers own `autouse` / `writer` wiring
Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
commit `26fb820`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 09466a1e9d)
Matches the explicit `dict.pop(uid, None)` contract one
line above; same semantics as the prior truthy check.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0e3e008b0c)
`30e15925` ("Add `start_or_cancel()` to `trionics._taskc`")
inserted `async def start_or_cancel()` — whose body opens its
own col-4 `try:` — immediately before the trailing `else:
raise`. Because the edit was a pure insertion (0 deletions),
the *same* `else: raise` lines were silently REPARENTED: they
used to be the `for exc_match in matching: ... else: raise`
of `maybe_raise_from_masking_exc`, but now bind to
`start_or_cancel`'s `try/except` where they're unreachable
dead code.
Net effect: `maybe_raise_from_masking_exc` lost the `for/else`
re-raise of the un-masked exception, so a masked child
cancellation gets swallowed instead of surfaced.
- restore the `for/else: raise` to `maybe_raise_from_masking_exc`
- drop the now-dead `else: raise` from `start_or_cancel`
Surfaced as 2 deterministic failures in
`test_sigint_closes_lifetime_stack[wait_for_ctx-bg_aio_task-
send_SIGINT_to=child-*]` (the SIGINT-to-child "silent-abandon"
regime). Bisected with `trio` held at `0.29.0`: clean at
`9c36363b` (0/8), broken at `30e15925` (8/8), fixed (0/8).
NOT a `trio` (0.29↔0.33 identical) nor logging-plugin
regression.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 325574cc07)
Wrapper around `trio.Nursery.start()` that DOESN'T mask
out-of-band cancellation as a lossy startup failure.
Picks the right re-raise: ambient `Cancelled` when
present, the genuine startup-protocol `RuntimeError`
otherwise.
The problem,
- `trio.Nursery.start()` raises a generic
`RuntimeError("child exited without calling
task_status.started()")` whenever the started task
exits BEFORE calling `task_status.started()` —
INCLUDING the common case where the child was
cancelled out-of-band by an *ancestor* cancel-scope
erroring/cancelling.
- In that case the original `trio.Cancelled` is
swallowed and the caller is left w/ an opaque,
root-cause-detached `RuntimeError`.
The fix,
- Catch the "...started" RTE.
- `await trio.lowlevel.checkpoint_if_cancelled()` —
re-raises the in-flight `Cancelled` IFF we're under
effective cancellation (ancestor-inclusive), carrying
trio's auto-generated reason which points at the true
root exc.
- If we're NOT cancelled the `checkpoint_if_cancelled()`
is a cheap no-op and we fall through to re-raise the
genuine startup-protocol RTE.
Re-export from `tractor.trionics` so callers don't have
to reach into `_taskc`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 30e15925ba)
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
`trio_err.__cause__` back onto its own derivative relay.
`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
`aio_task.set_exception()` which on py3.13+ ALWAYS raises
`RuntimeError("Task does not support set_exception")` (dead code as
a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
flip `wait_on_aio_task` when delivery failed (avoids hanging on
`_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
checkpoint between capturing `_fut_waiter` and invoking the helper.
`Task._wakeup` clears `_fut_waiter = None` so re-reading
post-checkpoint loses the ref even though the exc is still in-flight
on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
a "trio_err -> caused -> relay" chain through `set_exception()`
→ `Task._wakeup` → coro raise → `wait_on_coro_final_result`
→ `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
narrow case where `_fut_waiter is None` AND task is runnable (sitting
in asyncio's ready queue, not parked on a poke-able future). NEVER
cancels when `_fut_waiter` carries an in-flight exc — that would race
+ mask the real terminating exc.
`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
/ `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
`trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
— tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
exactly one terminal `raise X [from Y]` or early `return`. Covers the
sibling `signal_trio_when_done()` resolution + the relay-echo
INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
(`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
trio_err`, raise the bare `trio_err` instead of `trio_err from
aio_err` (which would CYCLE the cause chain since the relay was itself
caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
EXCEPT or this WILL HANG" warning — the helper handles the failure
path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
shows the real root.
`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
+ `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
`cause=` — non-`None` only when the `try` body raised (graceful exit
→ None).
Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
trio_done_fute` next to the shielded version — exploratory note for
the longstanding "why is `.shield()` needed?" TODO.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit acd1cbeec4)
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:
graceful cancel-req -> bounded wait -> hard-kill
Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
of the legacy DEBUG-log + return-`False` path. Default stays `False`
for callers that handle their own escalation (e.g.
`_spawn.soft_kill()` polling `proc.poll()`).
- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
(SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
waiting on `proc.poll()`.
- replace `tn.start_soon(portal.cancel_actor)` in
`ActorNursery.cancel()` with the helper.
Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
opt-in).
ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.
Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 34f333a026)
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:
graceful cancel-req -> bounded wait -> hard-kill
Motivated by #subint_forkserver dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 38ffb875bd)
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5cd06810db)
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).
Also,
- update `test_root_passes_tpt_to_sub` to detect a
proto mismatch between parametrized `tpt_proto_key`
and CLI `tpt_proto`, asserting the new guard raises
`ValueError` with expected msg content.
- replace old commented-out notes with a clearer
explanation of the mismatch foot-gun.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d036ef7d7f)
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.
Also,
- close each endpoint independently in the finally so one raise doesn't
strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 2ee44a6fdd)
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.
New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.
Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.
First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
& `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
instance whose write-end is closed, and `drain()` then **spins forever
in C with no Python checkpoints**,
- this obviously burns 100% CPU and no signal delivery.
Standalone repro:
from trio._core._wakeup_socketpair import WakeupSocketpair
ws = WakeupSocketpair()
ws.write_sock.close()
ws.drain() # spins forever
Patch is one-line — break the drain loop on b'' EOF.
Manifested as two distinct test failures:
- `tests/test_multi_program.py::test_register_duplicate_name` hung at
100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
both threads parked in `epoll_wait`, no TCP connect-back to parent
ever happened.
Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).
Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.
Conc-anal write-ups, including strace + py-spy evidence:
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.
TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue
body's evidence section.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0ef549fadb)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
Inside a new new `tractor.spawn._reap` submod which kicks off providing
post-mortem subactor cleanup primitives, parent-side; consider it the
"sibling" of `tractor._testing._reap` which is the test-harness-oriented
brother mod.
Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454
where `hard_kill()`'s `SIGKILL` bypasses the subactor's
`_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files..
This adds 2 cleanup paths:
- explicit `bind_addrs` (when set at spawn time),
OR
- convention-based reconstruction from `subactor.aid.name + proc.pid`
for the random-self-assign case.
`.spawn.hard_kill()` now invokes the cleanup unconditionally
post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError`
skip.
Future work — authoritative tracking via a per-process
UDS bind-addr registry — documented in module docstring,
deferred to a follow-up PR.
Co-fix: `tractor/spawn/_trio.py::new_proc` already passes
`bind_addrs` + `subactor` to `hard_kill` via prior work
on this branch.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e9712dcaeb)
(factored: the tractor/_testing/_reap.py harness hunk rides with the testing-harness segment instead)
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.
In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
`loglevel` and stashing an `env_ll_report` to emit once the
console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
`log.get_console_log()` init *up* so the env-var report
routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
`start_method` and warn-logging when the env-var clobbers
an explicit caller value.
Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
`pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
`main_thread_forkserver` and `subint_forkserver` bc
example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
re-route a later in-process tractor test's spawn-backend
or loglevel.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 208e7c0926)
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.
Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
`platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
now run unconditionally:
* monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
`ManTracker` subclass (empty `register` / `unregister`
/ `ensure_running`).
* return `partial(SharedMemory, track=False)` for the per-allocation
opt-out.
* belt + suspenders: even if something dodges the wrapper, the
singleton can't talk to the inherited (broken) parent fd.
- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
skip of the unlink-callback; install a `try_unlink()` wrapper that
swallows `FileNotFoundError` (sibling-already-cleaned race in
shared-key setups). Without `mp.resource_tracker` doing it for us, we
own the unlink — `actor. lifetime_stack` is the right place since
tractor already controls actor lifecycle.
- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
module-level skip- list (tests pass now). Inline comment cross-refs
the two `_mp_bs` / `_shm` workarounds.
- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
rewrite — flips status from "open / unresolvable in tractor" to
"resolved, kept as decision record". Adds Resolution section, "Why
this is the right call" rationale (mp tracker is widely criticized;
tractor already owns lifecycle), trade-offs (crash-leaked segments,
lost mp leak warning), verification (7 passed under both
`subint_forkserver` and `trio` backends), and upstream issue links
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit aa3e230926)
(factored: dropped subint_forkserver conc-anal doc update)
(factored: the tests/test_shm.py skip-mark comment hunk rides with the test-hardening segment instead)