Commit Graph

2669 Commits (3d36a06f5e979722fe73422df105f693e5f38bc8)

Author SHA1 Message Date
Gud Boi 9fb1c4ccc0 Mk `--capture` guard CI-aware w/ local warn
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:

- CI (`CI` env-var set): `pytest.exit(rc=2)` on
  mismatch — forces every matrix-row to declare
  `--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
  experiment with `--capture=fd` to validate fixes.

Deats,
- drop `_cap_fd_set` global; add
  `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
  spawner-name lookup
- move inline comment wall → proper docstring w/
  Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
  `request: pytest.FixtureRequest` and reads
  `request.config.option.capture` instead of the
  `_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
  capture deadlocks)
- `set_fork_aware_capture()` returns the actual
  capture mode str from config, not a hardcoded
  `'sys'`
- lift `import warnings` to module level (was duped
  inside `pytest_configure`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 255c9c3a7c)
2026-06-09 23:24:18 -04:00
Gud Boi c0f5bd2915 Mk per-test reap fixtures opt-in
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.

Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
  per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
  psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e4953851de)
2026-06-09 23:24:18 -04:00
Gud Boi 32a7ead862 Use single f-string per pid in runaway warning
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 086e9f2c07)
2026-06-09 23:24:18 -04:00
Gud Boi 35e8880075 Add per-test runaway-subactor CPU detector to `_reap`
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).

Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. Session-end reaper still
SIGINT/SIGKILL survivors on normal exit.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5cf0312c78)
2026-06-09 23:24:18 -04:00
Gud Boi eb89db81a5 Fix `maybe_override_capture` to not get invalid capX fixture names..
(cherry picked from commit 32e89c67ee)
2026-06-09 23:24:18 -04:00
Gud Boi dd1d6cd51e Add fork-aware capture fixtures to `_testing.pytest`
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.

Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
  globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
  args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enalbes skipping tests that need capsys
  but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
  fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
  wrapper's fixture injection.

Also,
- add `alert_on_finish` session fixture (terminal
  bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
  our plugin eventually however..

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d549c72052)
2026-06-09 23:24:18 -04:00
Gud Boi 82e25c442a Add `pytest_load_initial_conftests()` for `--capture=`
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.

Also,
- load `tractor._testing.pytest` via `-p` in ini
  (bc bootstrapping hooks must register before
  conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
  tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
  per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
  and pop env var on disable.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 61d4525137)
2026-06-09 23:24:18 -04:00
Gud Boi 90c46288ad Add `--uds`/`--uds-only` flags to `tractor-reap`
Wire up `find_orphaned_uds()` + `reap_uds()` from
`_reap` as a new phase-3 UDS sweep in the CLI
script. Opt-in via `--uds` (run after proc reap +
shm) or `--uds-only` (skip other phases).

Also,
- consolidate skip-proc-reap logic into a single
  `skip_proc_reap` bool covering both `--shm-only`
  and `--uds-only`
- extend header docstring + usage examples

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0996a83655)
2026-06-09 23:24:18 -04:00
Gud Boi 053051535f Add UDS orphan-sweep helpers + reap fixtures to `_reap`
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:

- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
  `reap_uds()` — detect `<name>@<pid>.sock` files under
  `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
  the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
  lingering subactors, wait, SIGKILL survivors, then sweep orphaned
  UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
  snapshot sock-file dir before/after each test, warn + reap new
  orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
  with known-leaky teardown.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1cdc7fb302)
2026-06-09 23:24:18 -04:00
Gud Boi 4c50d610e6 Flip back to default `pytest` capture for CI
(cherry picked from commit 22cdf15b73)
2026-06-09 23:24:18 -04:00
Gud Boi 9a844b91f3 Drop global `pytest-timeout` cap from `pyproject.toml`
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.

Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. Recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.

Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.

For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.

For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.

`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.

ALSO, bump `xonsh` to at least `0.23.0` release.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3c366cac13)

(factored: the xonsh pin/editable-source hunks already landed with the devenv segment)
2026-06-09 23:24:18 -04:00
Gud Boi dcb00e5a8f Return parent `pid: int` from new `reap_subactors_per_test` fixture
(cherry picked from commit f8178df0fd)
2026-06-09 23:24:18 -04:00
Gud Boi 94d233a2f7 Add opt-in `reap_subactors_per_test` fixture
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.

Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...

Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.

Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b376eb0332)
2026-06-09 23:24:18 -04:00
Gud Boi bac291dff4 Fix `_testing.addr.get_rando_addr` cross-process collisions
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".

Switch to per-call `random.randint()` salted with `os.getpid()`
so:

- within one session: two calls return distinct ports — e.g.
  `test_tpt_bind_addrs::bind-subset-reg` now actually gets two
  different reg addrs on the TCP backend (it was silently
  duplicating before),
- across parallel sessions: pid salt biases each process's
  port choices apart, making cross-run collisions
  vanishingly rare.

Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7c5dd4d033)
2026-06-09 23:24:18 -04:00
Gud Boi 7bcb30f6a6 Add `--shm` orphan sweep to `tractor-reap`
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.

Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
  + `reap_shm()` helpers. Match criteria: regular file
  under `/dev/shm`, owned by current uid, AND no live
  proc has it open (mmap'd or fd-held). In-use
  enumeration via `psutil.Process.memory_maps()` +
  `.open_files()` — xplatform, kernel-canonical (same
  answer `lsof` would give), no reliance on
  tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
  `NotImplementedError` outside Linux/FreeBSD bc macOS
  POSIX shm has no fs-visible path (`shm_open` only)
  and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
  process reap) and `--shm-only` (skip process phase)
  flags. `-n` dry-runs both phases. Exit code is `1`
  if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
  the `testing` dep group; lazy-imported in `_reap.py`
  so the process-reap path stays import-clean without
  it.

Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
  (new section 10c) — covers match criteria + the
  preservation guarantee for unrelated apps.
- flip mitigation status in
  `subint_forkserver_mp_shared_memory_issue.md` from
  "could extend `tractor-reap`" to "implemented", with
  a note that callers should still UUID-pin shm keys to
  avoid cross-session collisions.

Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4f12d69b41)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-09 23:24:18 -04:00
Gud Boi 6de96b508f Add `tractor-reap` CLI + document auto-reap
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:

- orphan-mode (default): finds PPid==1 procs
  with cwd matching repo root + `python` in
  cmdline.
- descendant-mode (`--parent <pid>`): scoped
  sweep under a still-live supervisor.

SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).

Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 6d76b60404)
2026-06-09 23:24:18 -04:00
Gud Boi 34e28cd2e7 Add `_testing._reap` + auto-reap fixture
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.

Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
  `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
  - `find_descendants(parent_pid)` for the in-session case
    (PPid-direct-match while pytest is still alive).
  - `find_orphans(repo_root)` for the CLI / post- mortem case (`PPid==1`
    reparented to init + `cwd` filter to repo root + `python` cmdline
    filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
  all, poll up to `grace` for exit, SIGKILL any survivors. Returns
  `(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
  `tractor/_testing/pytest.py` — after `yield`, runs
  `find_descendants(os.getpid())` + `reap(...)` so each pytest session
  leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
  commit) for the pytest-died-mid-session case where the in-session
  fixture didn't get to run.

Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
  module-top in `pytest.py` (was inline-imported inside
  `pytest_generate_tests`), and reuse it in
  `pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
  mark arg is a valid spawn-method literal — catches typos at collection
  time.
- inline `# ?TODO` flags running these through the `try_set_backend`
  checker for stronger validation.

Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eae478f3d5)
2026-06-09 23:24:18 -04:00
Gud Boi 66029c3732 Default `pytest` to use `--capture=sys`
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).

Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
  `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
  capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
  `sys.stderr`).
- KEPT: `pytest -s` debugging behavior.

This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.

Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
  rationale comment cross-ref'ing the post-mortem doc

- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
  from `test_nested_ multierrors` — no longer needed.
  * file-level `pytestmark` covers any residual.

- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
  mark loosened from `strict=True` to `strict=False` + reason rewritten.
  * it passes in isolation but is session-env-pollution sensitive
    (leftover subactor PIDs competing for ports / inheriting harness
    FDs).
  * tolerate both outcomes until suite isolation improves.

- `test_shm`: extend the existing
  `skipon_spawn_backend('subint', ...)` to also skip
  `'subint_forkserver'`.
  * Different root cause from the cancel-cascade class:
    `multiprocessing.SharedMemory`'s `resource_tracker` + internals
    assume fresh- process state, don't survive fork-without-exec cleanly

- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
  (unrelated to forkserver; just a flaky-under-load bump).

- `tractor.spawn._subint_forkserver`: inline comment-only future-work
  marker right before `_actor_child_main()` describing the planned
  conditional stdout/stderr-to-`/dev/null` redirect for cases where
  `--capture=sys` isn't enough (no code change — the redirect logic
  itself is deferred).

EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum- invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4c133ab541)
(factored: dropped spawn-backend-only paths: tests/spawn/test_subint_forkserver.py + tractor/spawn/_subint_forkserver.py; the xfail-loosening bullet above no longer applies)

(factored: the test-file mark adjustments ride with the test-hardening segment)
2026-06-09 23:24:18 -04:00
Gud Boi f0716962c6 Add `skipon_spawn_backend` pytest marker
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.

Deats,
- `pytest_configure()` registers the marker via
  `addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
  each collected item with `item.iter_markers(
  name='skipon_spawn_backend')`, checks whether the
  active `--spawn-backend` appears in `mark.args`, and
  if so injects a concrete `pytest.mark.skip(
  reason=...)`. `iter_markers()` makes the decorator
  work at function, class, or module (`pytestmark =
  [...]`) scope transparently.
- First matching mark wins; default reason is
  `f'Borked on --spawn-backend={backend!r}'` if the
  caller doesn't supply one.

Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3b26b59dad)
2026-06-09 23:24:18 -04:00
Gud Boi fe25e2c448 Add global 200s `pytest-timeout`
(cherry picked from commit 5998774535)
2026-06-09 23:24:18 -04:00
Gud Boi a0e2c08119 Wall-cap `test_stale_entry_is_deleted` via `pytest-timeout`
Add a hard process-level wall-clock bound on a test
known to wedge un-Ctrl-C-ably under an in-dev spawn
backend, so an unattended suite run can't hang
indefinitely.

Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
  `@pytest.mark.timeout(3, method='thread')`. The
  `method='thread'` choice is deliberate —
  `method='signal'` routes via `SIGALRM` which can be
  starved by the same GIL-hostage path that drops
  `SIGINT`, so it'd never actually fire in the
  starvation case.

At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 189f4e3f72e9f1eda5d24bcbab5743f7e35bd913)
(factored: kept pyproject + tests/discovery/test_registrar.py parts of
 "Wall-cap `subint` audit tests via `pytest-timeout`"; dropped
 tests/test_subint_cancellation.py)
2026-06-09 23:24:18 -04:00
Gud Boi 0e03c6815b Add `supervise_run_process` to `trionics._subproc`
A `trio.Nursery.start()`-style wrapper around
`trio.run_process()` that surfaces rc!=0 errors
deterministically, ALWAYS isolates the parent
controlling-tty, and optionally live-relays the child's
std-streams to `log.<level>` per-line. Suits both
short-lived test-runners + long-lived daemons.

`supervise_run_process()`,
- Deterministic rc!=0: pass `check=False` to `trio`
  and do our OWN post-drain rc-check from the
  supervisor coro body AFTER `own_tn.__aexit__` — NOT
  inside the internal nursery, since that would
  race-cancel the still-draining relay reader and lose
  stderr lines. (Re)build + raise a BARE
  `subprocess.CalledProcessError`: `.stderr=` for
  programmatic callers + an `add_note()`'d
  `|_.stderr:` block for human teardown logs. No
  nursery-eg-wrapped CPE to `collapse_eg` around.
- Parent controlling-tty isolation: `stdin=DEVNULL`
  always, `stdout=DEVNULL` unless relayed/overridden
  (via `stdout=` kwarg w/ `_UNSET` sentinel so explicit
  `None` = inherit still works). Prevents a spawned
  program from clobbering the launching tty's scrollback
  w/ control-seqs.
- Live per-line relay: `relay_stdout=True`/
  `relay_stderr=True` → relayed to `log.<relay_level>`
  (default `'io'`, our custom level 21). Picked to sort
  just above stdlib `INFO`=20 so it shows at usual
  `info`/`devx` levels yet stays separately filterable;
  `runtime`=15 was REJECTED as a default since it'd be
  silently filtered at usual verbosity — footgun for
  daemon supervisors whose whole point is visibility.
  STREAMED, not buffered-until-exit.
- Non-blocking `tn.start()` semantics: live
  `trio.Process` handed up via
  `task_status.started()` immediately (else
  `tn.start()` would block till child exit, losing
  the long-lived-daemon use case). Supervise/relay bg
  tasks run to completion in this coro.
- `**run_process_kwargs` forwarded verbatim (env, shell,
  cwd, start_new_session, executable, ...); MANAGED keys
  (`stdin`/`stdout`/`stderr`/`check`) win on conflict.
- Crash-handling layer intentionally NOT baked in —
  compose `maybe_open_crash_handler()` ON TOP at the
  call-site.

`_relay_stream_lines()` helper,
- Concurrent pipe-drain reader. MANDATORY whenever piping
  w/o `capture_*` since nothing else drains the OS pipe —
  child blocks on `write()` once kernel buf (~64KiB) fills
  → deadlock.
- Modes (combine freely): `emit`-only live relay,
  `accum`-only silent drain+capture (for the CPE note),
  or both. Per-line splitting handles cross-chunk
  residuals + flushes any trailing un-newline-term'd line
  at EOF.

`_add_stderr_note()` helper,
- Attaches an indented `|_.stderr:` note to a CPE via
  `add_note()` for legible rc!=0 reporting at teardown.

Tests (`tests/trionics/test_subproc.py`),
- Hermetic `trio`-only (no actor-runtime).
- `test_stdout_relayed_per_line`: per-line stdout relay.
- `test_parent_tty_isolated`: child fd1 is OUR pipe (no
  `/dev/pts/*`), fd0 pinned to `/dev/null`.
- `test_no_deadlock_on_big_unnewlined_output`: 200KiB
  no-newline output completes under `fail_after(2)` —
  exercises the concurrent drain (without it, the child
  blocks at ~64KiB).
- `test_stderr_relay_and_cpe_rebuild`: rc!=0 w/
  `relay_stderr=True` → bare `CalledProcessError` w/ the
  `.stderr` note + per-line live relay.
- `test_nonrelay_cpe_note`: rc!=0 w/o relay → same
  deterministic post-drain CPE w/ `.stderr` note (silent
  drain+capture path).

Re-export `supervise_run_process` from `tractor.trionics`.

Prompt-IO: ai/prompt-io/claude/20260601T231429Z_0e3e008b_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit f595acc76c)
2026-06-09 23:24:18 -04:00
Gud Boi 9c905b390b Add `add_log_level()` factory + register `IO`=21
Follow-up to f595acc7 (`supervise_run_process`) which
called `log.io(...)` for std-stream relay assuming an
`IO=21` level existed. Add the registration via a new
factory + tests covering both the factory and the new
level.

`add_log_level()` factory,
- One call wires the four (otherwise hand-synced) pieces:
  - `CUSTOM_LEVELS[NAME]` — drives the `stacklevel` bump
    in `StackLevelAdapter.log()` + `get_logger()`'s
    per-level audit.
  - `logging.addLevelName()` — stdlib name registration.
  - `STD_PALETTE[NAME]` + `BOLD_PALETTE['bold'][NAME]` —
    color entries consumed by `get_console_log()`'s
    `ColoredFormatter` build.
  - Same-named (lowercase) emit method bound on
    `StackLevelAdapter` so `log.<name>('msg')` works +
    `get_logger()`'s per-level method audit passes.
- Idempotent: re-registering an existing name is a
  no-op-ish refresh that won't clobber an already-bound
  method.
- Method binding uses a default-arg `_level=value` so
  the level int is captured (not late-bound across
  multiple registrations).

`IO=21` level (first user),
- Purple. Used by `tractor.trionics._subproc`'s
  std-stream relay (see f595acc7).
- Value 21 picked to sit just ABOVE stdlib `INFO`=20 so
  it's SHOWN BY DEFAULT at usual `info`/`devx` console
  levels — a `runtime`=15 relay would be silently
  filtered (footgun for daemon supervisors whose whole
  point is visibility). Still distinctly labeled +
  filterable.

Tests (`tests/test_log_sys.py`),
- `test_io_custom_level_registered`: validates the IO
  level is fully wired (`CUSTOM_LEVELS`, `addLevelName`,
  both palettes, `StackLevelAdapter.io()` callable);
  emits a record + sanity-asserts `21 >= INFO(20)`.
- `test_add_log_level_pluggable`: registers a fresh
  `XLVL=19` (cyan) via `add_log_level()`, asserts all
  four wires + the bound `xlog.xlvl()` emit, then
  try/finally cleans up the module-global mutations so
  later `get_logger()` audits don't trip on a
  half-removed level.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7bd7dd50c7)
2026-06-09 23:24:18 -04:00
Gud Boi d55348de20 Add `logspec` leaf-mod Route B follow-up doc
Follow-up note documenting why the deeper "Route B" fix
for `LogSpec`/`apply_logspec()` true per-leaf-MODULE level
control was NOT taken — in favor of the smaller
sub-PACKAGE fix that shipped in 9c36363b.

Doc covers,
- Status: what 9c36363b already gives (per-sub-pkg
  control at any nesting depth, `devx.debug` ≠ `devx`)
  vs. what remains unaddressed (per-leaf-mod levels,
  top-level lib mods like `tractor.to_asyncio` on the
  root logger).
- "Route B" sketch: make logger *identity* the full
  dotted module path; mv the cosmetic leaf-trim out of
  logger-naming into the *formatter's* `{name}`
  rendering.
- 6 breaking-change costs: every logger name changes,
  formatter rewrite, propagation/double-emit surface
  grows, level-inheritance semantics shift,
  `modden`/`piker` contract churn, `get_logger()`
  refactor risk.
- Migration plan if pursued: extract a pure
  `_mk_logger_name()` helper w/ an exhaustive name-shape
  test matrix, swap `get_logger()` to use it for
  identity, swap formatter to use the display string,
  golden-diff rendered headers, coordinate w/
  downstreams.
- "Route A" alternative: a `logging.Filter` keyed on
  `record.module`/`pathname` for per-leaf control w/o
  name churn — lower risk, narrower power.
- Recommendation: defer Route B; prefer Route A if
  per-leaf is needed soon; the shipped sub-PKG fix
  covers the common ask.

Lives under `ai/tooling-todos/` since it's a deferred-
work decision record, not a triage/conc-anal doc.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5b3c2e3762)
2026-06-09 23:24:18 -04:00
Gud Boi 11b9a87077 Fix `get_logger()` collapse of nested sub-pkgs
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.

- detect a true leaf-mod by comparing the caller's `__name__`
  vs `__package__` (a pkg `__init__` has them equal -> its
  trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
  from a bare `devx` -> `tractor.devx`; the old unconditional
  `pkg_path = subpkg_path` collapsed both to `tractor.devx` and
  silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
  the leaf-mod is in the `{filename}` header field).

Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
  addressable at ANY depth; leaf *modules* intentionally aren't
  (they're the `{filename}`); top-level mods (eg. `to_asyncio`)
  still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
  new literal explicit-`name` contract (no leaf-collapse).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9c36363b01)
2026-06-09 23:24:18 -04:00
Gud Boi 7478478038 Lift `--ll`/`--tl` to plugin + `LogSpec` API
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:

Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
  fixture, and `test_log` fixture out of the test-suite-
  local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
  `--ll`: `--ll` is the consuming-project's OWN app
  loglevel (scoped to its pkg-hierarchy), `--tl` is the
  `tractor.*` runtime loglevel. `--tl` falls back to
  `--ll` when unset (preserves current `tractor`-suite
  behavior).
- add `testing_pkg_name` session fixture (default
  `'tractor'`) — downstream projects override to e.g.
  `'modden'` so `--ll` scopes to their own hierarchy
  instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
  tractor-runtime level (passed to
  `open_root_actor(loglevel=<.>)` by `@tractor_test`)
  AND separately applies `--ll` to the
  `testing_pkg_name` hierarchy when that isn't
  `tractor`. `test_log` scopes the per-test logger to
  `testing_pkg_name`.

`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
  - `True` → enable `pkg_name` root at `default_level`
    (fallback `'cancel'`).
  - `False` → no-op.
  - bare level eg. `'info'` → root-logger at that level.
  - `'sub:info,x:cancel'` → per-sub-logger filter-spec;
    each `<name>` is RELATIVE to `pkg_name` (must NOT
    include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
  `None` key = root-logger. Mixed bare-level + filters
  in one spec is rejected w/ a helpful err msg; so is
  embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
  parses then enables a `colorlog` stderr handler per
  named (sub)logger. Authoritative sub-logger filters
  get `propagate=False` so they don't double-emit
  through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
  sub-pkg granularity, not leaf-module — so `devx.debug`
  collapses to the same `tractor.devx` logger as a bare
  `devx`, and top-level lib modules (eg.
  `tractor.to_asyncio`) emit under the *root* logger
  rather than a phantom `to_asyncio` child. Documented
  inline on `LogSpec`.

Other,
- `tests/conftest.py` keeps a NOTE pointing to the
  plugin for future-debugging clarity (don't remove
  silently — the lift is the relevant signal).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 19a77708ba)
2026-06-09 23:24:18 -04:00
Gud Boi 9e09dc5eee Default `--ll` to `None` in test harness
Only override `tractor.log._default_loglevel` when
the flag is explicitly passed — lets per-spawn and
per-example `loglevel` kwargs take effect instead
of being clobbered by the hard-coded `'ERROR'`
default.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 72a0465c52)
2026-06-09 23:24:18 -04:00
Gud Boi 3932daaf4f Drop `debug_mode` gate on stackscope SIGUSR1
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).

Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
  conjunction around the `enable_stack_on_sig()` call;
  now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
  `use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
  of the `if rvs['_debug_mode']:` block to peer-level.
  The `use_greenback` branch stays inside `_debug_mode`
  (pdb-specific).
- Refresh inline comments on both sites to call out the
  infected-`asyncio` "default SIGUSR1 = terminate proc"
  rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3d9c75b6ed)
2026-06-09 23:24:18 -04:00
Gud Boi 4da9c3daa8 Add `use_stackscope` runtime var for subactor init
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.

Deats,
- add `use_stackscope: bool` to `RuntimeVars`
  struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
  successful `stackscope` import, asserts unset
  on `ImportError`
- nest stackscope init under `_debug_mode` gate
  in `Actor.async_main`, check rtvar alongside
  env var
- defer `maybe_init_greenback` import to its own
  `use_greenback` branch

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 48523358cf)
2026-06-09 23:24:18 -04:00
Gud Boi c4ec664bfa Fix `SIGUSR1` tree-dump ordering in `_stackscope`
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.

Also,
- gate file/tty sink writes behind `write_file` +
  `write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
  new sequential parent -> relay-log -> sub ordering.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e2b790a70d)
2026-06-09 23:24:18 -04:00
Gud Boi 14ddd49660 Route `stackscope` SIGUSR1 onto trio loop
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.

Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.

Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).

`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.

Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
  hint in the rendered dump header so users know where
  to find the artifact even when stdio is captured.
- swap the inline `f'     |_{actor}'` line for a
  `_pformat.nest_from_op` rendering of `actor_repr`
  (matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
  branches now note `(trio_token captured: <bool>)`
  so it's obvious from the log whether the full-tree
  path is wired.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 2d4995e08d)
2026-06-09 23:24:18 -04:00
Gud Boi 109313d9de Add `--enable-stackscope` pytest plugin flag
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.

Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
  * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
    inherit it via environ,
  * installs the handler in pytest itself via
    `enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
  existing `_debug_mode` gate to ALSO fire when
  `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
  install the same handler at runtime startup.

Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:

- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
  always written) — guaranteed-readable artifact even
  under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
  tty; ignored if device is missing.

Other,
- squelch the benign `RuntimeWarning` ("coroutine method
  'asend'/'athrow' was never awaited") from
  `stackscope._glue`'s import-time async-gen type
  introspection so `--enable-stackscope` setup stays
  quiet.
- log msg in the `_runtime` ImportError branch now
  mentions `--enable-stackscope` alongside debug-mode.

Usage,
  pytest --enable-stackscope -k <hang-test>
  # in another shell, find the pid + signal:
  kill -USR1 <pytest-or-subactor-pid>
  # tail the artifact:
  tail -f /tmp/tractor-stackscope-<pid>.log

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5418f2dc3c)

(factored: only the flag + activation hunks; the surrounding skipon-marker/reap-fixture context rides with the testing-harness segment)
2026-06-09 23:24:18 -04:00
Gud Boi 9500d02ef6 Add `._debug_hangs` to `.devx` for hang triage
Bottle up the diagnostic primitives that actually cracked the
silent mid-suite hangs in the `subint` spawn-backend bringup (issue
there" session has them on the shelf instead of reinventing from
scratch.

Deats,
- `dump_on_hang(seconds, *, path)` — context manager wrapping
  `faulthandler.dump_traceback_later()`. Critical gotcha baked in:
  dumps go to a *file*, not `sys.stderr`, bc pytest's stderr
  capture silently eats the output and you can spend an hour
  convinced you're looking at the wrong thing
- `track_resource_deltas(label, *, writer)` — context manager
  logging per-block `(threading.active_count(),
  len(_interpreters.list_all()))` deltas; quickly rules out
  leak-accumulation theories when a suite progressively worsens (if
  counts don't grow, it's not a leak, look for a race on shared
  cleanup instead)
- `resource_delta_fixture(*, autouse, writer)` — factory returning
  a `pytest` fixture wrapping `track_resource_deltas` per-test; opt
  in by importing into a `conftest.py`. Kept as a factory (not a
  bare fixture) so callers own `autouse` / `writer` wiring

Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
  commit `26fb820`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 09466a1e9d)
2026-06-09 23:24:18 -04:00
Gud Boi 7f0183d466 Use `is not None` check for peer-connect `event`
Matches the explicit `dict.pop(uid, None)` contract one
line above; same semantics as the prior truthy check.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0e3e008b0c)
2026-06-09 23:08:40 -04:00
Gud Boi f64b282620 Fix dropped `for/else` re-raise in masking CM
`30e15925` ("Add `start_or_cancel()` to `trionics._taskc`")
inserted `async def start_or_cancel()` — whose body opens its
own col-4 `try:` — immediately before the trailing `else:
raise`. Because the edit was a pure insertion (0 deletions),
the *same* `else: raise` lines were silently REPARENTED: they
used to be the `for exc_match in matching: ... else: raise`
of `maybe_raise_from_masking_exc`, but now bind to
`start_or_cancel`'s `try/except` where they're unreachable
dead code.

Net effect: `maybe_raise_from_masking_exc` lost the `for/else`
re-raise of the un-masked exception, so a masked child
cancellation gets swallowed instead of surfaced.

- restore the `for/else: raise` to `maybe_raise_from_masking_exc`
- drop the now-dead `else: raise` from `start_or_cancel`

Surfaced as 2 deterministic failures in
`test_sigint_closes_lifetime_stack[wait_for_ctx-bg_aio_task-
send_SIGINT_to=child-*]` (the SIGINT-to-child "silent-abandon"
regime). Bisected with `trio` held at `0.29.0`: clean at
`9c36363b` (0/8), broken at `30e15925` (8/8), fixed (0/8).
NOT a `trio` (0.29↔0.33 identical) nor logging-plugin
regression.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 325574cc07)
2026-06-09 23:08:40 -04:00
Gud Boi 549fc26516 Add `start_or_cancel()` to `trionics._taskc`
Wrapper around `trio.Nursery.start()` that DOESN'T mask
out-of-band cancellation as a lossy startup failure.
Picks the right re-raise: ambient `Cancelled` when
present, the genuine startup-protocol `RuntimeError`
otherwise.

The problem,
- `trio.Nursery.start()` raises a generic
  `RuntimeError("child exited without calling
  task_status.started()")` whenever the started task
  exits BEFORE calling `task_status.started()` —
  INCLUDING the common case where the child was
  cancelled out-of-band by an *ancestor* cancel-scope
  erroring/cancelling.
- In that case the original `trio.Cancelled` is
  swallowed and the caller is left w/ an opaque,
  root-cause-detached `RuntimeError`.

The fix,
- Catch the "...started" RTE.
- `await trio.lowlevel.checkpoint_if_cancelled()` —
  re-raises the in-flight `Cancelled` IFF we're under
  effective cancellation (ancestor-inclusive), carrying
  trio's auto-generated reason which points at the true
  root exc.
- If we're NOT cancelled the `checkpoint_if_cancelled()`
  is a cheap no-op and we fall through to re-raise the
  genuine startup-protocol RTE.

Re-export from `tractor.trionics` so callers don't have
to reach into `_taskc`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 30e15925ba)
2026-06-09 23:08:40 -04:00
Gud Boi 23cc1413dd Add `maybe_signal_aio_task()` + cause-chain guard
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
  `trio_err.__cause__` back onto its own derivative relay.

`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
  `aio_task.set_exception()` which on py3.13+ ALWAYS raises
  `RuntimeError("Task does not support set_exception")` (dead code as
  a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
  flip `wait_on_aio_task` when delivery failed (avoids hanging on
  `_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
  checkpoint between capturing `_fut_waiter` and invoking the helper.
  `Task._wakeup` clears `_fut_waiter = None` so re-reading
  post-checkpoint loses the ref even though the exc is still in-flight
  on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
  a "trio_err -> caused -> relay" chain through `set_exception()`
  → `Task._wakeup` → coro raise → `wait_on_coro_final_result`
  → `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
  narrow case where `_fut_waiter is None` AND task is runnable (sitting
  in asyncio's ready queue, not parked on a poke-able future). NEVER
  cancels when `_fut_waiter` carries an in-flight exc — that would race
  + mask the real terminating exc.

`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
  / `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
  `trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
  — tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
  exactly one terminal `raise X [from Y]` or early `return`. Covers the
  sibling `signal_trio_when_done()` resolution + the relay-echo
  INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
  (`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
  trio_err`, raise the bare `trio_err` instead of `trio_err from
  aio_err` (which would CYCLE the cause chain since the relay was itself
  caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
  EXCEPT or this WILL HANG" warning — the helper handles the failure
  path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
  the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
  shows the real root.

`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
  + `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
  `cause=` — non-`None` only when the `try` body raised (graceful exit
  → None).

Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
  trio_done_fute` next to the shielded version — exploratory note for
  the longstanding "why is `.shield()` needed?" TODO.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit acd1cbeec4)
2026-06-09 23:08:40 -04:00
Gud Boi d78962c319 Escalate cancel-ack timeouts to `proc.terminate()`
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:

  graceful cancel-req -> bounded wait -> hard-kill

Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
  When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
  of the legacy DEBUG-log + return-`False` path. Default stays `False`
  for callers that handle their own escalation (e.g.
  `_spawn.soft_kill()` polling `proc.poll()`).

- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
  cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
  (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
  waiting on `proc.poll()`.

- replace `tn.start_soon(portal.cancel_actor)` in
  `ActorNursery.cancel()` with the helper.

Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
  REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
  opt-in).

ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.

Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 34f333a026)
2026-06-09 23:08:40 -04:00
Gud Boi 703973b7c4 Add `ActorTooSlowError` for cancel-cascade timeouts
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:

  graceful cancel-req -> bounded wait -> hard-kill

Motivated by #subint_forkserver dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 38ffb875bd)
2026-06-09 23:08:40 -04:00
Gud Boi f6c9665bf1 Tidy proto-guard `ValueError` fmt in `open_root_actor()`
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5cd06810db)
2026-06-09 23:08:40 -04:00
Gud Boi d3b1a68ff9 Add `enable_transports`/`registry_addrs` proto guard
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).

Also,
- update `test_root_passes_tpt_to_sub` to detect a
  proto mismatch between parametrized `tpt_proto_key`
  and CLI `tpt_proto`, asserting the new guard raises
  `ValueError` with expected msg content.
- replace old commented-out notes with a clearer
  explanation of the mismatch foot-gun.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d036ef7d7f)
2026-06-09 23:08:40 -04:00
Gud Boi bdf0fb0a2e Fix shutdown deadlock on UDS unlink race
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.

Also,
- close each endpoint independently in the finally so one raise doesn't
  strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 2ee44a6fdd)
2026-06-09 23:07:44 -04:00
Gud Boi aaf696ba4c Drop global mutation of `_PROC_SPAWN_WAIT`
In top level `daemon`-fixture that is..

Use a local `bg_daemon_spawn_delay` instead of
mutating the module-level `_PROC_SPAWN_WAIT` —
previously each `daemon` fixture invocation would
permanently add 1.6s (UDS) or 1s (CI) to the
global, inflating delays across the session.

Also, emit a `test_log.warning()` when verbose
loglevel is silently reduced to `'info'`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit c4885f9d99)
2026-06-09 23:07:44 -04:00
Gud Boi 2d8bcbb1c4 Add `tractor.trionics.patches` subpkg + first fix
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.

New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.

Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.

First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
  ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
  & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
  instance whose write-end is closed, and `drain()` then **spins forever
  in C with no Python checkpoints**,
- this obviously burns 100% CPU and no signal delivery.

Standalone repro:

    from trio._core._wakeup_socketpair import WakeupSocketpair
    ws = WakeupSocketpair()
    ws.write_sock.close()
    ws.drain()  # spins forever

Patch is one-line — break the drain loop on b'' EOF.

Manifested as two distinct test failures:

- `tests/test_multi_program.py::test_register_duplicate_name` hung at
  100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
  deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
  both threads parked in `epoll_wait`, no TCP connect-back to parent
  ever happened.

Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).

Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.

Conc-anal write-ups, including strace + py-spy evidence:

- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`

Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.

TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue
      body's evidence section.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0ef549fadb)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
2026-06-09 23:07:44 -04:00
Gud Boi 2c0fefef61 Add `tractor.spawn._reap.unlink_uds_bind_addrs()`
Inside a new new `tractor.spawn._reap` submod which kicks off providing
post-mortem subactor cleanup primitives, parent-side; consider it the
"sibling" of `tractor._testing._reap` which is the test-harness-oriented
brother mod.

Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454
where `hard_kill()`'s `SIGKILL` bypasses the subactor's
`_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files..

This adds 2 cleanup paths:
- explicit `bind_addrs` (when set at spawn time),
OR
- convention-based reconstruction from `subactor.aid.name + proc.pid`
  for the random-self-assign case.

`.spawn.hard_kill()` now invokes the cleanup unconditionally
post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError`
skip.

Future work — authoritative tracking via a per-process
UDS bind-addr registry — documented in module docstring,
deferred to a follow-up PR.

Co-fix: `tractor/spawn/_trio.py::new_proc` already passes
`bind_addrs` + `subactor` to `hard_kill` via prior work
on this branch.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e9712dcaeb)

(factored: the tractor/_testing/_reap.py harness hunk rides with the testing-harness segment instead)
2026-06-09 23:07:44 -04:00
Gud Boi 400ed77ab1 Allow per-call `start_method`/`loglevel` overrides
In `tests/devx/conftest.py::spawn`, refactor the
fixture-internal closures so consumer tests can pass
explicit `start_method`/`loglevel` to each `_spawn()`
invocation rather than only inheriting the fixture-
scoped parametrize values.

Deats,
- promote `set_spawn_method()` and `set_loglevel()`
  to take their respective values as fn params (vs
  closing over the fixture-scope vars).
- give `_spawn()` `start_method=start_method` and
  `loglevel: str|None = None` kwargs so callers
  override one-off without re-parametrizing the
  suite. NOTE: this drops the implicit fixture-
  scoped `loglevel` forward — `_spawn()` callers
  now must pass `loglevel=...` explicitly.
- TODO: figure out how `--ll <level>` should map to
  the default (currently `None` → uses env-var or
  tractor default).
- add a docstring to `_spawn()` so its role as the
  consumer-facing closure is obvious from `help()`.

Also,
- `assert_before()` now returns the `.before` output
  on success (was `None`); add a one-line docstring
  describing the new return contract.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 486249d74f)
2026-06-09 23:07:06 -04:00
Gud Boi c794d5ef44 Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.

In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
  `loglevel` and stashing an `env_ll_report` to emit once the
  console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
  `log.get_console_log()` init *up* so the env-var report
  routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
  `start_method` and warn-logging when the env-var clobbers
  an explicit caller value.

Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
  and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
  `pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
  `main_thread_forkserver` and `subint_forkserver` bc
  example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
  re-route a later in-process tractor test's spawn-backend
  or loglevel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 208e7c0926)
2026-06-09 23:07:06 -04:00
Gud Boi 75bb371d5d Fix `SharedMemory` under `subint_forkserver`
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.

Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
  `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
  now run unconditionally:
  * monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
    `ManTracker` subclass (empty `register` / `unregister`
    / `ensure_running`).
  * return `partial(SharedMemory, track=False)` for the per-allocation
    opt-out.
  * belt + suspenders: even if something dodges the wrapper, the
    singleton can't talk to the inherited (broken) parent fd.

- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
  skip of the unlink-callback; install a `try_unlink()` wrapper that
  swallows `FileNotFoundError` (sibling-already-cleaned race in
  shared-key setups). Without `mp.resource_tracker` doing it for us, we
  own the unlink — `actor. lifetime_stack` is the right place since
  tractor already controls actor lifecycle.

- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
  module-level skip- list (tests pass now). Inline comment cross-refs
  the two `_mp_bs` / `_shm` workarounds.

- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
  rewrite — flips status from "open / unresolvable in tractor" to
  "resolved, kept as decision record". Adds Resolution section, "Why
  this is the right call" rationale (mp tracker is widely criticized;
  tractor already owns lifecycle), trade-offs (crash-leaked segments,
  lost mp leak warning), verification (7 passed under both
  `subint_forkserver` and `trio` backends), and upstream issue links

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit aa3e230926)
(factored: dropped subint_forkserver conc-anal doc update)

(factored: the tests/test_shm.py skip-mark comment hunk rides with the test-hardening segment instead)
2026-06-09 23:06:42 -04:00
Gud Boi 9604569584 Bound peer-clear wait in `async_main` finally
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e312a68d8a)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-09 23:06:09 -04:00
Gud Boi 332ac30636 Break parent-chan shield during teardown
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8ac3dfeb85)
2026-06-09 23:06:09 -04:00