Commit Graph

1494 Commits (084f0fc404bf470454c5017a88762c6afeb50c8e)

Author SHA1 Message Date
Gud Boi c797bcb783 Address Copilot review nits on PR #462
All 5 flagged items were valid (4 real bugs + 1 dead assert),

- fix an inverted `sys.version_info < (3, 14)` guard in
  `ipc._linux` — the "`cffi` has no 3.14 support" import note now
  fires on 3.14+ (where it applies) instead of on older pys.
- use `os.environ.get('PYTHON_COLORS')` in the `sync_bp` example
  so it doesn't `KeyError` when run outside the test harness.
- correct `dump_task_tree()`'s docstring: the `/tmp` + `/dev/tty`
  tee is gated on `write_file`/`write_tty`, not "unconditional".
- tidy the `ActorTooSlowError` message spacing in `cancel_actor`.
- replace a tautological `applied is True or applied is False` in
  `test_patches` with `isinstance(applied, bool)` (the value is
  order-dependent across the module).

Review: PR #462 (copilot-pull-request-reviewer)
https://github.com/goodboy/tractor/pull/462#pullrequestreview-4527179852

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-18 13:37:55 -04:00
Gud Boi b0ac681245 Fix dead-code env-var override notice in `open_root_actor`
The `TRACTOR_LOGLEVEL`/`TRACTOR_SPAWN_METHOD` override-notice
branches were unreachable: `loglevel`/`start_method` were
reassigned to the env value BEFORE the `!=` compare, so the
"OVERRIDES caller-passed" message never fired. Capture the
caller value first, then compare. Rel. `208e7c09`/`d4eac06d`
"Honor env-vars" (`trionics.start_or_cancel`); surfaced by
`/code-review high` on #462.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:46:43 -04:00
Gud Boi fdac157d3d Harden cancel-ack hard-kill escalation
Two defensive fixes around the `Portal.cancel_actor()` +
`_try_cancel_then_kill()` escalation from `34f333a0`
"Escalate cancel-ack timeouts to `proc.terminate()`" (the
`trionics.start_or_cancel` follow-up); surfaced by
`/code-review high` on #462,

- guard `proc.terminate()` for backends whose `proc` slot
  isn't a `Process` — the future `subint` backend stores an
  `int` interp-id, so escalation would `AttributeError`
  instead of hard-killing; now it logs + no-ops.
- swap `assert cs.cancelled_caught` for an
  `if cs.cancelled_caught and raise_on_timeout:` guard so an
  unexpected shielded-scope exit returns a soft `False`
  rather than crashing `cancel_actor()` mid-teardown.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:46:04 -04:00
Gud Boi d28173f4c0 Make SIGUSR1 `stackscope` dumps actually work
Two fixes to the hang-debug SIGUSR1 task-tree dump path,
surfaced by `/code-review high` on #462,

- re-add `_debug_mode` to the sub-actor handler-install gate
  in `_runtime.py`. Dropping it (rel. `3a386ba5`/`3d9c75b6`
  "Drop debug_mode gate", from the `custom_log_levels_api`
  follow-up) was meant to *also* enable non-pdb runs, but
  nothing sets `use_stackscope` from `debug_mode`, so
  debug-mode subs were left with NO handler — and the default
  SIGUSR1 disposition then *kills* them. Now additive:
  `_debug_mode OR use_stackscope OR env`.
- pass `write_file=True` at both `dump_task_tree()` SIGUSR1
  call sites so the advertised `/tmp/tractor-stackscope-<pid>`
  `.log` tee is actually written (was dead under
  `--capture=fd`). Matches `1b1ef10a` "Re-enable writing
  `stackscope` to file by default"; param from `0df90500`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:45:21 -04:00
Gud Boi f08a7d52b5 Drop stray `breakpoint()` in `RuntimeVars.__setattr__`
Left-over debug trap from the `_runtime_vars` pure get/set
refactor — it fired on *every* struct-form rt-var write (e.g.
via `.update()`), hanging any non-tty / CI / forked actor on
`pdb` stdin.

Surfaced by a `/code-review high` pass on #462.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:44:34 -04:00
Gud Boi 6d45581910 Add autouse fixture to reset `_runtime_vars` per-test
`open_root_actor()` writes `_enable_tpts` (and friends) into
the process-global `_state._runtime_vars` dict but nothing
resets it on actor teardown. Under the in-proc `pytest`
launchpad a uds-using test leaks `_enable_tpts=['uds']` into
a sibling tcp test, tripping the
`registry_addrs`×`enable_transports` proto-guard in
`open_root_actor()` with a `ValueError`.

New `_reset_runtime_vars` fixture snapshots + restores the
dict around every test so no runtime-var state crosses a
test boundary.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 17:39:44 -04:00
Gud Boi b6bf865f5e Fix `get_logger()` collapse of nested sub-pkgs
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.

- detect a true leaf-mod by comparing the caller's `__name__`
  vs `__package__` (a pkg `__init__` has them equal -> its
  trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
  from a bare `devx` -> `tractor.devx`; the old unconditional
  `pkg_path = subpkg_path` collapsed both to `tractor.devx` and
  silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
  the leaf-mod is in the `{filename}` header field).

Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
  addressable at ANY depth; leaf *modules* intentionally aren't
  (they're the `{filename}`); top-level mods (eg. `to_asyncio`)
  still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
  new literal explicit-`name` contract (no leaf-collapse).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9c36363b01)
2026-06-17 17:39:44 -04:00
Gud Boi 8bb9df9e06 Lift `--ll`/`--tl` to plugin + `LogSpec` API
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:

Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
  fixture, and `test_log` fixture out of the test-suite-
  local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
  `--ll`: `--ll` is the consuming-project's OWN app
  loglevel (scoped to its pkg-hierarchy), `--tl` is the
  `tractor.*` runtime loglevel. `--tl` falls back to
  `--ll` when unset (preserves current `tractor`-suite
  behavior).
- add `testing_pkg_name` session fixture (default
  `'tractor'`) — downstream projects override to e.g.
  `'modden'` so `--ll` scopes to their own hierarchy
  instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
  tractor-runtime level (passed to
  `open_root_actor(loglevel=<.>)` by `@tractor_test`)
  AND separately applies `--ll` to the
  `testing_pkg_name` hierarchy when that isn't
  `tractor`. `test_log` scopes the per-test logger to
  `testing_pkg_name`.

`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
  - `True` → enable `pkg_name` root at `default_level`
    (fallback `'cancel'`).
  - `False` → no-op.
  - bare level eg. `'info'` → root-logger at that level.
  - `'sub:info,x:cancel'` → per-sub-logger filter-spec;
    each `<name>` is RELATIVE to `pkg_name` (must NOT
    include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
  `None` key = root-logger. Mixed bare-level + filters
  in one spec is rejected w/ a helpful err msg; so is
  embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
  parses then enables a `colorlog` stderr handler per
  named (sub)logger. Authoritative sub-logger filters
  get `propagate=False` so they don't double-emit
  through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
  sub-pkg granularity, not leaf-module — so `devx.debug`
  collapses to the same `tractor.devx` logger as a bare
  `devx`, and top-level lib modules (eg.
  `tractor.to_asyncio`) emit under the *root* logger
  rather than a phantom `to_asyncio` child. Documented
  inline on `LogSpec`.

Other,
- `tests/conftest.py` keeps a NOTE pointing to the
  plugin for future-debugging clarity (don't remove
  silently — the lift is the relevant signal).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 19a77708ba)
2026-06-17 17:39:44 -04:00
Gud Boi 944262d8a6 Add `maybe_signal_aio_task()` + cause-chain guard
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
  `trio_err.__cause__` back onto its own derivative relay.

`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
  `aio_task.set_exception()` which on py3.13+ ALWAYS raises
  `RuntimeError("Task does not support set_exception")` (dead code as
  a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
  flip `wait_on_aio_task` when delivery failed (avoids hanging on
  `_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
  checkpoint between capturing `_fut_waiter` and invoking the helper.
  `Task._wakeup` clears `_fut_waiter = None` so re-reading
  post-checkpoint loses the ref even though the exc is still in-flight
  on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
  a "trio_err -> caused -> relay" chain through `set_exception()`
  → `Task._wakeup` → coro raise → `wait_on_coro_final_result`
  → `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
  narrow case where `_fut_waiter is None` AND task is runnable (sitting
  in asyncio's ready queue, not parked on a poke-able future). NEVER
  cancels when `_fut_waiter` carries an in-flight exc — that would race
  + mask the real terminating exc.

`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
  / `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
  `trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
  — tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
  exactly one terminal `raise X [from Y]` or early `return`. Covers the
  sibling `signal_trio_when_done()` resolution + the relay-echo
  INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
  (`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
  trio_err`, raise the bare `trio_err` instead of `trio_err from
  aio_err` (which would CYCLE the cause chain since the relay was itself
  caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
  EXCEPT or this WILL HANG" warning — the helper handles the failure
  path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
  the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
  shows the real root.

`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
  + `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
  `cause=` — non-`None` only when the `try` body raised (graceful exit
  → None).

Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
  trio_done_fute` next to the shielded version — exploratory note for
  the longstanding "why is `.shield()` needed?" TODO.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit acd1cbeec4)
2026-06-17 17:39:44 -04:00
Gud Boi b57d095204 Drop `debug_mode` gate on stackscope SIGUSR1
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).

Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
  conjunction around the `enable_stack_on_sig()` call;
  now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
  `use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
  of the `if rvs['_debug_mode']:` block to peer-level.
  The `use_greenback` branch stays inside `_debug_mode`
  (pdb-specific).
- Refresh inline comments on both sites to call out the
  infected-`asyncio` "default SIGUSR1 = terminate proc"
  rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3d9c75b6ed)
2026-06-17 17:39:44 -04:00
Gud Boi fad1227d7c Filter `_find_tractor_strays` by ppid disposition
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).

Previously, procs whose `ppid` points to some OTHER live non-`pytest`
(in the use of `acli.ptree pytest`) process belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon) and were being falsely flagged as cross-test ghosts.

Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
  short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
  filter rule + its rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit a6d4ac3aac)
2026-06-17 17:39:44 -04:00
Gud Boi 98522661d6 Add init-adopted orphan reap to `reap_subactors_per_test`
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 01ce2857ea)
2026-06-17 17:39:44 -04:00
Gud Boi dfe853e159 Add subtree-walk to `reap()` for full actor-tree teardown
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).

Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8de684f5de)
2026-06-17 17:39:44 -04:00
Gud Boi f0a1971814 Add hang-snapshot session index to pytest summary
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
  populated by `_do_capture_snapshot()` on each successful dump;
  add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
  prints all captured snapshot dirs at end-of-session so paths
  don't get buried in scrollback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit fb87c36263)
2026-06-17 17:39:44 -04:00
Gud Boi 46c1147f6e Add stray-proc scan + refine `_testing.trace` capture
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
  `tractor._child` procs NOT in the walk's `seen` set — surfaces
  ghost subactor trees from prior test runs (cross-test launchpad
  contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
  into `_classify_walk()` closure, walk stray roots as additional
  trees, emit stray-root summary in header. Also: `tractor._child`
  procs reparented to init are now always classified as orphans
  regardless of cgroup-slice (leaked subactor ≠ desktop-launched
  app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
  `--capture=sys` redirection so snapshot paths always land on the
  real terminal
- `fail_after_w_trace()`: capture diag snapshot on
  non-`TooSlowError` exceptions when the `fail_after` scope's
  cancel had already fired (e.g. nursery wraps `Cancelled` into a
  `BaseExceptionGroup` that escapes before `TooSlowError` can be
  raised).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3a243a1fd4)
2026-06-17 17:39:44 -04:00
Gud Boi da2d5bb3d2 Mv core impl `tractor_diag.xsh` to `_testing.trace`
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.

Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
  hung-state dumper, bindspace scanner, `dump_all()` snapshot
  archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
  (trio `fail_after` + auto-snapshot on `TooSlowError`),
  `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
  `SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
  import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
  `_testing.trace` — drops ~627 lines of inline impl. Add
  `acli.dump_all` alias for full snapshot-bundle CLI access.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7509e313ff)
2026-06-17 17:39:44 -04:00
Gud Boi aced458350 Fix `is_forking_spawner` fixture to call helper fn
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 83b6a3373a)
2026-06-17 17:39:44 -04:00
Gud Boi 9f9a536fa9 Adjust legacy streaming test timeouts for fork+UDS
Forking spawner + UDS transport has different timing
vs `trio_proc` — streaming example completes faster
in some cases, slower in others depending on fork
overhead + sock setup.

Deats,
- add `expect_cancel` param to `cancel_after()`, raise
  `ActorTooSlowError` when cancel scope fires unexpectedly instead of
  silently returning `None`.
- `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit
  `ActorTooSlowError` on `None` result instead of bare `assert results`.
- `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast"
  (cancel doesn't fire bc streaming finishes before delay).
- add `is_forking_spawner`, `tpt_proto` fixture params throughout.

Also,
- `_testing/pytest.py`: widen `start_method` parametrize and
  `is_forking_spawner` fixture to `scope='session'`.
- `"""` -> `'''` docstring style throughout.
- hoist `_non_linux` to module scope (was redefined locally in two
  places).
- type hints, kwarg-style `partial()` calls.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d3cbc92751)
2026-06-17 17:39:44 -04:00
Gud Boi 3c6ea8009b Add `_is_tractor_subactor()`, cgroup-aware `ptree`
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.

Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
  `tractor._child` markers, falls back to `/proc/<pid>/comm` for
  zombie-resilient detection (kernel preserves `comm` past exit
  until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
  — the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
  `/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
  `_is_tractor_subactor()` for intrinsic sub-actor ID instead of
  cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
  zombie-state sub-actors

Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
  distinguish `system.slice` services vs `user.slice` desktop apps
  from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
  `user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
  tractor importable from active venv

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 522b57570b)
2026-06-17 17:39:44 -04:00
Gud Boi 338f0a1463 Add per-actor `setproctitle` via `devx._proctitle`
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.

Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
  pkg with `ImportError` guard; optional at runtime
  but listed in `pyproject.toml` so default installs
  benefit.
- called early in `_child._actor_child_main()` after
  `Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
  unit test, `/proc/{cmdline,comm}` integration
  test, negative detection test.

Resolves #457

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d60245777e)
2026-06-17 17:39:44 -04:00
Gud Boi 9426fea5bd Escalate cancel-ack timeouts to `proc.terminate()`
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:

  graceful cancel-req -> bounded wait -> hard-kill

Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
  When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
  of the legacy DEBUG-log + return-`False` path. Default stays `False`
  for callers that handle their own escalation (e.g.
  `_spawn.soft_kill()` polling `proc.poll()`).

- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
  cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
  (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
  waiting on `proc.poll()`.

- replace `tn.start_soon(portal.cancel_actor)` in
  `ActorNursery.cancel()` with the helper.

Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
  REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
  opt-in).

ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.

Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 34f333a026)
2026-06-17 17:39:44 -04:00
Gud Boi e9b63e8569 Add `ActorTooSlowError` for cancel-cascade timeouts
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:

  graceful cancel-req -> bounded wait -> hard-kill

Motivated by #subint_forkserver dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 38ffb875bd)
2026-06-17 17:39:44 -04:00
Gud Boi 4c600dc528 Tidy proto-guard `ValueError` fmt in `open_root_actor()`
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5cd06810db)
2026-06-17 17:39:44 -04:00
Gud Boi 254a46c345 Mk `--capture` guard CI-aware w/ local warn
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:

- CI (`CI` env-var set): `pytest.exit(rc=2)` on
  mismatch — forces every matrix-row to declare
  `--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
  experiment with `--capture=fd` to validate fixes.

Deats,
- drop `_cap_fd_set` global; add
  `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
  spawner-name lookup
- move inline comment wall → proper docstring w/
  Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
  `request: pytest.FixtureRequest` and reads
  `request.config.option.capture` instead of the
  `_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
  capture deadlocks)
- `set_fork_aware_capture()` returns the actual
  capture mode str from config, not a hardcoded
  `'sys'`
- lift `import warnings` to module level (was duped
  inside `pytest_configure`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 255c9c3a7c)
2026-06-17 17:39:44 -04:00
Gud Boi a0456fece4 Add `enable_transports`/`registry_addrs` proto guard
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).

Also,
- update `test_root_passes_tpt_to_sub` to detect a
  proto mismatch between parametrized `tpt_proto_key`
  and CLI `tpt_proto`, asserting the new guard raises
  `ValueError` with expected msg content.
- replace old commented-out notes with a clearer
  explanation of the mismatch foot-gun.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d036ef7d7f)
2026-06-17 17:39:44 -04:00
Gud Boi 0c64e76bf9 Fix shutdown deadlock on UDS unlink race
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.

Also,
- close each endpoint independently in the finally so one raise doesn't
  strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 2ee44a6fdd)
2026-06-17 17:39:44 -04:00
Gud Boi cc5be294b1 Mk per-test reap fixtures opt-in
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.

Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
  per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
  psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e4953851de)
2026-06-17 17:39:44 -04:00
Gud Boi 63c6da9e82 Use single f-string per pid in runaway warning
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 086e9f2c07)
2026-06-17 17:39:44 -04:00
Gud Boi 98fdaf343d Add `tractor.trionics.patches` subpkg + first fix
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.

New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.

Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.

First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
  ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
  & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
  instance whose write-end is closed, and `drain()` then **spins forever
  in C with no Python checkpoints**,
- this obviously burns 100% CPU and no signal delivery.

Standalone repro:

    from trio._core._wakeup_socketpair import WakeupSocketpair
    ws = WakeupSocketpair()
    ws.write_sock.close()
    ws.drain()  # spins forever

Patch is one-line — break the drain loop on b'' EOF.

Manifested as two distinct test failures:

- `tests/test_multi_program.py::test_register_duplicate_name` hung at
  100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
  deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
  both threads parked in `epoll_wait`, no TCP connect-back to parent
  ever happened.

Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).

Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.

Conc-anal write-ups, including strace + py-spy evidence:

- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`

Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.

TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue
      body's evidence section.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0ef549fadb)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
2026-06-17 17:39:44 -04:00
Gud Boi 73c9b9ec93 Add `tractor.spawn._reap.unlink_uds_bind_addrs()`
Inside a new new `tractor.spawn._reap` submod which kicks off providing
post-mortem subactor cleanup primitives, parent-side; consider it the
"sibling" of `tractor._testing._reap` which is the test-harness-oriented
brother mod.

Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454
where `hard_kill()`'s `SIGKILL` bypasses the subactor's
`_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files..

This adds 2 cleanup paths:
- explicit `bind_addrs` (when set at spawn time),
OR
- convention-based reconstruction from `subactor.aid.name + proc.pid`
  for the random-self-assign case.

`.spawn.hard_kill()` now invokes the cleanup unconditionally
post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError`
skip.

Future work — authoritative tracking via a per-process
UDS bind-addr registry — documented in module docstring,
deferred to a follow-up PR.

Co-fix: `tractor/spawn/_trio.py::new_proc` already passes
`bind_addrs` + `subactor` to `hard_kill` via prior work
on this branch.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e9712dcaeb)
2026-06-17 17:39:44 -04:00
Gud Boi a4f6d893ed Add per-test runaway-subactor CPU detector to `_reap`
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).

Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. Session-end reaper still
SIGINT/SIGKILL survivors on normal exit.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5cf0312c78)
2026-06-17 17:39:44 -04:00
Gud Boi 8589a796c1 Fix `maybe_override_capture` to not get invalid capX fixture names..
(cherry picked from commit 32e89c67ee)
2026-06-17 17:39:44 -04:00
Gud Boi afa14b01d2 Add fork-aware capture fixtures to `_testing.pytest`
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.

Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
  globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
  args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enalbes skipping tests that need capsys
  but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
  fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
  wrapper's fixture injection.

Also,
- add `alert_on_finish` session fixture (terminal
  bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
  our plugin eventually however..

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d549c72052)
2026-06-17 17:39:44 -04:00
Gud Boi 99098c5a14 Update `sync_bp` + tighten `test_pause_from_sync`
Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.

Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
  no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
  / `|_<Thread` / `|_('subactor'` prefixes to match
  actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`
  `tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit fc2e298a29)
2026-06-17 17:39:44 -04:00
Gud Boi 158506d8c0 Add `use_stackscope` runtime var for subactor init
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.

Deats,
- add `use_stackscope: bool` to `RuntimeVars`
  struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
  successful `stackscope` import, asserts unset
  on `ImportError`
- nest stackscope init under `_debug_mode` gate
  in `Actor.async_main`, check rtvar alongside
  env var
- defer `maybe_init_greenback` import to its own
  `use_greenback` branch

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 48523358cf)
2026-06-17 17:39:44 -04:00
Gud Boi e1da7be4d9 Fix `SIGUSR1` tree-dump ordering in `_stackscope`
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.

Also,
- gate file/tty sink writes behind `write_file` +
  `write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
  new sequential parent -> relay-log -> sub ordering.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e2b790a70d)
2026-06-17 17:39:44 -04:00
Gud Boi af72192461 Add `pytest_load_initial_conftests()` for `--capture=`
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.

Also,
- load `tractor._testing.pytest` via `-p` in ini
  (bc bootstrapping hooks must register before
  conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
  tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
  per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
  and pop env var on disable.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 61d4525137)
2026-06-17 17:39:44 -04:00
Gud Boi 80048ff6ad Add UDS orphan-sweep helpers + reap fixtures to `_reap`
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:

- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
  `reap_uds()` — detect `<name>@<pid>.sock` files under
  `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
  the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
  lingering subactors, wait, SIGKILL survivors, then sweep orphaned
  UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
  snapshot sock-file dir before/after each test, warn + reap new
  orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
  with known-leaky teardown.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1cdc7fb302)
2026-06-17 17:39:44 -04:00
Gud Boi d4eac06de2 Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.

In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
  `loglevel` and stashing an `env_ll_report` to emit once the
  console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
  `log.get_console_log()` init *up* so the env-var report
  routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
  `start_method` and warn-logging when the env-var clobbers
  an explicit caller value.

Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
  and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
  `pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
  `main_thread_forkserver` and `subint_forkserver` bc
  example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
  re-route a later in-process tractor test's spawn-backend
  or loglevel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 208e7c0926)
2026-06-17 17:39:44 -04:00
Gud Boi c31b85587a Route `stackscope` SIGUSR1 onto trio loop
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.

Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.

Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).

`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.

Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
  hint in the rendered dump header so users know where
  to find the artifact even when stdio is captured.
- swap the inline `f'     |_{actor}'` line for a
  `_pformat.nest_from_op` rendering of `actor_repr`
  (matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
  branches now note `(trio_token captured: <bool>)`
  so it's obvious from the log whether the full-tree
  path is wired.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 2d4995e08d)
2026-06-17 17:39:44 -04:00
Gud Boi 1d185e4d94 Add `--enable-stackscope` pytest plugin flag
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.

Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
  * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
    inherit it via environ,
  * installs the handler in pytest itself via
    `enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
  existing `_debug_mode` gate to ALSO fire when
  `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
  install the same handler at runtime startup.

Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:

- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
  always written) — guaranteed-readable artifact even
  under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
  tty; ignored if device is missing.

Other,
- squelch the benign `RuntimeWarning` ("coroutine method
  'asend'/'athrow' was never awaited") from
  `stackscope._glue`'s import-time async-gen type
  introspection so `--enable-stackscope` setup stays
  quiet.
- log msg in the `_runtime` ImportError branch now
  mentions `--enable-stackscope` alongside debug-mode.

Usage,
  pytest --enable-stackscope -k <hang-test>
  # in another shell, find the pid + signal:
  kill -USR1 <pytest-or-subactor-pid>
  # tail the artifact:
  tail -f /tmp/tractor-stackscope-<pid>.log

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5418f2dc3c)
2026-06-17 17:39:44 -04:00
Gud Boi 3954d9f527 Return parent `pid: int` from new `reap_subactors_per_test` fixture
(cherry picked from commit f8178df0fd)
2026-06-17 17:39:44 -04:00
Gud Boi ec5141b720 Add opt-in `reap_subactors_per_test` fixture
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.

Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...

Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.

Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b376eb0332)
2026-06-17 17:39:44 -04:00
Gud Boi d24ccaada1 Fix `_testing.addr.get_rando_addr` cross-process collisions
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".

Switch to per-call `random.randint()` salted with `os.getpid()`
so:

- within one session: two calls return distinct ports — e.g.
  `test_tpt_bind_addrs::bind-subset-reg` now actually gets two
  different reg addrs on the TCP backend (it was silently
  duplicating before),
- across parallel sessions: pid salt biases each process's
  port choices apart, making cross-run collisions
  vanishingly rare.

Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7c5dd4d033)
2026-06-17 17:39:44 -04:00
Gud Boi f7a58b82fe Sweep `subint_forkserver` → `main_thread_forkserver` in code
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:

- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
  `main_thread_forkserver`, drop the stub-only `subint_forkserver`
  entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
  capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
  xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
  reason-text since the `mp.SharedMemory(track=False)`
  + resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
  the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
  `_testing.pytest`, `_subint.py`, `pyproject.toml`,
  `test_cancellation.py`, `test_registrar.py` — refs to variant-1
  backend updated.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 205382a39b)
(factored: dropped spawn-backend-only path: tractor/spawn/_subint.py)
2026-06-17 17:39:44 -04:00
Gud Boi e5ca5bb017 Add `--shm` orphan sweep to `tractor-reap`
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.

Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
  + `reap_shm()` helpers. Match criteria: regular file
  under `/dev/shm`, owned by current uid, AND no live
  proc has it open (mmap'd or fd-held). In-use
  enumeration via `psutil.Process.memory_maps()` +
  `.open_files()` — xplatform, kernel-canonical (same
  answer `lsof` would give), no reliance on
  tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
  `NotImplementedError` outside Linux/FreeBSD bc macOS
  POSIX shm has no fs-visible path (`shm_open` only)
  and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
  process reap) and `--shm-only` (skip process phase)
  flags. `-n` dry-runs both phases. Exit code is `1`
  if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
  the `testing` dep group; lazy-imported in `_reap.py`
  so the process-reap path stays import-clean without
  it.

Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
  (new section 10c) — covers match criteria + the
  preservation guarantee for unrelated apps.
- flip mitigation status in
  `subint_forkserver_mp_shared_memory_issue.md` from
  "could extend `tractor-reap`" to "implemented", with
  a note that callers should still UUID-pin shm keys to
  avoid cross-session collisions.

Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4f12d69b41)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-17 17:39:44 -04:00
Gud Boi 0b6a7aa1a9 Fix `SharedMemory` under `subint_forkserver`
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.

Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
  `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
  now run unconditionally:
  * monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
    `ManTracker` subclass (empty `register` / `unregister`
    / `ensure_running`).
  * return `partial(SharedMemory, track=False)` for the per-allocation
    opt-out.
  * belt + suspenders: even if something dodges the wrapper, the
    singleton can't talk to the inherited (broken) parent fd.

- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
  skip of the unlink-callback; install a `try_unlink()` wrapper that
  swallows `FileNotFoundError` (sibling-already-cleaned race in
  shared-key setups). Without `mp.resource_tracker` doing it for us, we
  own the unlink — `actor. lifetime_stack` is the right place since
  tractor already controls actor lifecycle.

- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
  module-level skip- list (tests pass now). Inline comment cross-refs
  the two `_mp_bs` / `_shm` workarounds.

- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
  rewrite — flips status from "open / unresolvable in tractor" to
  "resolved, kept as decision record". Adds Resolution section, "Why
  this is the right call" rationale (mp tracker is widely criticized;
  tractor already owns lifecycle), trade-offs (crash-leaked segments,
  lost mp leak warning), verification (7 passed under both
  `subint_forkserver` and `trio` backends), and upstream issue links

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit aa3e230926)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-17 17:39:44 -04:00
Gud Boi 48ace6dd82 Add `_testing._reap` + auto-reap fixture
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.

Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
  `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
  - `find_descendants(parent_pid)` for the in-session case
    (PPid-direct-match while pytest is still alive).
  - `find_orphans(repo_root)` for the CLI / post- mortem case (`PPid==1`
    reparented to init + `cwd` filter to repo root + `python` cmdline
    filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
  all, poll up to `grace` for exit, SIGKILL any survivors. Returns
  `(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
  `tractor/_testing/pytest.py` — after `yield`, runs
  `find_descendants(os.getpid())` + `reap(...)` so each pytest session
  leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
  commit) for the pytest-died-mid-session case where the in-session
  fixture didn't get to run.

Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
  module-top in `pytest.py` (was inline-imported inside
  `pytest_generate_tests`), and reuse it in
  `pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
  mark arg is a valid spawn-method literal — catches typos at collection
  time.
- inline `# ?TODO` flags running these through the `try_set_backend`
  checker for stronger validation.

Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eae478f3d5)
2026-06-17 17:39:44 -04:00
Gud Boi af0b1fb2dc Bound peer-clear wait in `async_main` finally
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e312a68d8a)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-17 17:39:44 -04:00
Gud Boi d328177873 Break parent-chan shield during teardown
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8ac3dfeb85)
2026-06-17 17:39:44 -04:00