2678 Commits (fdac157d3d74f2e641468289f522e8b95470f3e6)

Author SHA1 Message Date
Gud Boi fdac157d3d Harden cancel-ack hard-kill escalation
Two defensive fixes around the `Portal.cancel_actor()` +
`_try_cancel_then_kill()` escalation from `34f333a0`
"Escalate cancel-ack timeouts to `proc.terminate()`" (the
`trionics.start_or_cancel` follow-up); surfaced by
`/code-review high` on #462,

- guard `proc.terminate()` for backends whose `proc` slot
  isn't a `Process` — the future `subint` backend stores an
  `int` interp-id, so escalation would `AttributeError`
  instead of hard-killing; now it logs + no-ops.
- swap `assert cs.cancelled_caught` for an
  `if cs.cancelled_caught and raise_on_timeout:` guard so an
  unexpected shielded-scope exit returns a soft `False`
  rather than crashing `cancel_actor()` mid-teardown.
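
The first guard can be pictured as a capability check ahead of the hard-kill. This is a minimal sketch, not tractor's actual code: `hard_kill()` is a hypothetical helper name, and the only facts taken from the message are that `proc` may be a `Process`-like object or a bare `int` interp-id.

```python
import logging

log = logging.getLogger(__name__)


def hard_kill(proc: object) -> bool:
    '''
    Hypothetical helper: call `proc.terminate()` only when the
    backend's `proc` slot actually provides it.

    Returns `True` if a terminate was issued, `False` if skipped.

    '''
    terminate = getattr(proc, 'terminate', None)
    if not callable(terminate):
        # e.g. an `int` interp-id: log and no-op instead of
        # raising `AttributeError` mid-teardown.
        log.warning('Cannot hard-kill non-process handle: %r', proc)
        return False

    terminate()
    return True
```

Checking for the method rather than for a concrete class keeps the sketch independent of which process type a given spawn backend stores.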

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:46:04 -04:00
Gud Boi d28173f4c0 Make SIGUSR1 `stackscope` dumps actually work
Two fixes to the hang-debug SIGUSR1 task-tree dump path,
surfaced by `/code-review high` on #462,

- re-add `_debug_mode` to the sub-actor handler-install gate
  in `_runtime.py`. Dropping it (rel. `3a386ba5`/`3d9c75b6`
  "Drop debug_mode gate", from the `custom_log_levels_api`
  follow-up) was meant to *also* enable non-pdb runs, but
  nothing sets `use_stackscope` from `debug_mode`, so
  debug-mode subs were left with NO handler — and the default
  SIGUSR1 disposition then *kills* them. Now additive:
  `_debug_mode OR use_stackscope OR env`.
- pass `write_file=True` at both `dump_task_tree()` SIGUSR1
  call sites so the advertised `/tmp/tractor-stackscope-<pid>`
  `.log` tee is actually written (was dead under
  `--capture=fd`). Matches `1b1ef10a` "Re-enable writing
  `stackscope` to file by default"; param from `0df90500`.
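
The additive gate from the first bullet could look roughly like the predicate below. It is a sketch under assumptions: the function name is hypothetical, the env-var name `TRACTOR_ENABLE_STACKSCOPE` is taken from the "Drop `debug_mode` gate" commit message, and treating any non-empty value as enabled is a guess.

```python
import os
from collections.abc import Mapping


def should_install_sigusr1_handler(
    rt_vars: Mapping[str, object],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    '''
    Hypothetical predicate: any ONE of the three opt-ins is enough
    to install the SIGUSR1 handler in a sub-actor.

    '''
    return bool(
        rt_vars.get('_debug_mode')
        or rt_vars.get('use_stackscope')
        # assumption: any non-empty value counts as "enabled"
        or environ.get('TRACTOR_ENABLE_STACKSCOPE')
    )
```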

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:45:21 -04:00
Gud Boi f08a7d52b5 Drop stray `breakpoint()` in `RuntimeVars.__setattr__`
Left-over debug trap from the `_runtime_vars` pure get/set
refactor — it fired on *every* struct-form rt-var write (e.g.
via `.update()`), hanging any non-tty / CI / forked actor on
`pdb` stdin.

Surfaced by a `/code-review high` pass on #462.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 19:44:34 -04:00
Gud Boi 41b5371473 Relock `uv.lock` after rebase onto `main`
Regenerate the lockfile so it's consistent with the
post-rebase `pyproject.toml` — which now carries both #461's
landed tooling (`pytest>=9.0.3`, …) and this branch's
tractor deps (`setproctitle`, `pytest-timeout`, `psutil`),

- `uv lock` resolves the merged dep set against the landed
  `main` baseline.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 17:39:44 -04:00
Gud Boi 6d45581910 Add autouse fixture to reset `_runtime_vars` per-test
`open_root_actor()` writes `_enable_tpts` (and friends) into
the process-global `_state._runtime_vars` dict but nothing
resets it on actor teardown. Under the in-proc `pytest`
launchpad a uds-using test leaks `_enable_tpts=['uds']` into
a sibling tcp test, tripping the
`registry_addrs`×`enable_transports` proto-guard in
`open_root_actor()` with a `ValueError`.

New `_reset_runtime_vars` fixture snapshots + restores the
dict around every test so no runtime-var state crosses a
test boundary.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-17 17:39:44 -04:00
Gud Boi 6a7dea45bb Bump trio `echoserver` cancel timeout 1→4s
Same trio 0.29 → 0.33 cancel-cascade slowdown that hit
`test_nested_multierrors` (ea67f1b6) — bumps the
`trio`-backend (non-debug, non-forking) budget in
`test_echoserver_detailed_mechanics` from 1s → 4s.

- The 1s budget raced the ~1s teardown deadline. On a
  deadline-fire trio 0.33 injects
  `Cancelled(source='deadline')` (cancel-reason
  metadata) that wraps the mid-stream KBI in a
  `BaseExceptionGroup`, breaking the bare
  `pytest.raises(KeyboardInterrupt)` below.
- Bump matches the forking-spawner branch (4s).
- Inline NOTE references the tracking issue
  `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d7da502d93)
(cherry picked from commit d0144e52cb)
2026-06-17 17:39:44 -04:00
Gud Boi fa0208e65f Bump trio depth=3 cancel timeout 6→12s
trio 0.29 → 0.33 lock bump (c7741bba) slowed the
depth=3 cancel-cascade in `test_nested_multierrors`
from <6s to ~7-8s; the 6s deadline was firing and its
`Cancelled(source='deadline')` (trio 0.33's new
cancel-reason metadata) collapsed a BEG branch,
breaking the `RemoteActorError` assertion downstream.

- Split the `('trio', _)` case-match into per-depth
  arms: `('trio', 1)` keeps 6s (still finishes in
  ~3s); `('trio', 3)` → 12s.
- Updated inline NOTE explains the version pivot +
  links the tracking issue
  `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
- Existing MTF/`subint_forkserver` budgets unchanged.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit ea67f1b67b)
(cherry picked from commit 57b3ea59ea)
2026-06-17 17:39:44 -04:00
Gud Boi b6bf865f5e Fix `get_logger()` collapse of nested sub-pkgs
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.

- detect a true leaf-mod by comparing the caller's `__name__`
  vs `__package__` (a pkg `__init__` has them equal -> its
  trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
  from a bare `devx` -> `tractor.devx`; the old unconditional
  `pkg_path = subpkg_path` collapsed both to `tractor.devx` and
  silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
  the leaf-mod is in the `{filename}` header field).

Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
  addressable at ANY depth; leaf *modules* intentionally aren't
  (they're the `{filename}`); top-level mods (eg. `to_asyncio`)
  still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
  new literal explicit-`name` contract (no leaf-collapse).
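
A rough sketch of the leaf-module detection for the `get_logger(__name__)` path. The function name and the example module names are hypothetical; the only rule taken from the message is the `__name__` vs `__package__` comparison.

```python
def sublogger_name(mod_name: str, mod_package: str) -> str:
    '''
    Hypothetical mapping from a caller's `__name__`/`__package__`
    to the logger name it should emit under.

    '''
    tokens: list[str] = mod_name.split('.')

    # a pkg `__init__` has `__name__ == __package__`, so its
    # trailing token is a real sub-pkg and must be kept.
    is_leaf_module: bool = mod_name != mod_package

    if is_leaf_module and len(tokens) > 1:
        # the leaf is already shown via `{filename}` in the header
        tokens = tokens[:-1]

    return '.'.join(tokens)
```

Under these assumptions a leaf module inside `tractor.devx.debug` maps to `tractor.devx.debug`, while a top-level module such as `tractor.to_asyncio` maps to the root `tractor` logger, matching the updated `LogSpec` caveat.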

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9c36363b01)
2026-06-17 17:39:44 -04:00
Gud Boi 8bb9df9e06 Lift `--ll`/`--tl` to plugin + `LogSpec` API
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:

Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
  fixture, and `test_log` fixture out of the test-suite-
  local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
  `--ll`: `--ll` is the consuming-project's OWN app
  loglevel (scoped to its pkg-hierarchy), `--tl` is the
  `tractor.*` runtime loglevel. `--tl` falls back to
  `--ll` when unset (preserves current `tractor`-suite
  behavior).
- add `testing_pkg_name` session fixture (default
  `'tractor'`) — downstream projects override to e.g.
  `'modden'` so `--ll` scopes to their own hierarchy
  instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
  tractor-runtime level (passed to
  `open_root_actor(loglevel=<.>)` by `@tractor_test`)
  AND separately applies `--ll` to the
  `testing_pkg_name` hierarchy when that isn't
  `tractor`. `test_log` scopes the per-test logger to
  `testing_pkg_name`.

`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
  - `True` → enable `pkg_name` root at `default_level`
    (fallback `'cancel'`).
  - `False` → no-op.
  - bare level eg. `'info'` → root-logger at that level.
  - `'sub:info,x:cancel'` → per-sub-logger filter-spec;
    each `<name>` is RELATIVE to `pkg_name` (must NOT
    include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
  `None` key = root-logger. Mixed bare-level + filters
  in one spec is rejected w/ a helpful err msg; so is
  embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
  parses then enables a `colorlog` stderr handler per
  named (sub)logger. Authoritative sub-logger filters
  get `propagate=False` so they don't double-emit
  through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
  sub-pkg granularity, not leaf-module — so `devx.debug`
  collapses to the same `tractor.devx` logger as a bare
  `devx`, and top-level lib modules (eg.
  `tractor.to_asyncio`) emit under the *root* logger
  rather than a phantom `to_asyncio` child. Documented
  inline on `LogSpec`.

Other,
- `tests/conftest.py` keeps a NOTE pointing to the
  plugin for future-debugging clarity (don't remove
  silently — the lift is the relevant signal).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 19a77708ba)
2026-06-17 17:39:44 -04:00
Gud Boi 944262d8a6 Add `maybe_signal_aio_task()` + cause-chain guard
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
`trio_err.__cause__` back onto its own derivative relay.

`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
  `aio_task.set_exception()` which on py3.13+ ALWAYS raises
  `RuntimeError("Task does not support set_exception")` (dead code as
  a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
  flip `wait_on_aio_task` when delivery failed (avoids hanging on
  `_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
  checkpoint between capturing `_fut_waiter` and invoking the helper.
  `Task._wakeup` clears `_fut_waiter = None` so re-reading
  post-checkpoint loses the ref even though the exc is still in-flight
  on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
  a "trio_err -> caused -> relay" chain through `set_exception()`
  → `Task._wakeup` → coro raise → `wait_on_coro_final_result`
  → `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
  narrow case where `_fut_waiter is None` AND task is runnable (sitting
  in asyncio's ready queue, not parked on a poke-able future). NEVER
  cancels when `_fut_waiter` carries an in-flight exc — that would race
  + mask the real terminating exc.

`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
  / `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
  `trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
  — tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
  exactly one terminal `raise X [from Y]` or early `return`. Covers the
  sibling `signal_trio_when_done()` resolution + the relay-echo
  INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
  (`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
  trio_err`, raise the bare `trio_err` instead of `trio_err from
  aio_err` (which would CYCLE the cause chain since the relay was itself
  caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
  EXCEPT or this WILL HANG" warning — the helper handles the failure
  path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
  the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
  shows the real root.

`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
  + `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
  `cause=` — non-`None` only when the `try` body raised (graceful exit
  → None).

Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
  trio_done_fute` next to the shielded version — exploratory note for
  the longstanding "why is `.shield()` needed?" TODO.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit acd1cbeec4)
2026-06-17 17:39:44 -04:00
Gud Boi b57d095204 Drop `debug_mode` gate on stackscope SIGUSR1
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).

Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
  conjunction around the `enable_stack_on_sig()` call;
  now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
  `use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
  of the `if rvs['_debug_mode']:` block to peer-level.
  The `use_greenback` branch stays inside `_debug_mode`
  (pdb-specific).
- Refresh inline comments on both sites to call out the
  infected-`asyncio` "default SIGUSR1 = terminate proc"
  rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3d9c75b6ed)
2026-06-17 17:39:44 -04:00
Gud Boi 4b70b564c8 Use trace CM helpers in `test_infected_asyncio`
Adopt the `_testing.trace` CM helpers in two MTF-hang-prone
tests so on-timeout we get a fresh
`ptree`/`wchan`/`py-spy` diag snapshot on disk instead of
opaque pytest timeout-kills. Same shape as bd07a95d for
`test_dynamic_pub_sub`.

Deats,
- `test_echoserver_detailed_mechanics`:
  * inner `trio.fail_after` → `fail_after_w_trace`. Adds
    `fail_after_w_trace: FailAfterWTraceFactory` fixture
    param.
  * mv per-backend `timeout` calc to top of test body (was
    interleaved w/ helper defs).
  * factor deep
    `open_nursery`/`open_context`/`open_stream` body into
    `_body()` so the wrapping `main()` stays a 2-liner —
    keeps the nested-CM block at its natural indent level
    instead of pushing it under yet another `async with`.
  * drop `with_timeout: bool` knob + `fa_main()` helper
    (knob was hard-coded `True`).
- `test_sigint_closes_lifetime_stack`:
  * outer `signal.alarm`/`try`/`finally` → single
    `afk_alarm_w_trace(10)` CM. Adds
    `afk_alarm_w_trace: AfkAlarmWTraceFactory` fixture
    param.
  * drop `_AFK_CAP_S` + `armed_alarm` vars (CM owns both).
  * explanatory comment refreshed to mention
    `AFKAlarmTimeout` + the disk-snapshot side effect.

Other,
- Drop debug `return 1e3` short-circuit from `delay()`
  fixture — snuck in as a scratch line, was clobbering the
  proper `debug_mode`-branched return.
- Top-level import: `FailAfterWTraceFactory`,
  `AfkAlarmWTraceFactory` from `tractor._testing.trace`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1cafaecf52)
2026-06-17 17:39:44 -04:00
Gud Boi da582a4d1b Add `acli.watch` flicker-free alias-loop
Per-terminal optimized `watch`-like xonsh alias that
runs an arbitrary callable alias in a loop inside the
alt-screen buffer with flicker-free repaint. Supersedes
the inline `acli.ptree` polling .xsh snippet (removed
from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).

Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
  (`\033[?25l/h`) wrapped in try/finally so Ctrl-C always
  returns to a pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line
  EL (`\033[K` before each `\n`) + post-draw erase-down
  (`\033[J`) → stale tail chars from a longer prior
  frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next
  frame does a full clear (`\033[H\033[2J`) instead of
  the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler`
  so `KeyboardInterrupt` lands cleanly; prior handler
  restored on exit.
- Output capture: redirect the alias's stdout to
  `StringIO` per frame so we can post-process the EL
  fix. Aliases writing directly to `sys.stdout.buffer`
  / `os.write(1)` bypass capture — EL-fix won't apply
  but loop still works.
- Alias unwrap: xonsh stores callables as either a bare
  callable OR `[fn, *preset_args]`. Both forms handled;
  subprocess-style aliases rejected w/ a friendly err
  msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest
  of argv forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.
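
The per-frame repaint sequence can be sketched as a pure string transform. The function and constant names are hypothetical; the escape codes are the ones listed above.

```python
HOME = '\033[H'  # cursor-home
ERASE_LINE = '\033[K'  # EL: clear from cursor to end of line
ERASE_DOWN = '\033[J'  # clear from cursor to end of screen


def render_frame(text: str) -> str:
    '''
    Wrap captured alias output so that repainting over a longer
    prior frame leaves no stale chars behind.

    '''
    # EL before each newline wipes any leftover tail on that row
    body = text.replace('\n', ERASE_LINE + '\n')
    return HOME + body + ERASE_DOWN
```

Writing the returned string to the terminal in one call, rather than clearing the whole screen first, is what avoids the full-screen flash.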

Other,
- Type-annotate the `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch`
  entry.
- Top-level imports: `os`, `sys`, `signal`, `time`,
  `typing.Callable`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit bb239e847f)
2026-06-17 17:39:44 -04:00
Gud Boi 30242d03fb Add `acli.ptree` poll .xsh snippet to docstr
(cherry picked from commit f617c8cb73)
2026-06-17 17:39:44 -04:00
Gud Boi fad1227d7c Filter `_find_tractor_strays` by ppid disposition
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).

Previously, procs whose `ppid` points to some OTHER live non-`pytest`
process (as seen when running `acli.ptree pytest`) were being falsely
flagged as cross-test ghosts, even though they belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon).

Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
  short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
  filter rule + its rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit a6d4ac3aac)
2026-06-17 17:39:44 -04:00
Gud Boi c8a77fb92b Use trace CM helpers in `test_dynamic_pub_sub`
Replace inline `trio.fail_after` + manual `signal.alarm` guard with the
`_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy
diag snapshot to disk on timeout.

Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
  captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync
  CM, captures on `SIGALRM`), only armed under fork backends.
  Extracts `_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes
  while diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
  `afk_alarm_w_trace`).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit bd07a95d80)
2026-06-17 17:39:44 -04:00
Gud Boi 68698afac7 Harden `test_cancellation` for fork-spawner backends
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
  conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
  `track_orphaned_uds_per_test`,
  `detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
  MTF `xfail(strict=False)` with detailed race-window comment
  explaining the BEG shape mismatch, wrap body in
  `fail_after_w_trace` with per-backend timeout budget, bump
  `@tractor_test(timeout=10)`, drop old multiprocessing depth
  special-casing
- `test_multierror_fast_nursery`: wrap in
  `fail_after_w_trace(30.0)`, accept `TooSlowError` in
  `pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
  `spawn_backend` param for `is_forking_spawner`, widen
  `fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
  `test_cancel_infinite_streamer`, `test_some_cancels_all`: add
  `set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
  covered by module-level `pytestmark`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 32955db02e)
2026-06-17 17:39:44 -04:00
Gud Boi 98522661d6 Add init-adopted orphan reap to `reap_subactors_per_test`
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 01ce2857ea)
2026-06-17 17:39:44 -04:00
Gud Boi dfe853e159 Add subtree-walk to `reap()` for full actor-tree teardown
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).

Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.
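
A sketch of the subtree expansion with the `psutil` fallback. The function name is hypothetical and signal delivery is left out; `psutil.Process.children(recursive=True)` is the real psutil call used for the walk.

```python
try:
    import psutil
except ImportError:
    psutil = None


def expand_with_descendants(pids: list[int]) -> list[int]:
    '''
    Expand each root pid into root + all descendants, keeping
    first-seen order and dropping duplicates.

    '''
    if psutil is None:
        # fall back to the original flat list
        return list(pids)

    expanded: list[int] = []
    for pid in pids:
        candidates: list[int] = [pid]
        try:
            proc = psutil.Process(pid)
            candidates += [
                child.pid
                for child in proc.children(recursive=True)
            ]
        except psutil.Error:
            # root vanished or is inaccessible; keep the bare pid
            pass

        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)

    return expanded
```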

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8de684f5de)
2026-06-17 17:39:44 -04:00
Gud Boi f0a1971814 Add hang-snapshot session index to pytest summary
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session-scoped list
  populated by `_do_capture_snapshot()` on each successful dump;
  add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
  prints all captured snapshot dirs at end-of-session so paths
  don't get buried in scrollback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit fb87c36263)
2026-06-17 17:39:44 -04:00
Gud Boi 46c1147f6e Add stray-proc scan + refine `_testing.trace` capture
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
  `tractor._child` procs NOT in the walk's `seen` set — surfaces
  ghost subactor trees from prior test runs (cross-test launchpad
  contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
  into `_classify_walk()` closure, walk stray roots as additional
  trees, emit stray-root summary in header. Also: `tractor._child`
  procs reparented to init are now always classified as orphans
  regardless of cgroup-slice (leaked subactor ≠ desktop-launched
  app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
  `--capture=sys` redirection so snapshot paths always land on the
  real terminal
- `fail_after_w_trace()`: capture diag snapshot on
  non-`TooSlowError` exceptions when the `fail_after` scope's
  cancel had already fired (e.g. nursery wraps `Cancelled` into a
  `BaseExceptionGroup` that escapes before `TooSlowError` can be
  raised).
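
The `/proc/*/cmdline` scan from the first bullet, as a Linux-only sketch. The function name and parameters are hypothetical; on systems without `/proc` it simply returns an empty list.

```python
from pathlib import Path


def find_child_procs(
    marker: str = 'tractor._child',
    seen: frozenset[int] = frozenset(),
    proc_root: Path = Path('/proc'),
) -> list[int]:
    '''
    Return pids whose cmdline mentions `marker` and which are NOT
    already in the walk's `seen` set.

    '''
    if not proc_root.is_dir():
        return []

    strays: list[int] = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue
        pid = int(entry.name)
        if pid in seen:
            continue
        try:
            raw = (entry / 'cmdline').read_bytes()
        except OSError:
            continue  # proc exited or is unreadable

        # argv entries are NUL-separated
        argv = raw.decode(errors='replace').split('\0')
        if any(marker in arg for arg in argv):
            strays.append(pid)

    return sorted(strays)
```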

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3a243a1fd4)
2026-06-17 17:39:44 -04:00
Gud Boi da2d5bb3d2 Mv core impl `tractor_diag.xsh` to `_testing.trace`
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.

Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
  hung-state dumper, bindspace scanner, `dump_all()` snapshot
  archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
  (trio `fail_after` + auto-snapshot on `TooSlowError`),
  `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
  `SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
  import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
  `_testing.trace` — drops ~627 lines of inline impl. Add
  `acli.dump_all` alias for full snapshot-bundle CLI access.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7509e313ff)
2026-06-17 17:39:44 -04:00
Gud Boi 36ad56843e Harden `test_infected_asyncio` for fork spawners
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner`
  param, wrap `main()` in `fa_main()` with per-backend
  `trio.fail_after` (4s fork / 1s trio) to cap cancel-cascade
  teardown that compounds under forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param
  for `is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so
  KBI firing before `open_context` body doesn't `UnboundLocalError`,
  add `pytest.fail` guard for the spawn-time IPC race case, arm
  `signal.alarm` AFK-safety cap (10s) under fork backends

Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
  `detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
  place).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7ee0dc2e8f)
2026-06-17 17:39:44 -04:00
Gud Boi acb042ec77 Adjust `test_streaming_to_actor_cluster` timeout
For forking spawner backends that is.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b10011a36e)
2026-06-17 17:39:44 -04:00
Gud Boi 75c87831cb Enrich `pytestmark` in `test_inter_peer_cancellation`
- `skipon_spawn_backend('subint')`: expand reason with specific
  analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
  blame-attribute UDS sock-file orphans left by SIGKILL cancel
  cascades.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7d0a53d205)
2026-06-17 17:39:44 -04:00
Gud Boi 105f0c2944 Adjust `test_simple_context` timeout for forking spawner
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 75d5b4cf7b)
2026-06-17 17:39:44 -04:00
Gud Boi 48a0f2fc49 Add `set_fork_aware_capture`, timeout to msg tests
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
  `trio.fail_after(2)` + commented `capfd.disabled()` investigation
  (pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on fork-spawner
  hang prevention.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8aa07a7932)
2026-06-17 17:39:44 -04:00
Gud Boi d70e003184 Add signal-alarm guard to `test_dynamic_pub_sub`
Outer `signal.alarm` cap that fires even when trio's
`fail_after` is blocked by a shielded-await deadlock
(the bug-class-3 hang under MTF backends). Only armed
for fork-based spawners where the bug lives.

Deats,
- `_DIAG_CAP_S = fail_after_s + 5` — slightly larger than the
  trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
  last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
  breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).

Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
  TODO re `pytest.raises()`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 10db117864)
2026-06-17 17:39:44 -04:00
Gud Boi aced458350 Fix `is_forking_spawner` fixture to call helper fn
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 83b6a3373a)
2026-06-17 17:39:44 -04:00
Gud Boi 42881b3d38 Add ppid-aware liveness buckets to `bindspace_scan`
Split the old `live`/`orphans` sock classification
into three ppid-aware buckets: `live-active` (PID
alive, parent owns it), `orphaned-alive` (PID alive
but `ppid==1`, init-adopted — `acli.reap` candidate),
and `orphaned-dead` (PID gone, sock stale).

Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
  PID, handles the tricky `(comm)` field (can contain spaces/parens) by
  splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both
  alive-orphan graceful cancel + dead-sock cleanup
  in one shot; manual `rm` kept as fallback.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9bbb6f796b)
2026-06-17 17:39:44 -04:00
Gud Boi f5940de5c0 Add boot-race conc-anal, widen `xfail` to `n_dups=8`
New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md`
documenting the spawn-time rc=2 race under rapid
same-name spawning against a forkserver + registrar
— the `wait_for_peer_or_proc_death` helper now surfaces
the death instead of parking forever on the handshake
wait.

Also,
- extract inline `xfail` into module-level
  `_DOGGY_BOOT_RACE_XFAIL` marker.
- apply it to `n_dups=8` too (previously bare) bc
  larger N widens the race window enough to fire
  occasionally.
- link to tracking issue #456.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 92443dc4ef)
2026-06-17 17:39:44 -04:00
Gud Boi 9f9a536fa9 Adjust legacy streaming test timeouts for fork+UDS
Forking spawner + UDS transport has different timing
vs `trio_proc` — streaming example completes faster
in some cases, slower in others depending on fork
overhead + sock setup.

Deats,
- add `expect_cancel` param to `cancel_after()`, raise
  `ActorTooSlowError` when cancel scope fires unexpectedly instead of
  silently returning `None`.
- `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit
  `ActorTooSlowError` on `None` result instead of bare `assert results`.
- `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast"
  (cancel doesn't fire bc streaming finishes before delay).
- add `is_forking_spawner`, `tpt_proto` fixture params throughout.

Also,
- `_testing/pytest.py`: widen `start_method` parametrize and
  `is_forking_spawner` fixture to `scope='session'`.
- `"""` -> `'''` docstring style throughout.
- hoist `_non_linux` to module scope (was redefined locally in two
  places).
- type hints, kwarg-style `partial()` calls.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d3cbc92751)
2026-06-17 17:39:44 -04:00
Gud Boi f0b944f8db Add bare-name arg, `ss` hints to `bindspace_scan`
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>` — useful for projects like
`piker` that bind sibling sub-dirs alongside tractor's
default. Full paths still work as-is.

Also,
- rename "unparseable" section to "non-tractor" with
  clearer desc (filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for
  non-tractor socks so callers can manually resolve
  listener-PID liveness
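
A rough sketch of the bare-name resolution and the per-sock `ss` hint follows. The function names, the "contains a path separator means full path" rule, and the `/tmp` fallback for an unset `XDG_RUNTIME_DIR` are all assumptions for illustration; the real alias may decide differently.

```python
from __future__ import annotations

import os
from pathlib import Path


def resolve_bindspace(arg: str, env: dict | None = None) -> Path:
    '''
    Map a bare `<name>` to `$XDG_RUNTIME_DIR/<name>`; pass paths through.

    '''
    env = os.environ if env is None else env
    # ASSUMPTION: anything containing a path separator is a full path.
    if os.sep in arg:
        return Path(arg)

    # ASSUMPTION: fall back to `/tmp` when the env var is unset.
    runtime_dir = env.get('XDG_RUNTIME_DIR', '/tmp')
    return Path(runtime_dir) / arg


def ss_hint(sock_path: Path) -> str:
    '''
    Build the `ss` cmd a caller can run to find the listener PID.

    '''
    return f"ss -lpx 'src = {sock_path}'"
```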

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 099104e0af)
2026-06-17 17:39:44 -04:00
Gud Boi 5f2e89ed1f Harden `test_registrar` with reap fixtures, timeouts
Add module-level `pytestmark` applying per-test
`reap_subactors_per_test`, `track_orphaned_uds_per_test`, and
`detect_runaway_subactors_per_test` fixtures — registrar tests stress
discovery roundtrips that historically left orphaned UDS sock-files.

Deats,
- drop unused `say_hello()` fn, keep only `say_hello_use_wait`;
  rename param `func` -> `ria_fn`.
- use `@tractor_test(timeout=7)` instead of separate
  `@pytest.mark.timeout(7, method='thread')` decorator.
- add `with_timeout()` helper, wire into
  `test_subactors_unregister_on_cancel_remote_daemon`.
- uncomment `_timeout_main()` in `test_stale_entry_is_deleted`, use
  configurable `timeout` var + `debug_mode` guard for `tractor.pause()`
  on cancel.
- `dump_on_hang(seconds=timeout*2)` instead of hardcoded `20`.
- fix typo "oustanding" -> "outstanding".

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit abd3950ba6)
2026-06-17 17:39:44 -04:00
Gud Boi 3c6ea8009b Add `_is_tractor_subactor()`, cgroup-aware `ptree`
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.

Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
  `tractor._child` markers, falls back to `/proc/<pid>/comm` for
  zombie-resilient detection (kernel preserves `comm` past exit
  until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
  — the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
  `/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
  `_is_tractor_subactor()` for intrinsic sub-actor ID instead of
  cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
  zombie-state sub-actors

Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
  distinguish `system.slice` services vs `user.slice` desktop apps
  from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
  `user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
  tractor importable from active venv

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 522b57570b)
2026-06-17 17:39:44 -04:00
Gud Boi 338f0a1463 Add per-actor `setproctitle` via `devx._proctitle`
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.

Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
  pkg with `ImportError` guard; optional at runtime
  but listed in `pyproject.toml` so default installs
  benefit.
- called early in `_child._actor_child_main()` after
  `Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
  unit test, `/proc/{cmdline,comm}` integration
  test, negative detection test.
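
A minimal sketch of the `ImportError`-guarded title setting, assuming the third-party `setproctitle` pkg's `setproctitle.setproctitle(title)` call. The helper names here are hypothetical and the `aid_repr` arg stands in for whatever `aid.reprol()` returns.

```python
def format_proctitle(aid_repr: str) -> str:
    '''
    Build the `tractor[<aid>]` title string.

    '''
    return f'tractor[{aid_repr}]'


def set_proctitle_if_available(aid_repr: str) -> bool:
    '''
    Set the process title, returning `False` when the pkg is missing.

    '''
    try:
        # optional dep: skip quietly when not installed
        import setproctitle
    except ImportError:
        return False

    setproctitle.setproctitle(format_proctitle(aid_repr))
    return True
```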

Resolves #457

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d60245777e)
2026-06-17 17:39:44 -04:00
Gud Boi 6b14c3dbe7 Add dup-name cancel-cascade escalation test
Extend `test_register_duplicate_name` w/ cancel-level log
breadcrumbs and `try/finally` for better diag on the cancel-cascade
hang.

Add `test_dup_name_cancel_cascade_escalates_to_hard_kill` as a
regression test for the TCP+MTF duplicate-name cancel-cascade
deadlock. Spawns N same-name actors, calls `an.cancel()`, and
asserts teardown completes within a `trio.fail_after()` budget that
scales w/ `n_dups`.

Deats,
- parametrize `n_dups` (2, 4, 8) to widen the race window for
  concurrent `register_actor` RPCs.
- `n_dups=4` xfail'd — exposes a separate boot-race bug (doggy
  `rc=2` under rapid same-name spawn), tracked in #456.
- post-teardown asserts all `Portal` chans disconnect, verifying
  hard-kill escalation worked.

Relates to https://github.com/goodboy/tractor/issues/456

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit caebf60f4e)
2026-06-17 17:39:44 -04:00
Gud Boi 2cd668dcc8 Add `acli.reap`, namespace `tractor_diag` cmds
Group all xontrib aliases under an `acli.` prefix
so xonsh prefix-completion treats them as a sub-cmd
group — `acli.<TAB>` lists the full set. No parent
`acli` cmd exists; the dot is purely naming.

Renames (incl `-` -> `_` in suffixes for shell-
identifier-friendliness):

  - `pytree`         -> `acli.pytree`
  - `hung-dump`      -> `acli.hung_dump`
  - `bindspace-scan` -> `acli.bindspace_scan`

Add new `acli.reap` wrapping `scripts/tractor-reap`:

Deats,
- 3 opt-in phases via flags:

  1. process reap — `find_orphans()` (default,
     PPid=1 + cwd=repo + cmdline `python`) or
     `find_descendants(--parent PID)`. SIGINT
     first, SIGKILL after `--grace` (def 3.0s).

  2. `/dev/shm` sweep (`--shm`/`--shm-only`) —
     `find_orphaned_shm()` + `reap_shm()`. needed
     bc `tractor` disables `mp.resource_tracker`.

  3. UDS sock-file sweep (`--uds`/`--uds-only`) —
     `find_orphaned_uds()` + `reap_uds()` for stale
     `${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock`
     entries. See #452.

- `--dry-run` lists matches without signalling/
  unlinking; survivor pids or sweep errors flip
  the alias rc to `1`.
- lazy-imports `tractor._testing._reap` after
  `git rev-parse --show-toplevel` (with
  `Path(__file__).parent.parent` fallback) so the
  contrib is loadable before the venv is on
  `sys.path`.
- the `SystemExit` raised by `argparse` on `-h`/bad-args is
  caught + returned as the alias rc instead of
  killing xonsh.
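
The `SystemExit`-to-rc handling could be sketched as below. Only the flag names and the `3.0` grace default come from the list above; the function name `reap_alias` and the parser layout are assumptions, and the actual reap phases are elided.

```python
from __future__ import annotations

import argparse


def reap_alias(args: list[str]) -> int:
    '''
    Parse alias args, turning any `argparse` exit into a return code.

    '''
    parser = argparse.ArgumentParser(prog='acli.reap')
    parser.add_argument('--dry-run', action='store_true')
    parser.add_argument('--grace', type=float, default=3.0)
    parser.add_argument('--shm', action='store_true')
    parser.add_argument('--uds', action='store_true')
    try:
        parser.parse_args(args)
    except SystemExit as exit_err:
        # `-h` exits with 0, bad args exit with 2; hand that back
        # to the shell instead of letting it unwind the host process.
        code = exit_err.code
        return code if isinstance(code, int) else 1

    # (the reap phases would run here)
    return 0
```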

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit cec6cc2a56)
2026-06-17 17:39:44 -04:00
Gud Boi 9426fea5bd Escalate cancel-ack timeouts to `proc.terminate()`
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:

  graceful cancel-req -> bounded wait -> hard-kill

Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
  When `True`, bounded-wait expiry raises `ActorTooSlowError` instead
  of the legacy DEBUG-log + return-`False` path. Default stays `False`
  for callers that handle their own escalation (e.g.
  `_spawn.soft_kill()` polling `proc.poll()`).

- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
  cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
  (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
  waiting on `proc.poll()`.

- replace `tn.start_soon(portal.cancel_actor)` in
  `ActorNursery.cancel()` with the helper.
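
The cancel-then-escalate shape can be sketched with stdlib `subprocess` semantics. This is NOT the trio-based `_try_cancel_then_kill()`: the name `cancel_then_kill`, the `request_cancel` callback, and the blocking `proc.wait()` calls are stand-ins chosen so the sketch runs without `tractor`.

```python
import subprocess


def cancel_then_kill(proc, request_cancel, timeout: float) -> bool:
    '''
    graceful cancel-req -> bounded wait -> hard-kill

    Returns `True` when escalation to `proc.terminate()` was needed.

    '''
    # 1. ask nicely
    request_cancel()
    try:
        # 2. bounded wait for the proc to exit on its own
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # 3. no ack in time: escalate (SIGTERM on POSIX) then reap
        proc.terminate()
        proc.wait()
        return True

    return False
```

Here `proc` only needs `.wait(timeout=...)` and `.terminate()`, so a `subprocess.Popen` instance fits.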

Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
  REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
  opt-in).

ORing covers root-debug, child-debug, and active-REPL-lock cases
without false-positively SIGTERM-ing a sub-tree proxying stdio for
a REPL session.

Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 34f333a026)
2026-06-17 17:39:44 -04:00
Gud Boi e9b63e8569 Add `ActorTooSlowError` for cancel-cascade timeouts
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:

  graceful cancel-req -> bounded wait -> hard-kill

Motivated by the `subint_forkserver` dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 38ffb875bd)
2026-06-17 17:39:44 -04:00
Gud Boi 4c600dc528 Tidy proto-guard `ValueError` fmt in `open_root_actor()`
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5cd06810db)
2026-06-17 17:39:44 -04:00
Gud Boi 254a46c345 Mk `--capture` guard CI-aware w/ local warn
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:

- CI (`CI` env-var set): `pytest.exit(rc=2)` on
  mismatch — forces every matrix-row to declare
  `--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
  experiment with `--capture=fd` to validate fixes.

Deats,
- drop `_cap_fd_set` global; add
  `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
  spawner-name lookup
- move inline comment wall → proper docstring w/
  Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
  `request: pytest.FixtureRequest` and reads
  `request.config.option.capture` instead of the
  `_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
  capture deadlocks)
- `set_fork_aware_capture()` returns the actual
  capture mode str from config, not a hardcoded
  `'sys'`
- lift `import warnings` to module level (was duped
  inside `pytest_configure`)
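
The two-policy split could be sketched as a pure decision function. The function name, the string return values, and the placeholder spawner name in the frozenset are hypothetical; only the `CI` env-var check and the "only `fd`-level capture is unsafe" rule come from the notes above.

```python
from __future__ import annotations

# ASSUMPTION: placeholder member, the real set's contents aren't shown here
_CAPSYS_REQUIRED_SPAWNERS: frozenset[str] = frozenset({'example_fork_spawner'})


def capture_policy(spawner: str, capture: str, env: dict) -> str:
    '''
    Decide what to do for a given spawner x capture-mode combo.

    Returns one of `'ok'`, `'warn'` (local) or `'exit'` (CI).

    '''
    # `sys` and `tee-sys` are fork-safe; only `fd` capture mismatches.
    if spawner not in _CAPSYS_REQUIRED_SPAWNERS or capture != 'fd':
        return 'ok'

    # CI fails hard; local runs only warn and continue.
    return 'exit' if 'CI' in env else 'warn'
```

In a real hook the `'exit'` branch would map to `pytest.exit(..., returncode=2)` and `'warn'` to `warnings.warn()`.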

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 255c9c3a7c)
2026-06-17 17:39:44 -04:00
Gud Boi f500bebbe6 Add `--tree` flag and cross-bucket parent annos to `pytree`
Extend `pytree` with two usability improvements:

- `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at
  the top preserving contiguous parent-child shape (no
  severity-grouping), so the full tree structure is visible without
  cross-ref'ing between severity buckets.

- Cross-bucket parent annotation: when a row's parent (by ppid) lives in
  a *different* severity bucket, suffix with `[parent: <pid> (in
  `<bucket>`)]` so the `└─` marker resolves even when bucketing scatters
  parent/child into separate sections.

Also,
- split arg parsing into flag vs positional args.
- add `pid_to_bucket` dict + `walk_order` list to back both features
- rename inner `ppid` shadow to `ppid_str` to avoid collision with the
  outer `ppid` variable.
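
The cross-bucket annotation might be computed roughly as follows, assuming `pid_to_bucket` maps each pid to its severity-bucket name. The function name and signature are hypothetical.

```python
from __future__ import annotations


def parent_annotation(
    pid: int,
    ppid: int,
    pid_to_bucket: dict[int, str],
) -> str:
    '''
    Return a row suffix when the parent lives in a different bucket.

    '''
    own_bucket = pid_to_bucket.get(pid)
    parent_bucket = pid_to_bucket.get(ppid)
    # No suffix when the parent is unknown or shares our bucket.
    if parent_bucket is None or parent_bucket == own_bucket:
        return ''

    return f' [parent: {ppid} (in `{parent_bucket}`)]'
```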

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0f4e671862)
2026-06-17 17:39:44 -04:00
Gud Boi a0456fece4 Add `enable_transports`/`registry_addrs` proto guard
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).

Also,
- update `test_root_passes_tpt_to_sub` to detect a
  proto mismatch between parametrized `tpt_proto_key`
  and CLI `tpt_proto`, asserting the new guard raises
  `ValueError` with expected msg content.
- replace old commented-out notes with a clearer
  explanation of the mismatch foot-gun.
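
A rough sketch of such a guard, assuming each registry addr is a `(proto, address)` pair. That addr shape, the function name, and the error wording are illustrative assumptions, not what `open_root_actor()` actually does.

```python
def check_registry_protos(registry_addrs, enable_transports) -> None:
    '''
    Raise early instead of hanging later in the registrar handshake.

    '''
    enabled = set(enable_transports)
    mismatches = [
        (proto, addr)
        for proto, addr in registry_addrs
        if proto not in enabled
    ]
    if mismatches:
        mismatch_lines = '\n'.join(
            f'  {proto}: {addr!r}' for proto, addr in mismatches
        )
        raise ValueError(
            f'`registry_addrs` use transports not in '
            f'`enable_transports={sorted(enabled)}`:\n'
            f'{mismatch_lines}'
        )
```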

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d036ef7d7f)
2026-06-17 17:39:44 -04:00
Gud Boi 0c64e76bf9 Fix shutdown deadlock on UDS unlink race
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.

Also,
- close each endpoint independently in the finally so one raise doesn't
  strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.
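
The three fixes can be sketched together as below. A `threading.Event` stands in for the runtime's shutdown event, and both helper names are hypothetical; the real code lives in `close_listener()` and `_serve_ipc_eps` and is async.

```python
from __future__ import annotations

import os
import threading


def unlink_if_present(path) -> bool:
    '''
    Unlink a sock-file, tolerating a concurrent reaper winning the race.

    '''
    try:
        os.unlink(path)
    except FileNotFoundError:
        return False
    return True


def close_all(closers, shutdown: threading.Event) -> list[Exception]:
    '''
    Close every endpoint independently, then ALWAYS signal shutdown.

    '''
    errors: list[Exception] = []
    try:
        for close in closers:
            try:
                close()
            except Exception as err:
                # one failing endpoint must not strand the rest
                errors.append(err)
    finally:
        # waiters are released no matter what happened above
        shutdown.set()
    return errors
```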

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 2ee44a6fdd)
2026-06-17 17:39:44 -04:00
Gud Boi 84d52835ab Add `tractor_diag`(nosis) xontrib with aliases
Xonsh xontrib providing three diagnostic commands
for tractor development / hang investigation:

- `pytree <pid|pat>` — psutil-backed proc tree with severity-bucketed
  output (zombies > orphans > live), tree-depth markers, zombie-safe
  rendering.
- `hung-dump <pid|pat>` — kernel `wchan`/`stack` + `py-spy dump
  --locals` per descendant, sudo-cred caching upfront, pgrep fallback
  when psutil absent.
- `bindspace-scan [<dir>]` — scan UDS bindspace for orphaned
  `<name>@<pid>.sock` files whose binder pid is dead, emit `rm`
  one-liner for cleanup.
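
The orphaned-sock check for `bindspace-scan` might be sketched like this. The regex, the helper names, and the use of signal `0` as the liveness probe are assumptions for illustration, not necessarily how the xontrib does it.

```python
from __future__ import annotations

import os
import re

# `<name>@<pid>.sock`
_SOCK_RE = re.compile(r'^(?P<name>.+)@(?P<pid>\d+)\.sock$')


def parse_sock_name(filename: str) -> tuple[str, int] | None:
    '''
    Split a sock filename into `(name, pid)`, or `None` if no match.

    '''
    match = _SOCK_RE.match(filename)
    if match is None:
        return None
    return match['name'], int(match['pid'])


def pid_is_alive(pid: int) -> bool:
    '''
    Probe a pid with signal 0, which checks existence without signalling.

    '''
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # exists, but is owned by another user
        return True
    return True
```

A sock whose parsed pid fails `pid_is_alive()` would be a candidate for the emitted `rm` one-liner.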

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7b14fdcd96)
2026-06-17 17:39:44 -04:00
Gud Boi cc5be294b1 Mk per-test reap fixtures opt-in
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.

Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
  per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
  psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e4953851de)
2026-06-17 17:39:44 -04:00
Gud Boi d8a398f6f5 Mv `daemon` + `test_multi_program` to `discovery/`
All `daemon` fixture consumers are discovery-
protocol tests now living under `tests/discovery/`.
Move the fixture, its `_wait_for_daemon_ready`
helper, and `test_multi_program.py` into that subdir
so scope matches usage.

Also,
- add `pytestmark` for `track_orphaned_uds_per_test`
  + `detect_runaway_subactors_per_test` to `test_multi_program` as
  a regression net.
- drop now-unused `_PROC_SPAWN_WAIT` + `socket` import from root
  conftest.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit c4082be876)
2026-06-17 17:39:44 -04:00
Gud Boi e873bb6164 Replace sleep with active poll in `daemon` fixture
First draft at resolving,
https://github.com/goodboy/tractor/issues/424

`tests.conftest.py.daemon()` previously used a blind
`time.sleep(_PROC_SPAWN_WAIT + uds_bonus + ci_bonus)` to "wait for the
daemon to come up" before yielding the proc to the test.

Two problems:

1. **Racy under load** — sleep is fixed at design time; loaded boxes
   / cold starts / fork-spawn cost spikes blow past it, leading to
   `ConnectionRefusedError` / `OSError: connect failed` flakes in
   `test_register_duplicate_name`.

2. **Wasteful when daemon comes up fast** — happy-path pays the FULL
   sleep regardless. ~3s of dead time per fixture invocation, ~10-20s
   per full suite run.

Replace with `_wait_for_daemon_ready()` — active poll via stdlib
`socket.create_connection` (TCP) or `socket.connect` (UDS) on the
daemon's bind addr, with 50ms backoff and a 10s/15s deadline (CI gets
extra headroom). Daemon-died-during-startup early-exit catches the case
where `_PROC_SPAWN_WAIT` was silently masking daemon startup crashes.
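
For the TCP case, the active poll could look roughly like this. It is a sketch and not the actual `_wait_for_daemon_ready()`: the function name, the parameter names, and the optional `is_alive` callback used for the died-during-startup early exit are assumptions.

```python
import socket
import time


def wait_for_tcp_ready(
    host: str,
    port: int,
    deadline: float = 10.0,
    backoff: float = 0.05,
    is_alive=None,
) -> bool:
    '''
    Poll a TCP bind addr until it accepts a connection or time runs out.

    '''
    give_up_at = time.monotonic() + deadline
    while time.monotonic() < give_up_at:
        # early exit: no point polling a daemon that already died
        if is_alive is not None and not is_alive():
            raise RuntimeError('daemon died during startup')
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # not listening yet (e.g. connection refused): back off, retry
            time.sleep(backoff)

    return False
```

The happy path returns as soon as the listener accepts, so a fast-starting daemon no longer pays a fixed sleep.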

Why stdlib `socket` (Option 2 from the conc-anal doc) instead of
`tractor`'s own `_root.ping_tpt_socket` closure or trio?

- `tractor.run_daemon()` doesn't return from bootstrap until the runtime
  is fully ready to handle IPC, so probing listen-side acceptance is
  sufficient.
- no need to do the full IPC handshake just to validate readiness.
  Sidesteps the `trio.run()` bootstrap cost (~50ms) per fixture too.

`claude`'s verification: 10/10 runs of `tests/test_multi_program.py`
pass on both `--tpt-proto=tcp` and `--tpt-proto=uds`. Per-test wall-time
`test_register_duplicate_name`: 4.31s → 1.10s. Full file: ~12s → 3.27s
per transport.

Doc-tracked at:
`ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md`

Future work — session-scoped trio runtime in a bg thread to share
fixture-side trio operations across many fixtures (currently overkill
for the one fixture that needs it).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit ec8c4659c4)
2026-06-17 17:39:44 -04:00
Gud Boi 63c6da9e82 Use single f-string per pid in runaway warning
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 086e9f2c07)
2026-06-17 17:39:44 -04:00