Commit Graph

2652 Commits (6de9b4c945d96d4e5a7e2d420dbebef79e2c32af)

Author SHA1 Message Date
Gud Boi 6de9b4c945 Backend-aware timeout in `maybe_expect_raises`
Default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).

Defaults:
- `main_thread_forkserver` → 30s
- everything else          → 3s (unchanged)

Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):

- 3s  → all-valid variant flaked w/ `TooSlowError`
- 8s  → `invalid-return` variant flaked w/ `Cancelled`
        (surfaced instead of `MsgTypeError` bc the
        outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention

30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 060f7d24c4)
2026-06-09 23:53:14 -04:00
Gud Boi c065dbcfcf Use `trio.fail_after` cap in `test_dynamic_pub_sub`
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.

Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:

- `method='signal'` (SIGALRM) synchronously raises `Failed`
  in trio's main thread mid-`epoll.poll()`, leaving
  `GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
  abandoned") so EVERY subsequent `trio.run()` in the same
  pytest process bails with
  `RuntimeError: Attempted to call run() from inside a run()`
  — full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
  can let the KBI escape trio's `KIManager` under fork-
  cascade teardown races and bubble out of pytest entirely
  — kills the whole session.

`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
  cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
  global state corruption either way.

Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.

Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 530160fa69)
2026-06-09 23:53:14 -04:00
Gud Boi c69b0e82f0 Sweep `subint_forkserver` → `main_thread_forkserver` in code
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:

- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
  `main_thread_forkserver`, drop the stub-only `subint_forkserver`
  entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
  capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
  xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
  reason-text since the `mp.SharedMemory(track=False)`
  + resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
  the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
  `_testing.pytest`, `_subint.py`, `pyproject.toml`,
  `test_cancellation.py`, `test_registrar.py` — refs to variant-1
  backend updated.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 205382a39b)
(factored: dropped spawn-backend-only path: tractor/spawn/_subint.py)
2026-06-09 23:53:14 -04:00
Gud Boi 00d050fa01 Wire `reg_addr` into `test_context_stream_semantics`
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.

Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 66f1941f46)
2026-06-09 23:53:14 -04:00
Gud Boi 9bc7e461ee Wire `test_dynamic_pub_sub` to standard fixtures
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:

- pass `registry_addrs=[reg_addr]` + `debug_mode` into
  `tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
  `test_log.exception('Timed out AS EXPECTED')` so the
  expected timeout is logged explicitly instead of
  swallowed.

Also,
- drop whitespace-only blank lines around the
  `subs` param of `consumer()` and `ctx` param of
  `one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
  docstring to multi-line form.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9b05f659b3)
2026-06-09 23:53:14 -04:00
Gud Boi 3717d28466 Bump `test_stale_entry_is_deleted`'s timeout to 30
Seems that when run in-suite it delays more then the so-measured "happy
path" timing; better to have no suite-global interruption then asserting
a fast single test's run.

(cherry picked from commit 65fcfbf224)
2026-06-09 23:53:14 -04:00
Gud Boi ec7efd792a Document `SharedMemory` × `subint_forkserver` incompat
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).

Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
  descriptions + link to the analysis doc

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit c99d475d03)
(factored: dropped spawn-backend-only paths: ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md)
2026-06-09 23:53:14 -04:00
Gud Boi 3fdf7595e5 Skip `test_loglevel_propagated_to_subactor` on subint forkserver too
(cherry picked from commit 2ca0f41e61)
2026-06-09 23:53:14 -04:00
Gud Boi 07e60b71ee Wire `reg_addr` through infected-asyncio tests
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.

Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
  calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
  - add `reg_addr`, `debug_mode`, `start_method`
    fixture params
  - `delay` now reads the `debug_mode` param directly
    instead of calling `tractor.debug_mode()` (fires
    slightly earlier in the test lifecycle)
  - sanity assert `if debug_mode: assert
    tractor.debug_mode()` after nursery open
  - new print showing SIGINT target
    (`send_sigint_to` + resolved pid)
  - catch `trio.TooSlowError` around
    `ctx.wait_for_result()` and conditionally
    `pytest.xfail` when `send_sigint_to == 'child'
    and start_method == 'subint_forkserver'` — the
    known orphan-SIGINT limitation tracked in
    `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b350aa09ee)
2026-06-09 23:53:14 -04:00
Gud Boi 60cac2b12f Update `subint_forkserver` skip reason: capture-pipe
Refresh the `test_nested_multierrors` skip-mark
reason to the final diagnosis: the hang is pytest's
default `--capture=fd` pipe filling from high-volume
subactor traceback output inherited via fds 1,2 in
fork children — `pytest -s` passes cleanly. Records
the fix direction (redirect child stdio to
`/dev/null` in the fork-child prelude) for whoever
lands the backend.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eceed29d4a)
(factored: kept only the tests/test_cancellation.py skip-reason update of
 "Pin forkserver hang to pytest `--capture=fd`"; dropped the subint
 conc-anal doc + tests/spawn/test_subint_forkserver.py)
2026-06-09 23:53:14 -04:00
Gud Boi ca2b7b2483 Skip-mark `subint_forkserver` nested-multierror hang
Skip-mark the still-hanging
`test_nested_multierrors[subint_forkserver]` via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test matrix
while the remaining bug is being chased. The mark is
an inert no-op until that (in-dev) backend lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 506617c695)
(factored: kept only the tests/test_cancellation.py skip-mark; dropped
 the subint_forkserver conc-anal doc update)
2026-06-09 23:53:14 -04:00
Gud Boi 90972e5c6d Wire `reg_addr` through leaky cancel tests
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1af2121057)
2026-06-09 23:53:14 -04:00
Gud Boi 837662c94d Enable `debug_mode` for `subint_forkserver`
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.

Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
  listing the spawn backends whose subactor runtime is trio-native
  (`'trio'`, `'subint_forkserver'`). Both the enable-site
  (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
  key.
  off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
  reports the full compatible-set + the rejected method instead of the
  stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
  `'subint_forkserver'` to the existing `('trio', 'subint')` tuple
  — fork child-side runtime receives the same SpawnSpec IPC handshake as
  the others.
- `subint_forkserver_proc` child-target now passes
  `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
  `Actor.pformat()` / log lines reflect the actual parent-side spawn
  mechanism rather than masquerading as plain `trio`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8bcbe730bf)
2026-06-09 23:53:14 -04:00
Gud Boi 142c828288 Mark `subint`-hanging tests with `skipon_spawn_backend`
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.

Deats,
- Module-level `pytestmark` on full-file-hanging suites:
  - `tests/test_cancellation.py`
  - `tests/test_inter_peer_cancellation.py`
  - `tests/test_pubsub.py`
  - `tests/test_shm.py`
- Per-test decorator where only one test in the file
  hangs:
  - `tests/discovery/test_registrar.py
    ::test_stale_entry_is_deleted` — replaces the
    inline `if start_method == 'subint': pytest.skip`
    branch with a declarative skip.
  - `tests/test_subint_cancellation.py
    ::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
  place as breadcrumbs for later finer-grained unskips.

Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
  (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
  `tests/conftest.py`, `tests/devx/conftest.py`, and
  `tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
  repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
  `@tractor_test(...)` on `test_cancel_infinite_streamer`
  and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
  `test_cancel_via_SIGINT`; use `start_method` in the
  `'mp' in ...` check instead.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4b2a0886c3)
(factored: dropped spawn-backend-only path: tests/test_subint_cancellation.py)
2026-06-09 23:53:14 -04:00
Gud Boi f75e071a7c Skip `test_stale_entry_is_deleted` hanger with `subint`s
(cherry picked from commit 985ea76de5)
2026-06-09 23:53:14 -04:00
Gud Boi 10dbc9390c Arm `dump_on_hang` on `test_stale_entry_is_deleted`
Wrap the test's `trio.run(main)` in
`dump_on_hang(seconds=20)` so any future hang
regression captures a stack dump for triage instead
of wedging CI silently; under the default backends
it's a no-op safety net.

Includes a "KNOWN ISSUE" comment block documenting
the (future) `subint` backend hang classes observed
against this test during Phase B bringup (#379).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4a3254583b)
(factored: kept only the tests/discovery/test_registrar.py part of
 "Doc `subint` backend hang classes + arm `dump_on_hang`"; dropped
 subint conc-anal docs + tests/test_subint_cancellation.py)
2026-06-09 23:53:14 -04:00
Gud Boi 4e56ee11cb Sync `tractor_diag.xsh` move-out + `_reap` doc wording
Complete the xontrib-side slimming from "Mv core impl
`tractor_diag.xsh` to `_testing.trace`" (its xontrib hunk
couldn't ride with the testing-harness segment since the
xontrib lands in this one) and sync a couple `_reap.py`
issue/doc-ref comment strings to their final form.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-06-09 23:53:14 -04:00
Gud Boi b0bf546b54 Hoist proc-title prefix to `_def_prefix` const
Make the sub-actor proc-title prefix a single
authoritative constant (`_proctitle._def_prefix`) so
the reap-recognition markers and `xontrib` banner pick
it up automatically — one place to flip the prefix
shape going fwd.

Deats,
- `_proctitle._def_prefix: str = '_subactor'`. New
  module-level const consumed by everything that needs
  to know the prefix.
- `set_actor_proctitle(actor, prefix=_def_prefix)`:
  takes an explicit `prefix` arg (default = the const)
  so callers can override per-spawn if they want.
- Default proc-title format:
  `'tractor[<reprol>]'` → `f'{prefix}[<reprol>]'`
  i.e. `_subactor[<reprol>]` by default.
- `_testing/_reap.py`: cmdline + comm markers source
  the prefix from `_proctitle._def_prefix` instead of
  the hardcoded `'tractor['`. So
  `_is_tractor_subactor()` tracks the const
  automatically.
- `xontrib/tractor_diag.xsh`: `acli.reap` orphan-mode
  banner now interpolates the
  `_TRACTOR_PROC_CMDLINE_MARKERS` tuple directly so
  the human-readable mode line stays in sync if the
  prefix shape changes again.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3a45dbd503)
2026-06-09 23:53:14 -04:00
Gud Boi d1b5d79174 Add `acli.watch` flicker-free alias-loop
Per-terminal optimized `watch`-like xonsh alias that
runs an arbitrary callable alias in a loop inside the
alt-screen buffer with flicker-free repaint. Supersedes
the inline `acli.ptree` polling .xsh snippet (removed
from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).

Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
  (`\033[?25l/h`) wrapped in try/finally so Ctrl-C always
  returns to a pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line
  EL (`\033[K` before each `\n`) + post-draw erase-down
  (`\033[J`) → stale tail chars from a longer prior
  frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next
  frame does a full clear (`\033[H\033[2J`) instead of
  the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler`
  so `KeyboardInterrupt` lands cleanly; prior handler
  restored on exit.
- Output capture: redirect the alias's stdout to
  `StringIO` per frame so we can post-process the EL
  fix. Aliases writing directly to `sys.stdout.buffer`
  / `os.write(1)` bypass capture — EL-fix won't apply
  but loop still works.
- Alias unwrap: xonsh stores callables as either a bare
  callable OR `[fn, *preset_args]`. Both forms handled;
  subprocess-style aliases rejected w/ a friendly err
  msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest
  of argv forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.

Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch`
  entry.
- Top-level imports: `os`, `sys`, `signal`, `time`,
  `typing.Callable`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit bb239e847f)
2026-06-09 23:53:14 -04:00
Gud Boi 9b3d36038b Add `acli.ptree` poll .xsh snippet to docstr
(cherry picked from commit f617c8cb73)
2026-06-09 23:53:14 -04:00
Gud Boi e6fec9dbc8 Add ppid-aware liveness buckets to `bindspace_scan`
Split the old `live`/`orphans` sock classification
into three ppid-aware buckets: `live-active` (PID
alive, parent owns it), `orphaned-alive` (PID alive
but `ppid==1`, init-adopted — `acli.reap` candidate),
and `orphaned-dead` (PID gone, sock stale).

Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
  PID, handles the tricky `(comm)` field (can contain spaces/parens) by
  splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both
  alive-orphan graceful cancel + dead-sock cleanup
  in one shot; manual `rm` kept as fallback.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9bbb6f796b)
2026-06-09 23:53:14 -04:00
Gud Boi 55ae447c45 Add bare-name arg, `ss` hints to `bindspace_scan`
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>` — useful for projects like
`piker` that bind sibling sub-dirs alongside tractor's
default. Full paths still work as-is.

Also,
- rename "unparseable" section to "non-tractor" with
  clearer desc (filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for
  non-tractor socks so callers can manually resolve
  listener-PID liveness

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 099104e0af)
2026-06-09 23:53:14 -04:00
Gud Boi 5d041effe6 Add per-actor `setproctitle` via `devx._proctitle`
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.

Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
  pkg with `ImportError` guard; optional at runtime
  but listed in `pyproject.toml` so default installs
  benefit.
- called early in `_child._actor_child_main()` after
  `Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
  unit test, `/proc/{cmdline,comm}` integration
  test, negative detection test.

Resolves #457

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d60245777e)
2026-06-09 23:53:14 -04:00
Gud Boi 1be3bc72bd Add `_is_tractor_subactor()`, cgroup-aware `ptree`
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.

Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
  `tractor._child` markers, falls back to `/proc/<pid>/comm` for
  zombie-resilient detection (kernel preserves `comm` past exit
  until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
  — the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
  `/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
  `_is_tractor_subactor()` for intrinsic sub-actor ID instead of
  cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
  zombie-state sub-actors

Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
  distinguish `system.slice` services vs `user.slice` desktop apps
  from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
  `user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
  tractor importable from active venv

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 522b57570b)
2026-06-09 23:53:14 -04:00
Gud Boi 27f6fe0454 Add `acli.reap`, namespace `tractor_diag` cmds
Group all xontrib aliases under an `acli.` prefix
so xonsh prefix-completion treats them as a sub-cmd
group — `acli.<TAB>` lists the full set. No parent
`acli` cmd exists; the dot is purely naming.

Renames (incl `-` -> `_` in suffixes for shell-
identifier-friendliness):

  - `pytree`         -> `acli.pytree`
  - `hung-dump`      -> `acli.hung_dump`
  - `bindspace-scan` -> `acli.bindspace_scan`

Add new `acli.reap` wrapping `scripts/tractor-reap`:

Deats,
- 3 opt-in phases via flags:

  1. process reap — `find_orphans()` (default,
     PPid=1 + cwd=repo + cmdline `python`) or
     `find_descendants(--parent PID)`. SIGINT
     first, SIGKILL after `--grace` (def 3.0s).

  2. `/dev/shm` sweep (`--shm`/`--shm-only`) —
     `find_orphaned_shm()` + `reap_shm()`. needed
     bc `tractor` disables `mp.resource_tracker`.

  3. UDS sock-file sweep (`--uds`/`--uds-only`) —
     `find_orphaned_uds()` + `reap_uds()` for stale
     `${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock`
     entries. See #452.

- `--dry-run` lists matches without signalling/
  unlinking; survivor pids or sweep errors flip
  the alias rc to `1`.
- lazy-imports `tractor._testing._reap` after
  `git rev-parse --show-toplevel` (with
  `Path(__file__).parent.parent` fallback) so the
  contrib is loadable before the venv is on
  `sys.path`.
- `argparse.SystemExit` on `-h`/bad-args is
  caught + returned as the alias rc instead of
  killing xonsh.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit cec6cc2a56)
2026-06-09 23:52:52 -04:00
Gud Boi 59fb0053e5 Add `--tree` flag and cross-bucket parent annos to `pytree`
Extend `pytree` with two usability improvements:

- `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at
  the top preserving contiguous parent-child shape (no
  severity-grouping), so the full tree structure is visible without
  cross-ref'ing between severity buckets.

- Cross-bucket parent annotation: when a row's parent (by ppid) lives in
  a *different* severity bucket, suffix with `[parent: <pid> (in
  `<bucket>`)]` so the `└─` marker resolves even when bucketing scatters
  parent/child into separate sections.

Also,
- split arg parsing into flag vs positional args.
- add `pid_to_bucket` dict + `walk_order` list to back both features
- rename inner `ppid` shadow to `ppid_str` to avoid collision with the
  outer `ppid` variable.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0f4e671862)
2026-06-09 23:52:52 -04:00
Gud Boi dba40af771 Add `tractor_diag`(nosis) xontrib with aliases
Xonsh xontrib providing three diagnostic commands
for tractor development / hang investigation:

- `pytree <pid|pat>` — psutil-backed proc tree with severity-bucketed
  output (zombies > orphans > live), tree-depth markers, zombie-safe
  rendering.
- `hung-dump <pid|pat>` — kernel `wchan`/`stack` + `py-spy dump
  --locals` per descendant, sudo-cred caching upfront, pgrep fallback
  when psutil absent.
- `bindspace-scan [<dir>]` — scan UDS bindspace for orphaned
  `<name>@<pid>.sock` files whose binder pid is dead, emit `rm`
  one-liner for cleanup.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7b14fdcd96)
2026-06-09 23:52:52 -04:00
Gud Boi 22c241fbd4 Filter `_find_tractor_strays` by ppid disposition
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).

Previously, procs whose `ppid` points to some OTHER live non-`pytest`
(in the use of `acli.ptree pytest`) process belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon) and were being falsely flagged as cross-test ghosts.

Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
  short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
  filter rule + its rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit a6d4ac3aac)
2026-06-09 23:52:52 -04:00
Gud Boi 2cd0908bdb Add init-adopted orphan reap to `reap_subactors_per_test`
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 01ce2857ea)

(factored: also default `find_orphans(repo_root=None)` -> cwd so the new bare call sites work ahead of the later intrinsic-identity rewrite)
2026-06-09 23:52:52 -04:00
Gud Boi e84dac233d Add subtree-walk to `reap()` for full actor-tree teardown
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).

Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8de684f5de)
2026-06-09 23:24:18 -04:00
Gud Boi ec5f080ecc Add hang-snapshot session index to pytest summary
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
  populated by `_do_capture_snapshot()` on each successful dump;
  add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
  prints all captured snapshot dirs at end-of-session so paths
  don't get buried in scrollback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit fb87c36263)
2026-06-09 23:24:18 -04:00
Gud Boi dd73e045c0 Add stray-proc scan + refine `_testing.trace` capture
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
  `tractor._child` procs NOT in the walk's `seen` set — surfaces
  ghost subactor trees from prior test runs (cross-test launchpad
  contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
  into `_classify_walk()` closure, walk stray roots as additional
  trees, emit stray-root summary in header. Also: `tractor._child`
  procs reparented to init are now always classified as orphans
  regardless of cgroup-slice (leaked subactor ≠ desktop-launched
  app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
  `--capture=sys` redirection so snapshot paths always land on the
  real terminal
- `fail_after_w_trace()`: capture diag snapshot on
  non-`TooSlowError` exceptions when the `fail_after` scope's
  cancel had already fired (e.g. nursery wraps `Cancelled` into a
  `BaseExceptionGroup` that escapes before `TooSlowError` can be
  raised).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3a243a1fd4)
2026-06-09 23:24:18 -04:00
Gud Boi 0e32b511bc Mv core impl `tractor_diag.xsh` to `_testing.trace`
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.

Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
  hung-state dumper, bindspace scanner, `dump_all()` snapshot
  archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
  (trio `fail_after` + auto-snapshot on `TooSlowError`),
  `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
  `SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
  import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
  `_testing.trace` — drops ~627 lines of inline impl. Add
  `acli.dump_all` alias for full snapshot-bundle CLI access.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7509e313ff)

(factored: the xontrib-side move-out hunk rides with the diag-xontrib segment)
2026-06-09 23:24:18 -04:00
Gud Boi 9fb1c4ccc0 Mk `--capture` guard CI-aware w/ local warn
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:

- CI (`CI` env-var set): `pytest.exit(rc=2)` on
  mismatch — forces every matrix-row to declare
  `--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
  experiment with `--capture=fd` to validate fixes.

Deats,
- drop `_cap_fd_set` global; add
  `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
  spawner-name lookup
- move inline comment wall → proper docstring w/
  Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
  `request: pytest.FixtureRequest` and reads
  `request.config.option.capture` instead of the
  `_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
  capture deadlocks)
- `set_fork_aware_capture()` returns the actual
  capture mode str from config, not a hardcoded
  `'sys'`
- lift `import warnings` to module level (was duped
  inside `pytest_configure`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 255c9c3a7c)
2026-06-09 23:24:18 -04:00
Gud Boi c0f5bd2915 Mk per-test reap fixtures opt-in
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.

Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
  per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
  psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e4953851de)
2026-06-09 23:24:18 -04:00
Gud Boi 32a7ead862 Use single f-string per pid in runaway warning
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 086e9f2c07)
2026-06-09 23:24:18 -04:00
Gud Boi 35e8880075 Add per-test runaway-subactor CPU detector to `_reap`
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).

Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. Session-end reaper still
SIGINT/SIGKILL survivors on normal exit.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5cf0312c78)
2026-06-09 23:24:18 -04:00
Gud Boi eb89db81a5 Fix `maybe_override_capture` to not get invalid capX fixture names..
(cherry picked from commit 32e89c67ee)
2026-06-09 23:24:18 -04:00
Gud Boi dd1d6cd51e Add fork-aware capture fixtures to `_testing.pytest`
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.

Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
  globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
  args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enalbes skipping tests that need capsys
  but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
  fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
  wrapper's fixture injection.

Also,
- add `alert_on_finish` session fixture (terminal
  bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
  our plugin eventually however..

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d549c72052)
2026-06-09 23:24:18 -04:00
Gud Boi 82e25c442a Add `pytest_load_initial_conftests()` for `--capture=`
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.

Also,
- load `tractor._testing.pytest` via `-p` in ini
  (bc bootstrapping hooks must register before
  conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
  tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
  per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
  and pop env var on disable.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 61d4525137)
2026-06-09 23:24:18 -04:00
Gud Boi 90c46288ad Add `--uds`/`--uds-only` flags to `tractor-reap`
Wire up `find_orphaned_uds()` + `reap_uds()` from
`_reap` as a new phase-3 UDS sweep in the CLI
script. Opt-in via `--uds` (run after proc reap +
shm) or `--uds-only` (skip other phases).

Also,
- consolidate skip-proc-reap logic into a single
  `skip_proc_reap` bool covering both `--shm-only`
  and `--uds-only`
- extend header docstring + usage examples

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 0996a83655)
2026-06-09 23:24:18 -04:00
Gud Boi 053051535f Add UDS orphan-sweep helpers + reap fixtures to `_reap`
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:

- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
  `reap_uds()` — detect `<name>@<pid>.sock` files under
  `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
  the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
  lingering subactors, wait, SIGKILL survivors, then sweep orphaned
  UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
  snapshot sock-file dir before/after each test, warn + reap new
  orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
  with known-leaky teardown.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1cdc7fb302)
2026-06-09 23:24:18 -04:00
Gud Boi 4c50d610e6 Flip back to default `pytest` capture for CI
(cherry picked from commit 22cdf15b73)
2026-06-09 23:24:18 -04:00
Gud Boi 9a844b91f3 Drop global `pytest-timeout` cap from `pyproject.toml`
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.

Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. Recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.

Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.

For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.

For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.

`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.

ALSO, bump `xonsh` to at least `0.23.0` release.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3c366cac13)

(factored: the xonsh pin/editable-source hunks already landed with the devenv segment)
2026-06-09 23:24:18 -04:00
Gud Boi dcb00e5a8f Return parent `pid: int` from new `reap_subactors_per_test` fixture
(cherry picked from commit f8178df0fd)
2026-06-09 23:24:18 -04:00
Gud Boi 94d233a2f7 Add opt-in `reap_subactors_per_test` fixture
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.

Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...

Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.

Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b376eb0332)
2026-06-09 23:24:18 -04:00
Gud Boi bac291dff4 Fix `_testing.addr.get_rando_addr` cross-process collisions
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".

Switch to per-call `random.randint()` salted with `os.getpid()`
so:

- within one session: two calls return distinct ports — e.g.
  `test_tpt_bind_addrs::bind-subset-reg` now actually gets two
  different reg addrs on the TCP backend (it was silently
  duplicating before),
- across parallel sessions: pid salt biases each process's
  port choices apart, making cross-run collisions
  vanishingly rare.

Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7c5dd4d033)
2026-06-09 23:24:18 -04:00
Gud Boi 7bcb30f6a6 Add `--shm` orphan sweep to `tractor-reap`
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.

Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
  + `reap_shm()` helpers. Match criteria: regular file
  under `/dev/shm`, owned by current uid, AND no live
  proc has it open (mmap'd or fd-held). In-use
  enumeration via `psutil.Process.memory_maps()` +
  `.open_files()` — xplatform, kernel-canonical (same
  answer `lsof` would give), no reliance on
  tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
  `NotImplementedError` outside Linux/FreeBSD bc macOS
  POSIX shm has no fs-visible path (`shm_open` only)
  and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
  process reap) and `--shm-only` (skip process phase)
  flags. `-n` dry-runs both phases. Exit code is `1`
  if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
  the `testing` dep group; lazy-imported in `_reap.py`
  so the process-reap path stays import-clean without
  it.

Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
  (new section 10c) — covers match criteria + the
  preservation guarantee for unrelated apps.
- flip mitigation status in
  `subint_forkserver_mp_shared_memory_issue.md` from
  "could extend `tractor-reap`" to "implemented", with
  a note that callers should still UUID-pin shm keys to
  avoid cross-session collisions.

Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4f12d69b41)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-09 23:24:18 -04:00
Gud Boi 6de96b508f Add `tractor-reap` CLI + document auto-reap
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:

- orphan-mode (default): finds PPid==1 procs
  with cwd matching repo root + `python` in
  cmdline.
- descendant-mode (`--parent <pid>`): scoped
  sweep under a still-live supervisor.

SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).

Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 6d76b60404)
2026-06-09 23:24:18 -04:00
Gud Boi 34e28cd2e7 Add `_testing._reap` + auto-reap fixture
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.

Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
  `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
  - `find_descendants(parent_pid)` for the in-session case
    (PPid-direct-match while pytest is still alive).
  - `find_orphans(repo_root)` for the CLI / post- mortem case (`PPid==1`
    reparented to init + `cwd` filter to repo root + `python` cmdline
    filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
  all, poll up to `grace` for exit, SIGKILL any survivors. Returns
  `(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
  `tractor/_testing/pytest.py` — after `yield`, runs
  `find_descendants(os.getpid())` + `reap(...)` so each pytest session
  leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
  commit) for the pytest-died-mid-session case where the in-session
  fixture didn't get to run.

Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
  module-top in `pytest.py` (was inline-imported inside
  `pytest_generate_tests`), and reuse it in
  `pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
  mark arg is a valid spawn-method literal — catches typos at collection
  time.
- inline `# ?TODO` flags running these through the `try_set_backend`
  checker for stronger validation.

Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eae478f3d5)
2026-06-09 23:24:18 -04:00