In `pyproject.toml`,
- include the `sync_pause` group from `dev`, so dev
installs ship `greenback` for `pause_from_sync()`.
Comment out per-test `@pytest.mark.timeout(...)`
markers in,
- `tests/devx/test_debugger.py`
- `tests/discovery/test_registrar.py`
- `tests/spawn/test_main_thread_forkserver.py`
- `tests/spawn/test_subint_cancellation.py`
- `tests/test_advanced_streaming.py`
- `tests/test_cancellation.py`
The global cap was already dropped (3c366cac); these
were the leftover per-test caps which now block
interactive `pdb` flows under the new spawn backends.
In `uv.lock`,
- pull `greenback` into the resolved `dev` deps
(per the `sync_pause` include above).
- catch up the prior `xonsh` editable→PyPI switch
(from the `pyproject.toml` `tool.uv.sources` edit).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b7115fc875)
(factored: dropped spawn-backend-only paths under tests/spawn/)
Mirror `060f7d24`'s pattern (backend-aware timeout in
`maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard
`trio.fail_after` cap. Fork-based backends pay per-spawn
fork+IPC-handshake cost which stacks over `cpus - 1`
sequential `n.run_in_actor()` calls; empirically 12s
flakes on `main_thread_forkserver` under UDS
cross-pytest contention (#451 / #452).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 12s (unchanged)
Hoist the timeout-pick out of the `main()` closure so the
dispatch happens once in the trio task rather than
re-evaluating per spawn.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 383b0fdd75)
Default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 3s (unchanged)
Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):
- 3s → all-valid variant flaked w/ `TooSlowError`
- 8s → `invalid-return` variant flaked w/ `Cancelled`
(surfaced instead of `MsgTypeError` bc the
outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention
30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 060f7d24c4)
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.
Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:
- `method='signal'` (SIGALRM) synchronously raises `Failed`
in trio's main thread mid-`epoll.poll()`, leaving
`GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
abandoned") so EVERY subsequent `trio.run()` in the same
pytest process bails with
`RuntimeError: Attempted to call run() from inside a run()`
— full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
can let the KBI escape trio's `KIManager` under fork-
cascade teardown races and bubble out of pytest entirely
— kills the whole session.
`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
global state corruption either way.
Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.
Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 530160fa69)
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:
- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
`main_thread_forkserver`, drop the stub-only `subint_forkserver`
entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
reason-text since the `mp.SharedMemory(track=False)`
+ resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
`_testing.pytest`, `_subint.py`, `pyproject.toml`,
`test_cancellation.py`, `test_registrar.py` — refs to variant-1
backend updated.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 205382a39b)
(factored: dropped spawn-backend-only path: tractor/spawn/_subint.py)
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.
Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 66f1941f46)
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:
- pass `registry_addrs=[reg_addr]` + `debug_mode` into
`tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
`test_log.exception('Timed out AS EXPECTED')` so the
expected timeout is logged explicitly instead of
swallowed.
Also,
- drop whitespace-only blank lines around the
`subs` param of `consumer()` and `ctx` param of
`one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
docstring to multi-line form.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9b05f659b3)
Seems that when run in-suite it delays more then the so-measured "happy
path" timing; better to have no suite-global interruption then asserting
a fast single test's run.
(cherry picked from commit 65fcfbf224)
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).
Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
descriptions + link to the analysis doc
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit c99d475d03)
(factored: dropped spawn-backend-only paths: ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md)
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.
Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
- add `reg_addr`, `debug_mode`, `start_method`
fixture params
- `delay` now reads the `debug_mode` param directly
instead of calling `tractor.debug_mode()` (fires
slightly earlier in the test lifecycle)
- sanity assert `if debug_mode: assert
tractor.debug_mode()` after nursery open
- new print showing SIGINT target
(`send_sigint_to` + resolved pid)
- catch `trio.TooSlowError` around
`ctx.wait_for_result()` and conditionally
`pytest.xfail` when `send_sigint_to == 'child'
and start_method == 'subint_forkserver'` — the
known orphan-SIGINT limitation tracked in
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b350aa09ee)
Refresh the `test_nested_multierrors` skip-mark
reason to the final diagnosis: the hang is pytest's
default `--capture=fd` pipe filling from high-volume
subactor traceback output inherited via fds 1,2 in
fork children — `pytest -s` passes cleanly. Records
the fix direction (redirect child stdio to
`/dev/null` in the fork-child prelude) for whoever
lands the backend.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit eceed29d4a)
(factored: kept only the tests/test_cancellation.py skip-reason update of
"Pin forkserver hang to pytest `--capture=fd`"; dropped the subint
conc-anal doc + tests/spawn/test_subint_forkserver.py)
Skip-mark the still-hanging
`test_nested_multierrors[subint_forkserver]` via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test matrix
while the remaining bug is being chased. The mark is
an inert no-op until that (in-dev) backend lands.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 506617c695)
(factored: kept only the tests/test_cancellation.py skip-mark; dropped
the subint_forkserver conc-anal doc update)
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).
Deats,
- add `reg_addr: tuple` fixture param to:
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
two `open_nursery()` calls that previously had no
kwargs at all (in `test_cancel_via_SIGINT` and
`test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
to `test_nested_multierrors` so a hung run doesn't
wedge the whole session
Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1af2121057)
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.
Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
listing the spawn backends whose subactor runtime is trio-native
(`'trio'`, `'subint_forkserver'`). Both the enable-site
(`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
key.
off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
reports the full compatible-set + the rejected method instead of the
stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
`'subint_forkserver'` to the existing `('trio', 'subint')` tuple
— fork child-side runtime receives the same SpawnSpec IPC handshake as
the others.
- `subint_forkserver_proc` child-target now passes
`spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
`Actor.pformat()` / log lines reflect the actual parent-side spawn
mechanism rather than masquerading as plain `trio`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8bcbe730bf)
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.
Deats,
- Module-level `pytestmark` on full-file-hanging suites:
- `tests/test_cancellation.py`
- `tests/test_inter_peer_cancellation.py`
- `tests/test_pubsub.py`
- `tests/test_shm.py`
- Per-test decorator where only one test in the file
hangs:
- `tests/discovery/test_registrar.py
::test_stale_entry_is_deleted` — replaces the
inline `if start_method == 'subint': pytest.skip`
branch with a declarative skip.
- `tests/test_subint_cancellation.py
::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
place as breadcrumbs for later finer-grained unskips.
Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
(`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
`tests/conftest.py`, `tests/devx/conftest.py`, and
`tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
`@tractor_test(...)` on `test_cancel_infinite_streamer`
and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
`test_cancel_via_SIGINT`; use `start_method` in the
`'mp' in ...` check instead.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4b2a0886c3)
(factored: dropped spawn-backend-only path: tests/test_subint_cancellation.py)
Wrap the test's `trio.run(main)` in
`dump_on_hang(seconds=20)` so any future hang
regression captures a stack dump for triage instead
of wedging CI silently; under the default backends
it's a no-op safety net.
Includes a "KNOWN ISSUE" comment block documenting
the (future) `subint` backend hang classes observed
against this test during Phase B bringup (#379).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4a3254583b)
(factored: kept only the tests/discovery/test_registrar.py part of
"Doc `subint` backend hang classes + arm `dump_on_hang`"; dropped
subint conc-anal docs + tests/test_subint_cancellation.py)
Complete the xontrib-side slimming from "Mv core impl
`tractor_diag.xsh` to `_testing.trace`" (its xontrib hunk
couldn't ride with the testing-harness segment since the
xontrib lands in this one) and sync a couple `_reap.py`
issue/doc-ref comment strings to their final form.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Make the sub-actor proc-title prefix a single
authoritative constant (`_proctitle._def_prefix`) so
the reap-recognition markers and `xontrib` banner pick
it up automatically — one place to flip the prefix
shape going fwd.
Deats,
- `_proctitle._def_prefix: str = '_subactor'`. New
module-level const consumed by everything that needs
to know the prefix.
- `set_actor_proctitle(actor, prefix=_def_prefix)`:
takes an explicit `prefix` arg (default = the const)
so callers can override per-spawn if they want.
- Default proc-title format:
`'tractor[<reprol>]'` → `f'{prefix}[<reprol>]'`
i.e. `_subactor[<reprol>]` by default.
- `_testing/_reap.py`: cmdline + comm markers source
the prefix from `_proctitle._def_prefix` instead of
the hardcoded `'tractor['`. So
`_is_tractor_subactor()` tracks the const
automatically.
- `xontrib/tractor_diag.xsh`: `acli.reap` orphan-mode
banner now interpolates the
`_TRACTOR_PROC_CMDLINE_MARKERS` tuple directly so
the human-readable mode line stays in sync if the
prefix shape changes again.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3a45dbd503)
Per-terminal optimized `watch`-like xonsh alias that
runs an arbitrary callable alias in a loop inside the
alt-screen buffer with flicker-free repaint. Supersedes
the inline `acli.ptree` polling .xsh snippet (removed
from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).
Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
(`\033[?25l/h`) wrapped in try/finally so Ctrl-C always
returns to a pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line
EL (`\033[K` before each `\n`) + post-draw erase-down
(`\033[J`) → stale tail chars from a longer prior
frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next
frame does a full clear (`\033[H\033[2J`) instead of
the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler`
so `KeyboardInterrupt` lands cleanly; prior handler
restored on exit.
- Output capture: redirect the alias's stdout to
`StringIO` per frame so we can post-process the EL
fix. Aliases writing directly to `sys.stdout.buffer`
/ `os.write(1)` bypass capture — EL-fix won't apply
but loop still works.
- Alias unwrap: xonsh stores callables as either a bare
callable OR `[fn, *preset_args]`. Both forms handled;
subprocess-style aliases rejected w/ a friendly err
msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest
of argv forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.
Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch`
entry.
- Top-level imports: `os`, `sys`, `signal`, `time`,
`typing.Callable`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit bb239e847f)
Split the old `live`/`orphans` sock classification
into three ppid-aware buckets: `live-active` (PID
alive, parent owns it), `orphaned-alive` (PID alive
but `ppid==1`, init-adopted — `acli.reap` candidate),
and `orphaned-dead` (PID gone, sock stale).
Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
PID, handles the tricky `(comm)` field (can contain spaces/parens) by
splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both
alive-orphan graceful cancel + dead-sock cleanup
in one shot; manual `rm` kept as fallback.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9bbb6f796b)
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>` — useful for projects like
`piker` that bind sibling sub-dirs alongside tractor's
default. Full paths still work as-is.
Also,
- rename "unparseable" section to "non-tractor" with
clearer desc (filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for
non-tractor socks so callers can manually resolve
listener-PID liveness
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 099104e0af)
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.
Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
pkg with `ImportError` guard; optional at runtime
but listed in `pyproject.toml` so default installs
benefit.
- called early in `_child._actor_child_main()` after
`Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
unit test, `/proc/{cmdline,comm}` integration
test, negative detection test.
Resolves#457
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d60245777e)
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.
Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
`tractor._child` markers, falls back to `/proc/<pid>/comm` for
zombie-resilient detection (kernel preserves `comm` past exit
until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
— the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
`/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
`_is_tractor_subactor()` for intrinsic sub-actor ID instead of
cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
zombie-state sub-actors
Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
distinguish `system.slice` services vs `user.slice` desktop apps
from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
`user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
tractor importable from active venv
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 522b57570b)
Group all xontrib aliases under an `acli.` prefix
so xonsh prefix-completion treats them as a sub-cmd
group — `acli.<TAB>` lists the full set. No parent
`acli` cmd exists; the dot is purely naming.
Renames (incl `-` -> `_` in suffixes for shell-
identifier-friendliness):
- `pytree` -> `acli.pytree`
- `hung-dump` -> `acli.hung_dump`
- `bindspace-scan` -> `acli.bindspace_scan`
Add new `acli.reap` wrapping `scripts/tractor-reap`:
Deats,
- 3 opt-in phases via flags:
1. process reap — `find_orphans()` (default,
PPid=1 + cwd=repo + cmdline `python`) or
`find_descendants(--parent PID)`. SIGINT
first, SIGKILL after `--grace` (def 3.0s).
2. `/dev/shm` sweep (`--shm`/`--shm-only`) —
`find_orphaned_shm()` + `reap_shm()`. needed
bc `tractor` disables `mp.resource_tracker`.
3. UDS sock-file sweep (`--uds`/`--uds-only`) —
`find_orphaned_uds()` + `reap_uds()` for stale
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock`
entries. See #452.
- `--dry-run` lists matches without signalling/
unlinking; survivor pids or sweep errors flip
the alias rc to `1`.
- lazy-imports `tractor._testing._reap` after
`git rev-parse --show-toplevel` (with
`Path(__file__).parent.parent` fallback) so the
contrib is loadable before the venv is on
`sys.path`.
- `argparse.SystemExit` on `-h`/bad-args is
caught + returned as the alias rc instead of
killing xonsh.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit cec6cc2a56)
Extend `pytree` with two usability improvements:
- `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at
the top preserving contiguous parent-child shape (no
severity-grouping), so the full tree structure is visible without
cross-ref'ing between severity buckets.
- Cross-bucket parent annotation: when a row's parent (by ppid) lives in
a *different* severity bucket, suffix with `[parent: <pid> (in
`<bucket>`)]` so the `└─` marker resolves even when bucketing scatters
parent/child into separate sections.
Also,
- split arg parsing into flag vs positional args.
- add `pid_to_bucket` dict + `walk_order` list to back both features
- rename inner `ppid` shadow to `ppid_str` to avoid collision with the
outer `ppid` variable.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0f4e671862)
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).
Previously, procs whose `ppid` points to some OTHER live non-`pytest`
(in the use of `acli.ptree pytest`) process belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon) and were being falsely flagged as cross-test ghosts.
Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
filter rule + its rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit a6d4ac3aac)
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 01ce2857ea)
(factored: also default `find_orphans(repo_root=None)` -> cwd so the new bare call sites work ahead of the later intrinsic-identity rewrite)
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).
Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8de684f5de)
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
populated by `_do_capture_snapshot()` on each successful dump;
add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
prints all captured snapshot dirs at end-of-session so paths
don't get buried in scrollback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit fb87c36263)
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
`tractor._child` procs NOT in the walk's `seen` set — surfaces
ghost subactor trees from prior test runs (cross-test launchpad
contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
into `_classify_walk()` closure, walk stray roots as additional
trees, emit stray-root summary in header. Also: `tractor._child`
procs reparented to init are now always classified as orphans
regardless of cgroup-slice (leaked subactor ≠ desktop-launched
app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
`--capture=sys` redirection so snapshot paths always land on the
real terminal
- `fail_after_w_trace()`: capture diag snapshot on
non-`TooSlowError` exceptions when the `fail_after` scope's
cancel had already fired (e.g. nursery wraps `Cancelled` into a
`BaseExceptionGroup` that escapes before `TooSlowError` can be
raised).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3a243a1fd4)
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.
Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
hung-state dumper, bindspace scanner, `dump_all()` snapshot
archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
(trio `fail_after` + auto-snapshot on `TooSlowError`),
`afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
`SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
`_testing.trace` — drops ~627 lines of inline impl. Add
`acli.dump_all` alias for full snapshot-bundle CLI access.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7509e313ff)
(factored: the xontrib-side move-out hunk rides with the diag-xontrib segment)
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:
- CI (`CI` env-var set): `pytest.exit(rc=2)` on
mismatch — forces every matrix-row to declare
`--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
experiment with `--capture=fd` to validate fixes.
Deats,
- drop `_cap_fd_set` global; add
`_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
spawner-name lookup
- move inline comment wall → proper docstring w/
Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
`request: pytest.FixtureRequest` and reads
`request.config.option.capture` instead of the
`_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
capture deadlocks)
- `set_fork_aware_capture()` returns the actual
capture mode str from config, not a hardcoded
`'sys'`
- lift `import warnings` to module level (was duped
inside `pytest_configure`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 255c9c3a7c)
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.
Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e4953851de)
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).
Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. Session-end reaper still
SIGINT/SIGKILL survivors on normal exit.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5cf0312c78)
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.
Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enalbes skipping tests that need capsys
but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
wrapper's fixture injection.
Also,
- add `alert_on_finish` session fixture (terminal
bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
our plugin eventually however..
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d549c72052)
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.
Also,
- load `tractor._testing.pytest` via `-p` in ini
(bc bootstrapping hooks must register before
conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
and pop env var on disable.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 61d4525137)
Wire up `find_orphaned_uds()` + `reap_uds()` from
`_reap` as a new phase-3 UDS sweep in the CLI
script. Opt-in via `--uds` (run after proc reap +
shm) or `--uds-only` (skip other phases).
Also,
- consolidate skip-proc-reap logic into a single
`skip_proc_reap` bool covering both `--shm-only`
and `--uds-only`
- extend header docstring + usage examples
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0996a83655)
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:
- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
`reap_uds()` — detect `<name>@<pid>.sock` files under
`${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
lingering subactors, wait, SIGKILL survivors, then sweep orphaned
UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
snapshot sock-file dir before/after each test, warn + reap new
orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
with known-leaky teardown.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1cdc7fb302)
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.
Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. Recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.
Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.
For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.
For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.
`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.
ALSO, bump `xonsh` to at least `0.23.0` release.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3c366cac13)
(factored: the xonsh pin/editable-source hunks already landed with the devenv segment)
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.
Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...
Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.
Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b376eb0332)
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".
Switch to per-call `random.randint()` salted with `os.getpid()`
so:
- within one session: two calls return distinct ports — e.g.
`test_tpt_bind_addrs::bind-subset-reg` now actually gets two
different reg addrs on the TCP backend (it was silently
duplicating before),
- across parallel sessions: pid salt biases each process's
port choices apart, making cross-run collisions
vanishingly rare.
Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7c5dd4d033)
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.
Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
+ `reap_shm()` helpers. Match criteria: regular file
under `/dev/shm`, owned by current uid, AND no live
proc has it open (mmap'd or fd-held). In-use
enumeration via `psutil.Process.memory_maps()` +
`.open_files()` — xplatform, kernel-canonical (same
answer `lsof` would give), no reliance on
tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
`NotImplementedError` outside Linux/FreeBSD bc macOS
POSIX shm has no fs-visible path (`shm_open` only)
and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
process reap) and `--shm-only` (skip process phase)
flags. `-n` dry-runs both phases. Exit code is `1`
if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
the `testing` dep group; lazy-imported in `_reap.py`
so the process-reap path stays import-clean without
it.
Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
(new section 10c) — covers match criteria + the
preservation guarantee for unrelated apps.
- flip mitigation status in
`subint_forkserver_mp_shared_memory_issue.md` from
"could extend `tractor-reap`" to "implemented", with
a note that callers should still UUID-pin shm keys to
avoid cross-session collisions.
Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4f12d69b41)
(factored: dropped subint_forkserver conc-anal doc update)