Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
`track_orphaned_uds_per_test`,
`detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
MTF `xfail(strict=False)` with detailed race-window comment
explaining the BEG shape mismatch, wrap body in
`fail_after_w_trace` with per-backend timeout budget, bump
`@tractor_test(timeout=10)`, drop old multiprocessing depth
special-casing
- `test_multierror_fast_nursery`: wrap in
`fail_after_w_trace(30.0)`, accept `TooSlowError` in
`pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
`spawn_backend` param for `is_forking_spawner`, widen
`fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
`test_cancel_infinite_streamer`, `test_some_cancels_all`: add
`set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
covered by module-level `pytestmark`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Append "Snapshot evidence (2026-05-13)" section to
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`
documenting `fail_after_w_trace` diag capture results for
`test_nested_multierrors` under the MTF backend — reproduction cmd,
ptree analysis, observed hang signature, and updated triage plan.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).
Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
populated by `_do_capture_snapshot()` on each successful dump;
add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
prints all captured snapshot dirs at end-of-session so paths
don't get buried in scrollback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
`tractor._child` procs NOT in the walk's `seen` set — surfaces
ghost subactor trees from prior test runs (cross-test launchpad
contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
into `_classify_walk()` closure, walk stray roots as additional
trees, emit stray-root summary in header. Also: `tractor._child`
procs reparented to init are now always classified as orphans
regardless of cgroup-slice (leaked subactor ≠ desktop-launched
app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
`--capture=sys` redirection so snapshot paths always land on the
real terminal
- `fail_after_w_trace()`: capture diag snapshot on
non-`TooSlowError` exceptions when the `fail_after` scope's
cancel had already fired (e.g. nursery wraps `Cancelled` into a
`BaseExceptionGroup` that escapes before `TooSlowError` can be
raised).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.
Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
hung-state dumper, bindspace scanner, `dump_all()` snapshot
archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
(trio `fail_after` + auto-snapshot on `TooSlowError`),
`afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
`SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
`_testing.trace` — drops ~627 lines of inline impl. Add
`acli.dump_all` alias for full snapshot-bundle CLI access.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner`
param, wrap `main()` in `fa_main()` with per-backend
`trio.fail_after` (4s fork / 1s trio) to cap cancel-cascade
teardown that compounds under forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param
for `is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so
KBI firing before `open_context` body doesn't `UnboundLocalError`,
add `pytest.fail` guard for the spawn-time IPC race case, arm
`signal.alarm` AFK-safety cap (10s) under fork backends
Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
`detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
place).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
For forking spawner backends that is.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `skipon_spawn_backend('subint')`: expand reason with specific
analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
blame-attribute UDS sock-file orphans left by SIGKILL cancel
cascades.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
`trio.fail_after(2)` + commented `capfd.disabled()` investigation
(pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on fork-spawner
hang prevention.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Outer `signal.alarm` cap that fires even when trio's
`fail_after` is blocked by a shielded-await deadlock
(the bug-class-3 hang under MTF backends). Only armed
for fork-based spawners where the bug lives.
Deats,
- `_DIAG_CAP_S = fail_after_s + 5` — slightly larger than the
trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).
Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
TODO re `pytest.raises()`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Split the old `live`/`orphans` sock classification
into three ppid-aware buckets: `live-active` (PID
alive, parent owns it), `orphaned-alive` (PID alive
but `ppid==1`, init-adopted — `acli.reap` candidate),
and `orphaned-dead` (PID gone, sock stale).
Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
PID, handles the tricky `(comm)` field (can contain spaces/parens) by
splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both
alive-orphan graceful cancel + dead-sock cleanup
in one shot; manual `rm` kept as fallback.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `capture` dimension to CI matrix so fork-based
backends run `--capture=sys` (fork-child × `--capture=fd`
is a known deadlock). Non-fork backends keep `fd`.
Deats,
- two `include:` rows for `main_thread_forkserver` on
linux py3.13: tcp + uds, both `capture: 'sys'`
- job name updated to show `capture=` mode
- timeout bumped 16 -> 20 min to accommodate the
additional matrix cells
- `--capture=${{ matrix.capture }}` replaces hardcoded
`--capture=fd`
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md`
documenting the spawn-time rc=2 race under rapid
same-name spawning against a forkserver + registrar
— the `wait_for_peer_or_proc_death` helper now surfaces
the death instead of parking forever on the handshake
wait.
Also,
- extract inline `xfail` into module-level
`_DOGGY_BOOT_RACE_XFAIL` marker.
- apply it to `n_dups=8` too (previously bare) bc
larger N widens the race window enough to fire
occasionally.
- link to tracking issue #456.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Forking spawner + UDS transport has different timing
vs `trio_proc` — streaming example completes faster
in some cases, slower in others depending on fork
overhead + sock setup.
Deats,
- add `expect_cancel` param to `cancel_after()`, raise
`ActorTooSlowError` when cancel scope fires unexpectedly instead of
silently returning `None`.
- `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit
`ActorTooSlowError` on `None` result instead of bare `assert results`.
- `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast"
(cancel doesn't fire bc streaming finishes before delay).
- add `is_forking_spawner`, `tpt_proto` fixture params throughout.
Also,
- `_testing/pytest.py`: widen `start_method` parametrize and
`is_forking_spawner` fixture to `scope='session'`.
- `"""` -> `'''` docstring style throughout.
- hoist `_non_linux` to module scope (was redefined locally in two
places).
- type hints, kwarg-style `partial()` calls.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>` — useful for projects like
`piker` that bind sibling sub-dirs alongside tractor's
default. Full paths still work as-is.
Also,
- rename "unparseable" section to "non-tractor" with
clearer desc (filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for
non-tractor socks so callers can manually resolve
listener-PID liveness
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add module-level `pytestmark` applying per-test
`reap_subactors_per_test`, `track_orphaned_uds_per_test`, and
`detect_runaway_subactors_per_test` fixtures — registrar tests stress
discovery roundtrips that historically left orphaned UDS sock-files.
Deats,
- drop unused `say_hello()` fn, keep only `say_hello_use_wait`;
rename param `func` -> `ria_fn`.
- use `@tractor_test(timeout=7)` instead of separate
`@pytest.mark.timeout(7, method='thread')` decorator.
- add `with_timeout()` helper, wire into
`test_subactors_unregister_on_cancel_remote_daemon`.
- uncomment `_timeout_main()` in `test_stale_entry_is_deleted`, use
configurable `timeout` var + `debug_mode` guard for `tractor.pause()`
on cancel.
- `dump_on_hang(seconds=timeout*2)` instead of hardcoded `20`.
- fix typo "oustanding" -> "outstanding".
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Comment/docstring updates: `subint_forkserver` is a clean
`NotImplementedError` stub — not an alias to variant-1
(`main_thread_forkserver`). Key reserved in-place (not aliased) so
the subint-hosted-child impl can flip without API churn once
jcrist/msgspec#1026 unblocks PEP 684 subints.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.
Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
`tractor._child` markers, falls back to `/proc/<pid>/comm` for
zombie-resilient detection (kernel preserves `comm` past exit
until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
— the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
`/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
`_is_tractor_subactor()` for intrinsic sub-actor ID instead of
cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
zombie-state sub-actors
Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
distinguish `system.slice` services vs `user.slice` desktop apps
from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
`user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
tractor importable from active venv
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.
Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
pkg with `ImportError` guard; optional at runtime
but listed in `pyproject.toml` so default installs
benefit.
- called early in `_child._actor_child_main()` after
`Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
unit test, `/proc/{cmdline,comm}` integration
test, negative detection test.
Resolves#457
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extend `test_register_duplicate_name` w/ cancel-level log
breadcrumbs and `try/finally` for better diag on the cancel-cascade
hang.
Add `test_dup_name_cancel_cascade_escalates_to_hard_kill` as a
regression test for the TCP+MTF duplicate-name cancel-cascade
deadlock. Spawns N same-name actors, calls `an.cancel()`, and
asserts teardown completes within a `trio.fail_after()` budget that
scales w/ `n_dups`.
Deats,
- parametrize `n_dups` (2, 4, 8) to widen the race window for
concurrent `register_actor` RPCs.
- `n_dups=4` xfail'd — exposes a separate boot-race bug (doggy
`rc=2` under rapid same-name spawn), tracked in #456.
- post-teardown asserts all `Portal` chans disconnect, verifying
hard-kill escalation worked.
Relates to https://github.com/goodboy/tractor/issues/456
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Race `IPCServer.wait_for_peer(uid)` against the sub-proc's
`.wait()` inside a `trio` nursery; whichever completes first
cancels the other.
Prevents the spawning task from parking forever on an unsignalled
`_peer_connected[uid]` event when a sub-actor dies during boot
(e.g. crashed on import before reaching `_actor_child_main`).
Instead of hanging, raises `ActorFailure` w/ the proc's exit code
for clean supervisor error reporting.
Also,
- use the new racer in `main_thread_forkserver_proc()` spawn path.
- keep `proc_wait` generic so each backend passes its own callable
(`trio.Process.wait`, `_ForkedProc.wait`, etc.).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Group all xontrib aliases under an `acli.` prefix
so xonsh prefix-completion treats them as a sub-cmd
group — `acli.<TAB>` lists the full set. No parent
`acli` cmd exists; the dot is purely naming.
Renames (incl `-` -> `_` in suffixes for shell-
identifier-friendliness):
- `pytree` -> `acli.pytree`
- `hung-dump` -> `acli.hung_dump`
- `bindspace-scan` -> `acli.bindspace_scan`
Add new `acli.reap` wrapping `scripts/tractor-reap`:
Deats,
- 3 opt-in phases via flags:
1. process reap — `find_orphans()` (default,
PPid=1 + cwd=repo + cmdline `python`) or
`find_descendants(--parent PID)`. SIGINT
first, SIGKILL after `--grace` (def 3.0s).
2. `/dev/shm` sweep (`--shm`/`--shm-only`) —
`find_orphaned_shm()` + `reap_shm()`. needed
bc `tractor` disables `mp.resource_tracker`.
3. UDS sock-file sweep (`--uds`/`--uds-only`) —
`find_orphaned_uds()` + `reap_uds()` for stale
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock`
entries. See #452.
- `--dry-run` lists matches without signalling/
unlinking; survivor pids or sweep errors flip
the alias rc to `1`.
- lazy-imports `tractor._testing._reap` after
`git rev-parse --show-toplevel` (with
`Path(__file__).parent.parent` fallback) so the
contrib is loadable before the venv is on
`sys.path`.
- `argparse.SystemExit` on `-h`/bad-args is
caught + returned as the alias rc instead of
killing xonsh.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:
graceful cancel-req -> bounded wait -> hard-kill
Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
of the legacy DEBUG-log + return-`False` path. Default stays `False`
for callers that handle their own escalation (e.g.
`_spawn.soft_kill()` polling `proc.poll()`).
- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
(SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
waiting on `proc.poll()`.
- replace `tn.start_soon(portal.cancel_actor)` in
`ActorNursery.cancel()` with the helper.
Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
opt-in).
ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.
Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:
graceful cancel-req -> bounded wait -> hard-kill
Motivated by #subint_forkserver dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Sends `SIGTERM` (graceful shutdown) instead of the existing `kill()`
which sends `SIGKILL`. Mirrors the `trio.Process.terminate()`
/ `multiprocessing.Process.terminate()` interface.
Used by `ActorNursery.cancel()`'s per-child escalation when
`Portal.cancel_actor()` raises `ActorTooSlowError`, and by the legacy
`hard_kill=True` branch. Swallows `ProcessLookupError` (child already
dead) same as `kill()`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:
- CI (`CI` env-var set): `pytest.exit(rc=2)` on
mismatch — forces every matrix-row to declare
`--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
experiment with `--capture=fd` to validate fixes.
Deats,
- drop `_cap_fd_set` global; add
`_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
spawner-name lookup
- move inline comment wall → proper docstring w/
Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
`request: pytest.FixtureRequest` and reads
`request.config.option.capture` instead of the
`_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
capture deadlocks)
- `set_fork_aware_capture()` returns the actual
capture mode str from config, not a hardcoded
`'sys'`
- lift `import warnings` to module level (was duped
inside `pytest_configure`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extend `pytree` with two usability improvements:
- `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at
the top preserving contiguous parent-child shape (no
severity-grouping), so the full tree structure is visible without
cross-ref'ing between severity buckets.
- Cross-bucket parent annotation: when a row's parent (by ppid) lives in
a *different* severity bucket, suffix with `[parent: <pid> (in
`<bucket>`)]` so the `└─` marker resolves even when bucketing scatters
parent/child into separate sections.
Also,
- split arg parsing into flag vs positional args.
- add `pid_to_bucket` dict + `walk_order` list to back both features
- rename inner `ppid` shadow to `ppid_str` to avoid collision with the
outer `ppid` variable.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).
Also,
- update `test_root_passes_tpt_to_sub` to detect a
proto mismatch between parametrized `tpt_proto_key`
and CLI `tpt_proto`, asserting the new guard raises
`ValueError` with expected msg content.
- replace old commented-out notes with a clearer
explanation of the mismatch foot-gun.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Draft plan for consolidating pytest CLI flags,
ad-hoc env vars, and hardcoded fixture defaults
into the existing (but unused) `RuntimeVars`
struct as the single source of truth.
Deats,
- `_rtvars.py` leaf mod w/ `dump`/`load`/`get`/
`update` helpers using `str(dict)` +
`ast.literal_eval` encoding
- phased migration: test infra first, then
runtime callers, then per-session bindspace
- addresses concurrent pytest session collisions
and subproc env propagation for `devx/` scripts
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.
Also,
- close each endpoint independently in the finally so one raise doesn't
strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.
Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
All `daemon` fixture consumers are discovery-
protocol tests now living under `tests/discovery/`.
Move the fixture, its `_wait_for_daemon_ready`
helper, and `test_multi_program.py` into that subdir
so scope matches usage.
Also,
- add `pytestmark` for `track_orphaned_uds_per_test`
+ `detect_runaway_subactors_per_test` to `test_multi_program` as
regression net.
- drop now-unused `_PROC_SPAWN_WAIT` + `socket` import from root
conftest.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
First draft at resolving,
https://github.com/goodboy/tractor/issues/424
`tests.conftest.py.daemon()` previously used a blind
`time.sleep(_PROC_SPAWN_WAIT + uds_bonus + ci_bonus)` to "wait for the
daemon to come up" before yielding the proc to the test.
Two problems:
1. **Racy under load** — sleep is fixed at design time; loaded boxes
/ cold starts / fork-spawn cost spikes blow past it, leading to
`ConnectionRefusedError` /`OSError: connect failed` flakes in
`test_register_duplicate_name`.
2. **Wasteful when daemon comes up fast** — happy-path pays the FULL
sleep regardless. ~3s of dead time per fixture invocation, ~10-20s
per full suite run.
Replace with `_wait_for_daemon_ready()` — active poll via stdlib
`socket.create_connection` (TCP) or `socket.connect` (UDS) on the
daemon's bind addr, with 50ms backoff and a 10s/15s deadline (CI gets
extra headroom). Daemon-died-during-startup early-exit catches the case
where `_PROC_SPAWN_WAIT` was silently masking daemon startup crashes.
Why stdlib `socket` (Option 2 from the conc-anal doc) instead of
`tractor`'s own `_root.ping_tpt_socket` closure or trio?
- `tractor.run_daemon()` doesn't return from bootstrap until the runtime
is fully ready to handle IPC, so probing listen-side acceptance is
sufficient.
- no need to do the full IPC handshake just to validate readiness.
Sidesteps the `trio.run()` bootstrap cost (~50ms) per fixture too.
`claude`'s verification: 10/10 runs of `tests/test_multi_program.py`
pass on both `--tpt-proto=tcp` and `--tpt-proto=uds`. Per-test wall-time
`test_register_duplicate_name`: 4.31s → 1.10s. Full file: ~12s → 3.27s
per transport.
Doc-tracked at:
`ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md`
Future work — session-scoped trio runtime in a bg thread to share
fixture-side trio operations across many fixtures (currently overkill
for the one fixture that needs it).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Document the intermittent connect-refused failure in the registrar
daemon test — root cause is the `daemon` fixture's blind `time.sleep()`
readiness gate racing against the subproc's `bind()`/ `listen()`
completion. Distinct from the cancel- cascade `TooSlowError` flake
class.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use `is_forking_spawner` fixture + gate spawner-
specific expect patterns in nested-error and daemon
tests. Add `set_fork_aware_capture` to multi-sub
tests that need capture-mode awareness.
Deats,
- replace `start_method` param with `is_forking_spawner` bool fixture.
- bump inter-send delay to 0.1s for IPC stability under fork backends.
- gate `bdb.BdbQuit` + relay-uid patterns behind `not
is_forking_spawner` (not visible under capsys).
- add `expect(child, EOF)` to confirm clean exit.
- switch caught exc from `AssertionError` to `ValueError` in daemon
test.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
In top level `daemon`-fixture that is..
Use a local `bg_daemon_spawn_delay` instead of
mutating the module-level `_PROC_SPAWN_WAIT` —
previously each `daemon` fixture invocation would
permanently add 1.6s (UDS) or 1s (CI) to the
global, inflating delays across the session.
Also, emit a `test_log.warning()` when verbose
loglevel is silently reduced to `'info'`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Document the ~0.3% rotating `trio.TooSlowError`
flake under `--spawn-backend=main_thread_forkserver`
full-suite runs. Root cause: `hard_kill`'s per-sub
1.6s graceful timeout compounding across N subactors
in a cancel cascade, plus cumulative autouse-reaper
teardown overhead.
Covers symptom, observed flaking tests, root-cause
family, ranked mitigations (cap bump -> CPU-count-
aware cap -> `pytest-rerunfailures` -> `hard_kill`
tuning -> targeted profiling), and a verification
protocol.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.
New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.
Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.
First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
& `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
instance whose write-end is closed, and `drain()` then **spins forever
in C with no Python checkpoints**,
- this obviously burns 100% CPU and no signal delivery.
Standalone repro:
from trio._core._wakeup_socketpair import WakeupSocketpair
ws = WakeupSocketpair()
ws.write_sock.close()
ws.drain() # spins forever
Patch is one-line — break the drain loop on b'' EOF.
Manifested as two distinct test failures:
- `tests/test_multi_program.py::test_register_duplicate_name` hung at
100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
both threads parked in `epoll_wait`, no TCP connect-back to parent
ever happened.
Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).
Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.
Conc-anal write-ups, including strace + py-spy evidence:
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.
TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue
body's evidence section.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Inside a new new `tractor.spawn._reap` submod which kicks off providing
post-mortem subactor cleanup primitives, parent-side; consider it the
"sibling" of `tractor._testing._reap` which is the test-harness-oriented
brother mod.
Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454
where `hard_kill()`'s `SIGKILL` bypasses the subactor's
`_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files..
This adds 2 cleanup paths:
- explicit `bind_addrs` (when set at spawn time),
OR
- convention-based reconstruction from `subactor.aid.name + proc.pid`
for the random-self-assign case.
`.spawn.hard_kill()` now invokes the cleanup unconditionally
post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError`
skip.
Future work — authoritative tracking via a per-process
UDS bind-addr registry — documented in module docstring,
deferred to a follow-up PR.
Co-fix: `tractor/spawn/_trio.py::new_proc` already passes
`bind_addrs` + `subactor` to `hard_kill` via prior work
on this branch.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).
Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. Session-end reaper still
SIGINT/SIGKILL survivors on normal exit.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.
Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enalbes skipping tests that need capsys
but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
wrapper's fixture injection.
Also,
- add `alert_on_finish` session fixture (terminal
bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
our plugin eventually however..
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code