Commit Graph

2679 Commits (caebf60f4ebf78d0f3c7069d92fb5cff5d9c5c40)

Author SHA1 Message Date
Gud Boi caebf60f4e Add dup-name cancel-cascade escalation test
Extend `test_register_duplicate_name` w/ cancel-level log
breadcrumbs and `try/finally` for better diag on the cancel-cascade
hang.

Add `test_dup_name_cancel_cascade_escalates_to_hard_kill` as a
regression test for the TCP+MTF duplicate-name cancel-cascade
deadlock. Spawns N same-name actors, calls `an.cancel()`, and
asserts teardown completes within a `trio.fail_after()` budget that
scales w/ `n_dups`.

Deats,
- parametrize `n_dups` (2, 4, 8) to widen the race window for
  concurrent `register_actor` RPCs.
- `n_dups=4` xfail'd — exposes a separate boot-race bug (doggy
  `rc=2` under rapid same-name spawn), tracked in #456.
- post-teardown asserts all `Portal` chans disconnect, verifying
  hard-kill escalation worked.

Relates to https://github.com/goodboy/tractor/issues/456

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 23:33:23 -04:00
Gud Boi 3b0724eba8 Add `wait_for_peer_or_proc_death()` to `_spawn`
Race `IPCServer.wait_for_peer(uid)` against the sub-proc's
`.wait()` inside a `trio` nursery; whichever completes first
cancels the other.

Prevents the spawning task from parking forever on an unsignalled
`_peer_connected[uid]` event when a sub-actor dies during boot
(e.g. crashed on import before reaching `_actor_child_main`).
Instead of hanging, raises `ActorFailure` w/ the proc's exit code
for clean supervisor error reporting.

Also,
- use the new racer in `main_thread_forkserver_proc()` spawn path.
- keep `proc_wait` generic so each backend passes its own callable
  (`trio.Process.wait`, `_ForkedProc.wait`, etc.).
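
The racing pattern described above can be sketched roughly as follows.
This is an illustration only: it uses stdlib `asyncio` rather than
`trio` so it runs without third-party deps, and
`race_peer_or_proc_death()`, the local `ActorFailure` class and both
stand-in coroutines are hypothetical names, not the code added by this
commit.

```python
import asyncio


class ActorFailure(Exception):
    """Stand-in for the runtime's error type (hypothetical definition)."""


async def race_peer_or_proc_death(wait_for_peer, proc_wait):
    """Return the peer-connect result, or raise if the proc exits first."""
    peer_task = asyncio.ensure_future(wait_for_peer())
    proc_task = asyncio.ensure_future(proc_wait())
    done, pending = await asyncio.wait(
        {peer_task, proc_task},
        return_when=asyncio.FIRST_COMPLETED,
    )
    # Whichever finished first cancels the other.
    for task in pending:
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)

    if peer_task in done:
        return peer_task.result()
    raise ActorFailure(
        f"sub-proc exited during boot, rc={proc_task.result()}"
    )


async def _never_connects():
    # Parks forever, like waiting on an event nobody will ever set.
    await asyncio.Event().wait()


async def _dies_with_rc_1():
    await asyncio.sleep(0.01)
    return 1


def demo() -> str:
    try:
        asyncio.run(race_peer_or_proc_death(_never_connects, _dies_with_rc_1))
    except ActorFailure as err:
        return str(err)
    return "connected"
```

Passing `proc_wait` as a plain callable mirrors the "keep it generic"
point: the racer never needs to know which backend produced the proc.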

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 22:18:29 -04:00
Gud Boi cec6cc2a56 Add `acli.reap`, namespace `tractor_diag` cmds
Group all xontrib aliases under an `acli.` prefix
so xonsh prefix-completion treats them as a sub-cmd
group — `acli.<TAB>` lists the full set. No parent
`acli` cmd exists; the dot is purely naming.

Renames (incl `-` -> `_` in suffixes for shell-
identifier-friendliness):

  - `pytree`         -> `acli.pytree`
  - `hung-dump`      -> `acli.hung_dump`
  - `bindspace-scan` -> `acli.bindspace_scan`

Add new `acli.reap` wrapping `scripts/tractor-reap`:

Deats,
- 3 opt-in phases via flags:

  1. process reap — `find_orphans()` (default,
     PPid=1 + cwd=repo + cmdline `python`) or
     `find_descendants(--parent PID)`. SIGINT
     first, SIGKILL after `--grace` (def 3.0s).

  2. `/dev/shm` sweep (`--shm`/`--shm-only`) —
     `find_orphaned_shm()` + `reap_shm()`. needed
     bc `tractor` disables `mp.resource_tracker`.

  3. UDS sock-file sweep (`--uds`/`--uds-only`) —
     `find_orphaned_uds()` + `reap_uds()` for stale
     `${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock`
     entries. See #452.

- `--dry-run` lists matches without signalling/
  unlinking; survivor pids or sweep errors flip
  the alias rc to `1`.
- lazy-imports `tractor._testing._reap` after
  `git rev-parse --show-toplevel` (with
  `Path(__file__).parent.parent` fallback) so the
  contrib is loadable before the venv is on
  `sys.path`.
- `argparse.SystemExit` on `-h`/bad-args is
  caught + returned as the alias rc instead of
  killing xonsh.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 18:07:34 -04:00
Gud Boi 34f333a026 Escalate cancel-ack timeouts to `proc.terminate()`
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:

  graceful cancel-req -> bounded wait -> hard-kill

Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
  When `True`, bounded-wait expiry raises `ActorTooSlowError` instead
  of the legacy DEBUG-log + return-`False` path. Default stays `False`
  for callers that handle their own escalation (e.g.
  `_spawn.soft_kill()` polling `proc.poll()`).

- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
  cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
  (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
  waiting on `proc.poll()`.

- replace `tn.start_soon(portal.cancel_actor)` in
  `ActorNursery.cancel()` with the helper.

Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
  REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
  opt-in).

ORing covers root-debug, child-debug, and active-REPL-lock cases
without false-positively SIGTERM-ing a sub-tree proxying stdio for
a REPL session.

Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 18:01:59 -04:00
Gud Boi 38ffb875bd Add `ActorTooSlowError` for cancel-cascade timeouts
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:

  graceful cancel-req -> bounded wait -> hard-kill

Motivated by the `subint_forkserver` dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.
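
Why a distinct exception type matters can be shown with a small,
dependency-free sketch. Assumptions: `TooSlowError` below is only
a stand-in for `trio.TooSlowError`, `ActorTooSlowError` is assumed not
to inherit from it, and `FakeProc`, `cancel_actor()`, `legacy_caller()`
and `supervise()` are hypothetical helpers, not tractor's API.

```python
class TooSlowError(Exception):
    """Stand-in for `trio.TooSlowError` (avoids a trio dependency here)."""


class ActorTooSlowError(Exception):
    """Deliberately NOT a subclass of `TooSlowError`."""


class FakeProc:
    """Hypothetical process handle that records a hard-kill."""

    def __init__(self) -> None:
        self.terminated = False

    def terminate(self) -> None:
        self.terminated = True


def cancel_actor(acked: bool) -> bool:
    """Pretend cancel request: raises when no ack arrives in time."""
    if not acked:
        raise ActorTooSlowError("no cancel-ack within the bounded wait")
    return True


def legacy_caller(acked: bool) -> bool:
    """Existing-style handler that only knows about `TooSlowError`."""
    try:
        return cancel_actor(acked)
    except TooSlowError:
        # Would mask the timeout if the two types were related.
        return False


def supervise(proc: FakeProc, acked: bool) -> None:
    """graceful cancel-req -> bounded wait -> hard-kill."""
    try:
        legacy_caller(acked)
    except ActorTooSlowError:
        proc.terminate()  # escalate
```

Because the types are unrelated, the timeout passes straight through
`legacy_caller()` and reaches the supervisor, which can then escalate.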

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 16:39:10 -04:00
Gud Boi 4c00913b3b Add `terminate()` to `_ForkedProc`
Sends `SIGTERM` (graceful shutdown) instead of the existing `kill()`
which sends `SIGKILL`. Mirrors the `trio.Process.terminate()`
/ `multiprocessing.Process.terminate()` interface.

Used by `ActorNursery.cancel()`'s per-child escalation when
`Portal.cancel_actor()` raises `ActorTooSlowError`, and by the legacy
`hard_kill=True` branch. Swallows `ProcessLookupError` (child already
dead) same as `kill()`.
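
A minimal POSIX-only sketch of the SIGTERM/SIGKILL pair with the
`ProcessLookupError` swallow. `ForkedProcSketch` is a hypothetical
class written for illustration; it is not the actual `_ForkedProc`.

```python
import os
import signal
import subprocess
import sys


class ForkedProcSketch:
    """Hypothetical pid wrapper; not tractor's `_ForkedProc`."""

    def __init__(self, pid: int) -> None:
        self.pid = pid

    def _signal(self, sig: int) -> None:
        try:
            os.kill(self.pid, sig)
        except ProcessLookupError:
            pass  # child already dead and reaped

    def terminate(self) -> None:
        """Request a graceful shutdown via SIGTERM."""
        self._signal(signal.SIGTERM)

    def kill(self) -> None:
        """Force-kill via SIGKILL."""
        self._signal(signal.SIGKILL)


def demo() -> int:
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    proc = ForkedProcSketch(child.pid)
    proc.terminate()
    rc = child.wait(timeout=10)
    proc.terminate()  # no-op: the pid no longer exists
    return rc
```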

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 16:35:18 -04:00
Gud Boi 5cd06810db Tidy proto-guard `ValueError` fmt in `open_root_actor()`
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 16:24:23 -04:00
Gud Boi 255c9c3a7c Mk `--capture` guard CI-aware w/ local warn
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:

- CI (`CI` env-var set): `pytest.exit(rc=2)` on
  mismatch — forces every matrix-row to declare
  `--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
  experiment with `--capture=fd` to validate fixes.
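
The two policies can be sketched as a plain function. Assumptions:
`check_capture()` and `FORK_SPAWNERS` are hypothetical names, the
spawner set is guessed from the backends named elsewhere in this log,
and `SystemExit(2)` stands in for `pytest.exit()` so the sketch runs
without pytest.

```python
import warnings

# Assumed contents; the real lookup set may differ.
FORK_SPAWNERS = frozenset({"main_thread_forkserver", "mp_forkserver"})


def check_capture(spawner: str, capture: str, environ: dict) -> None:
    """Fail hard in CI, warn locally, on fork-spawner + fd-capture."""
    if spawner not in FORK_SPAWNERS or capture != "fd":
        return  # `sys`, `tee-sys`, etc. are treated as fork-safe here

    msg = f"--capture={capture} can hang under {spawner!r}; use --capture=sys"
    if "CI" in environ:
        raise SystemExit(2)  # stand-in for `pytest.exit()` with rc=2
    warnings.warn(msg)
```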

Deats,
- drop `_cap_fd_set` global; add
  `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
  spawner-name lookup
- move inline comment wall → proper docstring w/
  Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
  `request: pytest.FixtureRequest` and reads
  `request.config.option.capture` instead of the
  `_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
  capture deadlocks)
- `set_fork_aware_capture()` returns the actual
  capture mode str from config, not a hardcoded
  `'sys'`
- lift `import warnings` to module level (was duped
  inside `pytest_configure`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-07 16:17:13 -04:00
Gud Boi 0f4e671862 Add `--tree` flag and cross-bucket parent annos to `pytree`
Extend `pytree` with two usability improvements:

- `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at
  the top preserving contiguous parent-child shape (no
  severity-grouping), so the full tree structure is visible without
  cross-ref'ing between severity buckets.

- Cross-bucket parent annotation: when a row's parent (by ppid) lives in
  a *different* severity bucket, suffix with `[parent: <pid> (in
  `<bucket>`)]` so the `└─` marker resolves even when bucketing scatters
  parent/child into separate sections.

Also,
- split arg parsing into flag vs positional args.
- add `pid_to_bucket` dict + `walk_order` list to back both features
- rename inner `ppid` shadow to `ppid_str` to avoid collision with the
  outer `ppid` variable.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 19:04:55 -04:00
Gud Boi d036ef7d7f Add `enable_transports`/`registry_addrs` proto guard
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).
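
A rough sketch of such a guard, assuming addresses are modelled as
`(proto, address)` tuples. That shape, and the name
`check_registry_protos()`, are assumptions made for illustration and
not how `open_root_actor()` necessarily represents them.

```python
def check_registry_protos(registry_addrs, enable_transports) -> None:
    """Raise `ValueError` if any registry addr uses a disabled transport."""
    enabled = set(enable_transports)
    mismatches = [
        (proto, addr)
        for proto, addr in registry_addrs
        if proto not in enabled
    ]
    if mismatches:
        mismatch_lines = "\n".join(
            f"  {proto!r} -> {addr!r}" for proto, addr in mismatches
        )
        raise ValueError(
            f"registry_addrs use transports not in "
            f"enable_transports={sorted(enabled)}:\n{mismatch_lines}"
        )
```

Failing fast at startup turns what used to be a silent hang into an
immediate, readable error.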

Also,
- update `test_root_passes_tpt_to_sub` to detect a
  proto mismatch between parametrized `tpt_proto_key`
  and CLI `tpt_proto`, asserting the new guard raises
  `ValueError` with expected msg content.
- replace old commented-out notes with a clearer
  explanation of the mismatch foot-gun.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 15:13:02 -04:00
Gud Boi 7882c37ce0 Add `RuntimeVars` env-var lift design plan
Draft plan for consolidating pytest CLI flags,
ad-hoc env vars, and hardcoded fixture defaults
into the existing (but unused) `RuntimeVars`
struct as the single source of truth.

Deats,
- `_rtvars.py` leaf mod w/ `dump`/`load`/`get`/
  `update` helpers using `str(dict)` +
  `ast.literal_eval` encoding
- phased migration: test infra first, then
  runtime callers, then per-session bindspace
- addresses concurrent pytest session collisions
  and subproc env propagation for `devx/` scripts
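
The `str(dict)` + `ast.literal_eval` round-trip could look roughly like
this. The env-var name `TRACTOR_RTVARS_SKETCH` and the exact helper
signatures are hypothetical; the plan itself only names the helpers and
the encoding.

```python
import ast

ENV_KEY = "TRACTOR_RTVARS_SKETCH"  # hypothetical env-var name


def dump(rtvars: dict, env: dict) -> None:
    """Encode `rtvars` with `str(dict)` into an environ-like mapping."""
    env[ENV_KEY] = str(rtvars)


def load(env: dict) -> dict:
    """Decode with `ast.literal_eval`, which only accepts Python literals."""
    raw = env.get(ENV_KEY)
    if raw is None:
        return {}
    return ast.literal_eval(raw)


def update(env: dict, **changes) -> dict:
    """Merge `changes` into the stored vars and re-dump them."""
    merged = {**load(env), **changes}
    dump(merged, env)
    return merged
```

Since `ast.literal_eval` refuses anything but literals, the values are
limited to strings, numbers, bools, `None`, tuples, lists, dicts and
sets; in exchange, decoding never executes arbitrary code.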

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 15:02:13 -04:00
Gud Boi 2ee44a6fdd Fix shutdown deadlock on UDS unlink race
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.

Also,
- close each endpoint independently in the finally so one raise doesn't
  strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.
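
The shape of the fix, sketched with stdlib pieces only. A
`threading.Event` stands in for the runtime's shutdown event, and
`serve_then_shutdown()` is a hypothetical name, not the real
`_serve_ipc_eps`.

```python
import os
import threading


def close_listener(sockpath: str) -> None:
    """Unlink the sock-file, tolerating one that was already reaped."""
    try:
        os.unlink(sockpath)
    except FileNotFoundError:
        pass


def serve_then_shutdown(
    sockpaths: list[str],
    shutdown: threading.Event,
) -> list[Exception]:
    """Close every endpoint independently, then always signal shutdown."""
    errors: list[Exception] = []
    try:
        pass  # the serving loop would run here
    finally:
        for path in sockpaths:
            try:
                close_listener(path)
            except OSError as err:
                # One failing endpoint must not strand the others.
                errors.append(err)
        shutdown.set()
    return errors
```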

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 14:11:51 -04:00
Gud Boi 7b14fdcd96 Add `tractor_diag`(nosis) xontrib with aliases
Xonsh xontrib providing three diagnostic commands
for tractor development / hang investigation:

- `pytree <pid|pat>` — psutil-backed proc tree with severity-bucketed
  output (zombies > orphans > live), tree-depth markers, zombie-safe
  rendering.
- `hung-dump <pid|pat>` — kernel `wchan`/`stack` + `py-spy dump
  --locals` per descendant, sudo-cred caching upfront, pgrep fallback
  when psutil absent.
- `bindspace-scan [<dir>]` — scan UDS bindspace for orphaned
  `<name>@<pid>.sock` files whose binder pid is dead, emit `rm`
  one-liner for cleanup.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 14:07:24 -04:00
Gud Boi e4953851de Mk per-test reap fixtures opt-in
Rename `_track_orphaned_uds_per_test` and
`_detect_runaway_subactors_per_test` to public names (drop `_` prefix),
drop `autouse=True`. Tests that need per-test reap blame now opt in via
`pytestmark = pytest.mark.usefixtures(...)`.

Also,
- reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper
  per pid.
- add empty-`only_pids` fast-path in `find_runaway_subactors` to skip
  psutil import when no descendants were spawned.
- extract `new_pids` intermediate var for clarity.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 13:29:49 -04:00
Gud Boi c4082be876 Mv `daemon` + `test_multi_program` to `discovery/`
All `daemon` fixture consumers are discovery-
protocol tests now living under `tests/discovery/`.
Move the fixture, its `_wait_for_daemon_ready`
helper, and `test_multi_program.py` into that subdir
so scope matches usage.

Also,
- add `pytestmark` for `track_orphaned_uds_per_test`
  + `detect_runaway_subactors_per_test` to `test_multi_program` as a
  regression net.
- drop now-unused `_PROC_SPAWN_WAIT` + `socket` import from root
  conftest.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-06 13:23:42 -04:00
Gud Boi ec8c4659c4 Replace sleep with active poll in `daemon` fixture
First draft at resolving,
https://github.com/goodboy/tractor/issues/424

`tests.conftest.py.daemon()` previously used a blind
`time.sleep(_PROC_SPAWN_WAIT + uds_bonus + ci_bonus)` to "wait for the
daemon to come up" before yielding the proc to the test.

Two problems:

1. **Racy under load** — sleep is fixed at design time; loaded boxes
   / cold starts / fork-spawn cost spikes blow past it, leading to
   `ConnectionRefusedError` / `OSError: connect failed` flakes in
   `test_register_duplicate_name`.

2. **Wasteful when daemon comes up fast** — happy-path pays the FULL
   sleep regardless. ~3s of dead time per fixture invocation, ~10-20s
   per full suite run.

Replace with `_wait_for_daemon_ready()` — active poll via stdlib
`socket.create_connection` (TCP) or `socket.connect` (UDS) on the
daemon's bind addr, with 50ms backoff and a 10s/15s deadline (CI gets
extra headroom). Daemon-died-during-startup early-exit catches the case
where `_PROC_SPAWN_WAIT` was silently masking daemon startup crashes.
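
A sketch of the TCP half of such a poll, under stated assumptions:
`wait_for_tcp_ready()` and its `proc_poll` parameter are hypothetical
names, and the defaults simply echo the 50ms backoff and 10s deadline
mentioned above.

```python
import socket
import time


def wait_for_tcp_ready(host, port, deadline=10.0, backoff=0.05, proc_poll=None):
    """Poll until `host:port` accepts a connection or `deadline` expires.

    `proc_poll` is an optional callable returning the daemon's exit code,
    or `None` while it is still running (e.g. `subprocess.Popen.poll`).
    """
    give_up_at = time.monotonic() + deadline
    while True:
        # Early exit: do not keep polling a daemon that already crashed.
        if proc_poll is not None and (rc := proc_poll()) is not None:
            raise RuntimeError(f"daemon died during startup, rc={rc}")
        try:
            with socket.create_connection((host, port), timeout=backoff):
                return  # listen-side accepted: ready
        except OSError:
            if time.monotonic() >= give_up_at:
                raise TimeoutError(
                    f"{host}:{port} not ready after {deadline}s"
                ) from None
            time.sleep(backoff)
```

The happy path returns on the first successful connect, so a fast
daemon costs roughly one connection attempt instead of a fixed sleep.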

Why stdlib `socket` (Option 2 from the conc-anal doc) instead of
`tractor`'s own `_root.ping_tpt_socket` closure or trio?

- `tractor.run_daemon()` doesn't return from bootstrap until the runtime
  is fully ready to handle IPC, so probing listen-side acceptance is
  sufficient.
- no need to do the full IPC handshake just to validate readiness.
  Sidesteps the `trio.run()` bootstrap cost (~50ms) per fixture too.

`claude`'s verification: 10/10 runs of `tests/test_multi_program.py`
pass on both `--tpt-proto=tcp` and `--tpt-proto=uds`. Per-test wall-time
`test_register_duplicate_name`: 4.31s → 1.10s. Full file: ~12s → 3.27s
per transport.

Doc-tracked at:
`ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md`

Future work — session-scoped trio runtime in a bg thread to share
fixture-side trio operations across many fixtures (currently overkill
for the one fixture that needs it).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 20:03:41 -04:00
Gud Boi 29f9928524 Add `test_register_duplicate_name` race analysis
Document the intermittent connect-refused failure in the registrar
daemon test — root cause is the `daemon` fixture's blind `time.sleep()`
readiness gate racing against the subproc's `bind()`/`listen()`
completion. Distinct from the cancel-cascade `TooSlowError` flake
class.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 20:01:08 -04:00
Gud Boi 086e9f2c07 Use single f-string per pid in runaway warning
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 19:58:11 -04:00
Gud Boi 9031605807 Harden `test_debugger` for forkserver spawners
Use `is_forking_spawner` fixture + gate spawner-
specific expect patterns in nested-error and daemon
tests. Add `set_fork_aware_capture` to multi-sub
tests that need capture-mode awareness.

Deats,
- replace `start_method` param with `is_forking_spawner` bool fixture.
- bump inter-send delay to 0.1s for IPC stability under fork backends.
- gate `bdb.BdbQuit` + relay-uid patterns behind `not
  is_forking_spawner` (not visible under capsys).
- add `expect(child, EOF)` to confirm clean exit.
- switch caught exc from `AssertionError` to `ValueError` in daemon
  test.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 19:21:49 -04:00
Gud Boi c4885f9d99 Drop global mutation of `_PROC_SPAWN_WAIT`
In top level `daemon`-fixture that is..

Use a local `bg_daemon_spawn_delay` instead of
mutating the module-level `_PROC_SPAWN_WAIT` —
previously each `daemon` fixture invocation would
permanently add 1.6s (UDS) or 1s (CI) to the
global, inflating delays across the session.

Also, emit a `test_log.warning()` when verbose
loglevel is silently reduced to `'info'`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 16:23:50 -04:00
Gud Boi 60ce713016 Add cancel-cascade `TooSlowError` flake analysis
Document the ~0.3% rotating `trio.TooSlowError`
flake under `--spawn-backend=main_thread_forkserver`
full-suite runs. Root cause: `hard_kill`'s per-sub
1.6s graceful timeout compounding across N subactors
in a cancel cascade, plus cumulative autouse-reaper
teardown overhead.

Covers symptom, observed flaking tests, root-cause
family, ranked mitigations (cap bump -> CPU-count-
aware cap -> `pytest-rerunfailures` -> `hard_kill`
tuning -> targeted profiling), and a verification
protocol.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 13:56:51 -04:00
Gud Boi 0ef549fadb Add `tractor.trionics.patches` subpkg + first fix
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.

New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.

Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.

First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
  ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
  & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
  instance whose write-end is closed, and `drain()` then **spins forever
  in C with no Python checkpoints**.
- this burns 100% CPU and blocks signal delivery.

Standalone repro:

    from trio._core._wakeup_socketpair import WakeupSocketpair
    ws = WakeupSocketpair()
    ws.write_sock.close()
    ws.drain()  # spins forever

Patch is one-line — break the drain loop on b'' EOF.
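
For illustration, a standalone drain loop with the EOF check looks
roughly like this. It is a sketch on a plain `socket.socketpair()`
(POSIX assumed), with hypothetical function names; it is not the
monkey-patch shipped in the subpkg.

```python
import socket


def drain(sock: socket.socket) -> int:
    """Drain a non-blocking socket; return the number of bytes discarded."""
    total = 0
    try:
        while True:
            data = sock.recv(2**16)
            if not data:
                break  # b'' means the peer closed: stop instead of spinning
            total += len(data)
    except BlockingIOError:
        pass  # nothing left to read right now
    return total


def demo() -> tuple[int, int]:
    wakeup, write = socket.socketpair()
    wakeup.setblocking(False)
    write.send(b"xx")
    pending = drain(wakeup)  # exits via `BlockingIOError`
    write.close()
    at_eof = drain(wakeup)  # exits via the b'' check
    wakeup.close()
    return pending, at_eof
```

Without the `if not data` branch the second `drain()` call would never
return, since `recv()` on a peer-closed socket keeps yielding `b''`
rather than raising.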

Manifested as two distinct test failures:

- `tests/test_multi_program.py::test_register_duplicate_name` hung at
  100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
  deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
  both threads parked in `epoll_wait`, no TCP connect-back to parent
  ever happened.

Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).

Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.

Conc-anal write-ups, including strace + py-spy evidence:

- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`

Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.

TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the issue
      body's evidence section.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 12:18:03 -04:00
Gud Boi e9712dcaeb Add `tractor.spawn._reap.unlink_uds_bind_addrs()`
Inside a new `tractor.spawn._reap` submod which kicks off providing
post-mortem subactor cleanup primitives, parent-side; consider it the
"sibling" of `tractor._testing._reap` which is the test-harness-oriented
brother mod.

Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454
where `hard_kill()`'s `SIGKILL` bypasses the subactor's
`_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files..

This adds 2 cleanup paths:
- explicit `bind_addrs` (when set at spawn time),
OR
- convention-based reconstruction from `subactor.aid.name + proc.pid`
  for the random-self-assign case.

`.spawn.hard_kill()` now invokes the cleanup unconditionally
post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError`
skip.

Future work — authoritative tracking via a per-process
UDS bind-addr registry — documented in module docstring,
deferred to a follow-up PR.

Co-fix: `tractor/spawn/_trio.py::new_proc` already passes
`bind_addrs` + `subactor` to `hard_kill` via prior work
on this branch.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 11:13:59 -04:00
Gud Boi 5cf0312c78 Add per-test runaway-subactor CPU detector to `_reap`
New `find_runaway_subactors()` helper + autouse
`_detect_runaway_subactors_per_test` fixture that
samples `psutil.cpu_percent()` on descendants to
catch tight-loop bugs (e.g. #452-class `recvfrom`
on a closed socket). Checks both at setup
(leftovers from a prior hung test) and teardown
(spawned by this test).

Intentionally does NOT kill the runaway — emits
a loud warning with diag commands (`strace`,
`lsof`, `ss`, `kill`) so the pid stays alive for
hands-on investigation. The session-end reaper still
SIGINT/SIGKILLs survivors on normal exit.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 10:15:55 -04:00
Gud Boi 32e89c67ee Fix `maybe_override_capture` to not get invalid capX fixture names.. 2026-05-04 10:07:57 -04:00
Gud Boi d549c72052 Add fork-aware capture fixtures to `_testing.pytest`
Extend the pytest plugin with helpers that detect
and adapt to `--capture=sys` under fork-based
spawners (`main_thread_forkserver`, `mp_forkserver`)
where fd-capture causes hangs.

Deats,
- track `_cap_sys_passed_as_flag` + `_cap_fd_set`
  globals in `pytest_load_initial_conftests()`.
- add `@pytest.hookimpl(tryfirst=True)` + re-parse
  args after appending `--capture=sys`.
- `_is_forking_spawner()` predicate + fixture.
- `maybe_xfail_for_spawner()` — enables skipping tests that need capsys
  but weren't passed `--capture=sys`.
- `set_fork_aware_capture` fixture — returns the appropriate capture
  fixture per spawner backend based on `start_method: str` set via CLI.
- wire `set_fork_aware_capture` into `tractor_test`
  wrapper's fixture injection.

Also,
- add `alert_on_finish` session fixture (terminal
  bell on completion; tho not sure it works fully..)
- add `ids=` to `start_method` parametrize.
- restore `default=False` on `--enable-stackscope`.
- drop commented-out `--ll` option block; we will likely factor it to
  our plugin eventually however..

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-02 01:09:02 -04:00
Gud Boi 5a9926fc32 Adjust `test_shield_pause` for capsys backends
Under `main_thread_forkserver` the bootstrapping
hook switches to `--capture=sys`, so subactor
fd-level output (tree dumps, zombie-reaper msgs)
isn't captured per-test by pexpect. Gate those
expects behind a `no_capfd` check so the test
passes on both capture modes.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-01 19:08:55 -04:00
Gud Boi 72a0465c52 Default `--ll` to `None` in test harness
Only override `tractor.log._default_loglevel` when
the flag is explicitly passed — lets per-spawn and
per-example `loglevel` kwargs take effect instead
of being clobbered by the hard-coded `'ERROR'`
default.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-01 00:18:18 -04:00
Gud Boi 9431a81d37 Update debug examples + harden `test_debugger`
Pass explicit `loglevel` to `spawn()` calls in
`test_debugger` tests — required for pexpect
pattern matching now that examples no longer
hard-code log levels.

Also,
- make `expect()` return the decoded `before` str.
- add `start_method` param + fork-backend timeout
  slack (+4s) in nested-error test.
- clean up debug examples: drop unused loglevels,
  rename `n` -> `an`, fix docstrings, add TODO
  comments for tpt parametrize via osenv.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-01 00:13:22 -04:00
Gud Boi fc2e298a29 Update `sync_bp` + tighten `test_pause_from_sync`
Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.

Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
  no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
  / `|_<Thread` / `|_('subactor'` prefixes to match
  actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`,
  `tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 20:54:50 -04:00
Gud Boi 48523358cf Add `use_stackscope` runtime var for subactor init
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.

Deats,
- add `use_stackscope: bool` to `RuntimeVars`
  struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
  successful `stackscope` import, asserts unset
  on `ImportError`
- nest stackscope init under `_debug_mode` gate
  in `Actor.async_main`, check rtvar alongside
  env var
- defer `maybe_init_greenback` import to its own
  `use_greenback` branch

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 20:50:07 -04:00
Gud Boi e2b790a70d Fix `SIGUSR1` tree-dump ordering in `_stackscope`
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.

Also,
- gate file/tty sink writes behind `write_file` +
  `write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
  new sequential parent -> relay-log -> sub ordering.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:35:55 -04:00
Gud Boi 61d4525137 Add `pytest_load_initial_conftests()` for `--capture=`
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.

Also,
- load `tractor._testing.pytest` via `-p` in ini
  (bc bootstrapping hooks must register before
  conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
  tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
  per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
  and pop env var on disable.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:29:51 -04:00
Gud Boi 0996a83655 Add `--uds`/`--uds-only` flags to `tractor-reap`
Wire up `find_orphaned_uds()` + `reap_uds()` from
`_reap` as a new phase-3 UDS sweep in the CLI
script. Opt-in via `--uds` (run after proc reap +
shm) or `--uds-only` (skip other phases).

Also,
- consolidate skip-proc-reap logic into a single
  `skip_proc_reap` bool covering both `--shm-only`
  and `--uds-only`
- extend header docstring + usage examples

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:26:15 -04:00
Gud Boi 1cdc7fb302 Add UDS orphan-sweep helpers + reap fixtures to `_reap`
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:

- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
  `reap_uds()` — detect `<name>@<pid>.sock` files under
  `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
  the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
  lingering subactors, wait, SIGKILL survivors, then sweep orphaned
  UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
  snapshot sock-file dir before/after each test, warn + reap new
  orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
  with known-leaky teardown.
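
Detecting `<name>@<pid>.sock` files whose binder pid is dead can be
sketched as below. These are simplified, hypothetical re-implementations
for illustration (POSIX only); they do not model the registry sentinel
handling and are not the helpers in `_testing._reap`.

```python
import os
import re
from pathlib import Path

_SOCK_RE = re.compile(r"^(?P<name>.+)@(?P<pid>\d+)\.sock$")


def parse_uds_name(filename: str):
    """Split `<name>@<pid>.sock` into `(name, pid)`, else `None`."""
    match = _SOCK_RE.match(filename)
    if match is None:
        return None
    return match["name"], int(match["pid"])


def pid_is_alive(pid: int) -> bool:
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def find_orphaned_uds(uds_dir: Path) -> list[Path]:
    """List sock-files in `uds_dir` whose binder pid no longer exists."""
    orphans = []
    for path in sorted(uds_dir.glob("*.sock")):
        parsed = parse_uds_name(path.name)
        if parsed is not None and not pid_is_alive(parsed[1]):
            orphans.append(path)
    return orphans
```

Note that a pid-liveness check can be fooled by pid reuse, so a sweep
like this is best-effort rather than authoritative.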

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:21:02 -04:00
Gud Boi 486249d74f Allow per-call `start_method`/`loglevel` overrides
In `tests/devx/conftest.py::spawn`, refactor the
fixture-internal closures so consumer tests can pass
explicit `start_method`/`loglevel` to each `_spawn()`
invocation rather than only inheriting the fixture-
scoped parametrize values.

Deats,
- promote `set_spawn_method()` and `set_loglevel()`
  to take their respective values as fn params (vs
  closing over the fixture-scope vars).
- give `_spawn()` `start_method=start_method` and
  `loglevel: str|None = None` kwargs so callers
  override one-off without re-parametrizing the
  suite. NOTE: this drops the implicit fixture-
  scoped `loglevel` forward — `_spawn()` callers
  now must pass `loglevel=...` explicitly.
- TODO: figure out how `--ll <level>` should map to
  the default (currently `None` → uses env-var or
  tractor default).
- add a docstring to `_spawn()` so its role as the
  consumer-facing closure is obvious from `help()`.

Also,
- `assert_before()` now returns the `.before` output
  on success (was `None`); add a one-line docstring
  describing the new return contract.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 14:17:41 -04:00
Gud Boi 8bc304f094 TOSQUASH 2d4995e0, fix _pformat -> devx.pformat.. 2026-04-29 18:47:29 -04:00
Gud Boi fc5e80fea5 Drop subint-family gate from `main_thread_forkserver`
`main_thread_forkserver` doesn't actually need py3.14
`concurrent.interpreters` (PEP 734) — it forks from a
non-trio worker thread and runs `_trio_main` in the child,
same shape as `trio_proc`. The previous `_has_subints`
gate + subint-family `case` arm were a copy-paste error.

In `tractor.spawn._main_thread_forkserver`,
- drop the `_has_subints` import + the `RuntimeError`
  raise in `main_thread_forkserver_proc()`.
- drop the now-unused `import sys` (only used by the
  prior error msg).

In `tractor.spawn._spawn.try_set_start_method()`,
- pull `'main_thread_forkserver'` out of the subint-
  family arm (which still gates on `_has_subints`).
- merge it into the `'trio'` arm — both set `_ctx = None`
  bc neither needs an `mp.context`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 18:13:46 -04:00
Gud Boi b7115fc875 Drop test-local timeouts, +`sync_pause` to dev
In `pyproject.toml`,
- include the `sync_pause` group in `dev`, so dev
  installs ship `greenback` for `pause_from_sync()`.

Comment out per-test `@pytest.mark.timeout(...)`
markers in,
- `tests/devx/test_debugger.py`
- `tests/discovery/test_registrar.py`
- `tests/spawn/test_main_thread_forkserver.py`
- `tests/spawn/test_subint_cancellation.py`
- `tests/test_advanced_streaming.py`
- `tests/test_cancellation.py`

The global cap was already dropped (3c366cac); these
were the leftover per-test caps which now block
interactive `pdb` flows under the new spawn backends.

In `uv.lock`,
- pull `greenback` into the resolved `dev` deps
  (per the `sync_pause` include above).
- catch up the prior `xonsh` editable→PyPI switch
  (from the `pyproject.toml` `tool.uv.sources` edit).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 18:10:40 -04:00
Gud Boi 208e7c0926 Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.

In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
  `loglevel` and stashing an `env_ll_report` to emit once the
  console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
  `log.get_console_log()` init *up* so the env-var report
  routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
  `start_method` and warn-logging when the env-var clobbers
  an explicit caller value.

Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
  and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
  `pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
  `main_thread_forkserver` and `subint_forkserver` bc
  example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
  re-route a later in-process tractor test's spawn-backend
  or loglevel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 17:29:38 -04:00
Gud Boi 22cdf15b73 Flip back to default `pytest` capture for CI 2026-04-29 15:03:26 -04:00
Gud Boi 532a9834f3 Add posix-multithreaded-`fork()` explainer doc 2026-04-29 12:50:23 -04:00
Gud Boi 2917b74ba4 Add todo for running `test_debugger` suite on forkserver spawner 2026-04-29 12:49:36 -04:00
Gud Boi 2d4995e08d Route `stackscope` SIGUSR1 onto trio loop
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.

Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.

Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).

`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.

Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
  hint in the rendered dump header so users know where
  to find the artifact even when stdio is captured.
- swap the inline `f'     |_{actor}'` line for a
  `_pformat.nest_from_op` rendering of `actor_repr`
  (matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
  branches now note `(trio_token captured: <bool>)`
  so it's obvious from the log whether the full-tree
  path is wired.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 12:01:03 -04:00
Gud Boi 8c730193f9 Refine fork-survival docs + `EBADF` handling
Two cleanup tweaks in `_main_thread_forkserver`:

Doc, "what survives the fork?" section — expand the
"non-calling threads are gone in the child" claim with
the precise execution-vs-memory split that reconciles
this module's prior framing with trio's (canonical
[python-trio/trio#1614][trio-1614]) "leaked stacks"
framing:

- execution-side: only the calling thread runs
  post-fork; all others never execute another
  instruction.
- memory-side: those non-running threads' stacks +
  per-thread heap structures are still COW-inherited
  as orphaned bytes — what trio means by "leaked".

Same POSIX reality, opposite sides; the table is
extended to a 4-col `parent | child (executing) |
child (memory)` layout to make both views explicit.
Also blank-line-padded the bulleted hazard classes
for cleaner markdown rendering.

[trio-1614]: https://github.com/python-trio/trio/issues/1614

Code, `_close_inherited_fds()` log noise — split the
catch-all `except OSError` into:

- `EBADF` — benign race where the dirfd that
  `os.listdir('/proc/self/fd')` itself opened ends up
  in `candidates`, then auto-closes before the loop
  reaches it. Demote to `log.debug()` + `continue`;
  prior `log.exception` drowned the post-fork log
  channel with stack traces every spawn.
- other errnos (EIO / EPERM / EINTR / ...) keep the
  loud `log.exception` surface — those ARE genuinely
  unexpected.
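
The errno split above amounts to something like the following sketch. `close_fds()` is a hypothetical stand-in for the close loop inside `_close_inherited_fds()`, and the log messages are invented for illustration.

```python
import errno
import logging
import os
from collections.abc import Iterable

log = logging.getLogger('fd_close_sketch')


def close_fds(candidates: Iterable[int]) -> int:
    '''Close each fd, returning how many were actually closed.'''
    closed = 0
    for fd in candidates:
        try:
            os.close(fd)
        except OSError as err:
            if err.errno == errno.EBADF:
                # Benign: the fd was already closed before we got to it.
                log.debug('fd %d already closed, skipping', fd)
                continue
            # Anything else is genuinely unexpected, so keep it loud.
            log.exception('failed to close inherited fd %d', fd)
            continue
        closed += 1
    return closed
```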

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:34:33 -04:00
Gud Boi 5418f2dc3c Add `--enable-stackscope` pytest plugin flag
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.

Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
  * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
    inherit it via environ,
  * installs the handler in pytest itself via
    `enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
  existing `_debug_mode` gate to ALSO fire when
  `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
  install the same handler at runtime startup.

Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:

- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
  always written) — guaranteed-readable artifact even
  under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
  tty; ignored if device is missing.

Other,
- squelch the benign `RuntimeWarning` ("coroutine method
  'asend'/'athrow' was never awaited") from
  `stackscope._glue`'s import-time async-gen type
  introspection so `--enable-stackscope` setup stays
  quiet.
- log msg in the `_runtime` ImportError branch now
  mentions `--enable-stackscope` alongside debug-mode.

Usage,
  pytest --enable-stackscope -k <hang-test>
  # in another shell, find the pid + signal:
  kill -USR1 <pytest-or-subactor-pid>
  # tail the artifact:
  tail -f /tmp/tractor-stackscope-<pid>.log

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:32:23 -04:00
Gud Boi 383b0fdd75 Backend-aware `fail_after` in pub/sub test
Mirror `060f7d24`'s pattern (backend-aware timeout in
`maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard
`trio.fail_after` cap. Fork-based backends pay per-spawn
fork+IPC-handshake cost which stacks over `cpus - 1`
sequential `n.run_in_actor()` calls; empirically 12s
flakes on `main_thread_forkserver` under UDS
cross-pytest contention (#451 / #452).

Defaults:
- `main_thread_forkserver` → 30s
- everything else          → 12s (unchanged)

Hoist the timeout-pick out of the `main()` closure so the
dispatch happens once in the trio task rather than
re-evaluating per spawn.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:28:48 -04:00
Gud Boi 060f7d24c4 Backend-aware timeout in `maybe_expect_raises`
Change the default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).

Defaults:
- `main_thread_forkserver` → 30s
- everything else          → 3s (unchanged)

Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):

- 3s  → all-valid variant flaked w/ `TooSlowError`
- 8s  → `invalid-return` variant flaked w/ `Cancelled`
        (surfaced instead of `MsgTypeError` bc the
        outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention

30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:21:56 -04:00
Gud Boi 3c366cac13 Drop global `pytest-timeout` cap from `pyproject.toml`
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.

Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. A recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.

Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.

For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.

For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.

`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.

Also, bump `xonsh` to at least the `0.23.0` release.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-28 16:00:16 -04:00
Gud Boi f8178df0fd Return parent `pid: int` from new `reap_subactors_per_test` fixture 2026-04-27 23:27:19 -04:00