The `cpu_scaling_factor()` CI bump (2x) wasn't enough for the
macOS runners — slower + noisier than linux for our multi-actor
cancel-cascade timing — so `test_nested_multierrors[depth=3]` and
`test_fast_graceful_cancel_*` flaked there (linux + sdist green),
- make the CI headroom platform-aware: 3x on macOS, 2x on linux
(keeps the proven linux budget; depth=3 inner 12*3=36s still
fits under the 40s outer wall).
- give `test_fast_graceful_cancel_*` the `cpu_scaling_factor()`
headroom it was missing entirely (was a bare `timeout=2.9`).
Verified locally w/ `CI=1` (2x linux path; 3x macOS path confirmed
via forced `_non_linux`).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
All 5 flagged items were valid (4 real bugs + 1 dead assert),
- fix an inverted `sys.version_info < (3, 14)` guard in
`ipc._linux` — the "`cffi` has no 3.14 support" import note now
fires on 3.14+ (where it applies) instead of on older pys.
- use `os.environ.get('PYTHON_COLORS')` in the `sync_bp` example
so it doesn't `KeyError` when run outside the test harness.
- correct `dump_task_tree()`'s docstring: the `/tmp` + `/dev/tty`
tee is gated on `write_file`/`write_tty`, not "unconditional".
- tidy the `ActorTooSlowError` message spacing in `cancel_actor`.
- replace a tautological `applied is True or applied is False` in
`test_patches` with `isinstance(applied, bool)` (the value is
order-dependent across the module).
Review: PR #462 (copilot-pull-request-reviewer)
https://github.com/goodboy/tractor/pull/462#pullrequestreview-4527179852
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
GH Actions (and most shared) CI runners are slow + noisy and —
unlike a throttled local box — don't expose CPU-freq scaling via
sysfs, so `cpu_scaling_factor()` read `1.0` and the timing-
sensitive deadlines/asserts that key off it got NO headroom on
CI (a class of `TooSlowError` / `assert diff < this_fast` flakes),
- add a flat `_ci_env` x2 bump inside `cpu_scaling_factor()` so
every test already using it (quad streaming, SIGINT-cancel,
docs examples, ...) gets CI headroom for free — compounds with
any local-throttle factor.
- route the `time_quad_ex` cancel-deadline through it instead of
a bespoke per-test `ci_env` bump.
- fix a real bug in `test_nested_multierrors`: its OUTER
`@tractor_test(timeout=10)` was *smaller* than the inner
`fail_after_w_trace` budget (trio depth=3 = 12s), so the outer
wall fired first and pre-empted the snapshot-capturing inner
deadline -> `FAILED` instead of dumping. Bump the outer to `40`
(> max inner budget) and scale the inner budgets by
`cpu_scaling_factor()` too.
Verified locally with `CI=1` (quad + both `test_nested_multierrors`
depths + `test_cancel_via_SIGINT_other_task` green).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `TRACTOR_LOGLEVEL`/`TRACTOR_SPAWN_METHOD` override-notice
branches were unreachable: `loglevel`/`start_method` were
reassigned to the env value BEFORE the `!=` compare, so the
"OVERRIDES caller-passed" message never fired. Capture the
caller value first, then compare. Rel. `208e7c09`/`d4eac06d`
"Honor env-vars" (`trionics.start_or_cancel`); surfaced by
`/code-review high` on #462.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two defensive fixes around the `Portal.cancel_actor()` +
`_try_cancel_then_kill()` escalation from `34f333a0`
"Escalate cancel-ack timeouts to `proc.terminate()`" (the
`trionics.start_or_cancel` follow-up); surfaced by
`/code-review high` on #462,
- guard `proc.terminate()` for backends whose `proc` slot
isn't a `Process` — the future `subint` backend stores an
`int` interp-id, so escalation would `AttributeError`
instead of hard-killing; now it logs + no-ops.
- swap `assert cs.cancelled_caught` for an
`if cs.cancelled_caught and raise_on_timeout:` guard so an
unexpected shielded-scope exit returns a soft `False`
rather than crashing `cancel_actor()` mid-teardown.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two fixes to the hang-debug SIGUSR1 task-tree dump path,
surfaced by `/code-review high` on #462,
- re-add `_debug_mode` to the sub-actor handler-install gate
in `_runtime.py`. Dropping it (rel. `3a386ba5`/`3d9c75b6`
"Drop debug_mode gate", from the `custom_log_levels_api`
follow-up) was meant to *also* enable non-pdb runs, but
nothing sets `use_stackscope` from `debug_mode`, so
debug-mode subs were left with NO handler — and the default
SIGUSR1 disposition then *kills* them. Now additive:
`_debug_mode OR use_stackscope OR env`.
- pass `write_file=True` at both `dump_task_tree()` SIGUSR1
call sites so the advertised `/tmp/tractor-stackscope-<pid>`
`.log` tee is actually written (was dead under
`--capture=fd`). Matches `1b1ef10a` "Re-enable writing
`stackscope` to file by default"; param from `0df90500`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Left-over debug trap from the `_runtime_vars` pure get/set
refactor — it fired on *every* struct-form rt-var write (e.g.
via `.update()`), hanging any non-tty / CI / forked actor on
`pdb` stdin.
Surfaced by a `/code-review high` pass on #462.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Regenerate the lockfile so it's consistent with the
post-rebase `pyproject.toml` — which now carries both #461's
landed tooling (`pytest>=9.0.3`, …) and this branch's
tractor deps (`setproctitle`, `pytest-timeout`, `psutil`),
- `uv lock` resolves the merged dep set against the landed
`main` baseline.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`open_root_actor()` writes `_enable_tpts` (and friends) into
the process-global `_state._runtime_vars` dict but nothing
resets it on actor teardown. Under the in-proc `pytest`
launchpad a uds-using test leaks `_enable_tpts=['uds']` into
a sibling tcp test, tripping the
`registry_addrs`×`enable_transports` proto-guard in
`open_root_actor()` with a `ValueError`.
New `_reset_runtime_vars` fixture snapshots + restores the
dict around every test so no runtime-var state crosses a
test boundary.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Same trio 0.29 → 0.33 cancel-cascade slowdown that hit
`test_nested_multierrors` (ea67f1b6) — bumps the
`trio`-backend (non-debug, non-forking) budget in
`test_echoserver_detailed_mechanics` from 1s → 4s.
- The 1s budget raced the ~1s teardown deadline. On a
deadline-fire trio 0.33 injects
`Cancelled(source='deadline')` (cancel-reason
metadata) that wraps the mid-stream KBI in a
`BaseExceptionGroup`, breaking the bare
`pytest.raises(KeyboardInterrupt)` below.
- Bump matches the forking-spawner branch (4s).
- Inline NOTE references the tracking issue
`ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d7da502d93)
(cherry picked from commit d0144e52cb)
trio 0.29 → 0.33 lock bump (c7741bba) slowed the
depth=3 cancel-cascade in `test_nested_multierrors`
from <6s to ~7-8s; the 6s deadline was firing and its
`Cancelled(source='deadline')` (trio 0.33's new
cancel-reason metadata) collapsed a BEG branch,
breaking the `RemoteActorError` assertion downstream.
- Split the `('trio', _)` case-match into per-depth
arms: `('trio', 1)` keeps 6s (still finishes in
~3s); `('trio', 3)` → 12s.
- Updated inline NOTE explains the version pivot +
links the tracking issue
`ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
- Existing MTF/`subint_forkserver` budgets unchanged.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit ea67f1b67b)
(cherry picked from commit 57b3ea59ea)
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.
- detect a true leaf-mod by comparing the caller's `__name__`
vs `__package__` (a pkg `__init__` has them equal -> its
trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
from a bare `devx` -> `tractor.devx`; the old unconditional
`pkg_path = subpkg_path` collapsed both to `tractor.devx` and
silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
the leaf-mod is in the `{filename}` header field).
Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
addressable at ANY depth; leaf *modules* intentionally aren't
(they're the `{filename}`); top-level mods (eg. `to_asyncio`)
still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
new literal explicit-`name` contract (no leaf-collapse).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9c36363b01)
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:
Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
fixture, and `test_log` fixture out of the test-suite-
local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
`--ll`: `--ll` is the consuming-project's OWN app
loglevel (scoped to its pkg-hierarchy), `--tl` is the
`tractor.*` runtime loglevel. `--tl` falls back to
`--ll` when unset (preserves current `tractor`-suite
behavior).
- add `testing_pkg_name` session fixture (default
`'tractor'`) — downstream projects override to e.g.
`'modden'` so `--ll` scopes to their own hierarchy
instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
tractor-runtime level (passed to
`open_root_actor(loglevel=<.>)` by `@tractor_test`)
AND separately applies `--ll` to the
`testing_pkg_name` hierarchy when that isn't
`tractor`. `test_log` scopes the per-test logger to
`testing_pkg_name`.
`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
- `True` → enable `pkg_name` root at `default_level`
(fallback `'cancel'`).
- `False` → no-op.
- bare level eg. `'info'` → root-logger at that level.
- `'sub:info,x:cancel'` → per-sub-logger filter-spec;
each `<name>` is RELATIVE to `pkg_name` (must NOT
include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
`None` key = root-logger. Mixed bare-level + filters
in one spec is rejected w/ a helpful err msg; so is
embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
parses then enables a `colorlog` stderr handler per
named (sub)logger. Authoritative sub-logger filters
get `propagate=False` so they don't double-emit
through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
sub-pkg granularity, not leaf-module — so `devx.debug`
collapses to the same `tractor.devx` logger as a bare
`devx`, and top-level lib modules (eg.
`tractor.to_asyncio`) emit under the *root* logger
rather than a phantom `to_asyncio` child. Documented
inline on `LogSpec`.
Other,
- `tests/conftest.py` keeps a NOTE pointing to the
plugin for future-debugging clarity (don't remove
silently — the lift is the relevant signal).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 19a77708ba)
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
`trio_err.__cause__` back onto its own derivative relay.
`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
`aio_task.set_exception()` which on py3.13+ ALWAYS raises
`RuntimeError("Task does not support set_exception")` (dead code as
a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
flip `wait_on_aio_task` when delivery failed (avoids hanging on
`_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
checkpoint between capturing `_fut_waiter` and invoking the helper.
`Task._wakeup` clears `_fut_waiter = None` so re-reading
post-checkpoint loses the ref even though the exc is still in-flight
on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
a "trio_err -> caused -> relay" chain through `set_exception()`
→ `Task._wakeup` → coro raise → `wait_on_coro_final_result`
→ `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
narrow case where `_fut_waiter is None` AND task is runnable (sitting
in asyncio's ready queue, not parked on a poke-able future). NEVER
cancels when `_fut_waiter` carries an in-flight exc — that would race
+ mask the real terminating exc.
`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
/ `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
`trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
— tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
exactly one terminal `raise X [from Y]` or early `return`. Covers the
sibling `signal_trio_when_done()` resolution + the relay-echo
INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
(`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
trio_err`, raise the bare `trio_err` instead of `trio_err from
aio_err` (which would CYCLE the cause chain since the relay was itself
caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
EXCEPT or this WILL HANG" warning — the helper handles the failure
path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
shows the real root.
`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
+ `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
`cause=` — non-`None` only when the `try` body raised (graceful exit
→ None).
Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
trio_done_fute` next to the shielded version — exploratory note for
the longstanding "why is `.shield()` needed?" TODO.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit acd1cbeec4)
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).
Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
conjunction around the `enable_stack_on_sig()` call;
now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
`use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
of the `if rvs['_debug_mode']:` block to peer-level.
The `use_greenback` branch stays inside `_debug_mode`
(pdb-specific).
- Refresh inline comments on both sites to call out the
infected-`asyncio` "default SIGUSR1 = terminate proc"
rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3d9c75b6ed)
Adopt the `_testing.trace` CM helpers in two MTF-hang-prone
tests so on-timeout we get a fresh
`ptree`/`wchan`/`py-spy` diag snapshot on disk instead of
opaque pytest timeout-kills. Same shape as bd07a95d for
`test_dynamic_pub_sub`.
Deats,
- `test_echoserver_detailed_mechanics`:
* inner `trio.fail_after` → `fail_after_w_trace`. Adds
`fail_after_w_trace: FailAfterWTraceFactory` fixture
param.
* mv per-backend `timeout` calc to top of test body (was
interleaved w/ helper defs).
* factor deep
`open_nursery`/`open_context`/`open_stream` body into
`_body()` so the wrapping `main()` stays a 2-liner —
keeps the nested-CM block at its natural indent level
instead of pushing it under yet another `async with`.
* drop `with_timeout: bool` knob + `fa_main()` helper
(knob was hard-coded `True`).
- `test_sigint_closes_lifetime_stack`:
* outer `signal.alarm`/`try`/`finally` → single
`afk_alarm_w_trace(10)` CM. Adds
`afk_alarm_w_trace: AfkAlarmWTraceFactory` fixture
param.
* drop `_AFK_CAP_S` + `armed_alarm` vars (CM owns both).
* explanatory comment refreshed to mention
`AFKAlarmTimeout` + the disk-snapshot side effect.
Other,
- Drop debug `return 1e3` short-circuit from `delay()`
fixture — snuck in as a scratch line, was clobbering the
proper `debug_mode`-branched return.
- Top-level import: `FailAfterWTraceFactory`,
`AfkAlarmWTraceFactory` from `tractor._testing.trace`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1cafaecf52)
Per-terminal optimized `watch`-like xonsh alias that
runs an arbitrary callable alias in a loop inside the
alt-screen buffer with flicker-free repaint. Supersedes
the inline `acli.ptree` polling .xsh snippet (removed
from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).
Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
(`\033[?25l/h`) wrapped in try/finally so Ctrl-C always
returns to a pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line
EL (`\033[K` before each `\n`) + post-draw erase-down
(`\033[J`) → stale tail chars from a longer prior
frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next
frame does a full clear (`\033[H\033[2J`) instead of
the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler`
so `KeyboardInterrupt` lands cleanly; prior handler
restored on exit.
- Output capture: redirect the alias's stdout to
`StringIO` per frame so we can post-process the EL
fix. Aliases writing directly to `sys.stdout.buffer`
/ `os.write(1)` bypass capture — EL-fix won't apply
but loop still works.
- Alias unwrap: xonsh stores callables as either a bare
callable OR `[fn, *preset_args]`. Both forms handled;
subprocess-style aliases rejected w/ a friendly err
msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest
of argv forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.
Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch`
entry.
- Top-level imports: `os`, `sys`, `signal`, `time`,
`typing.Callable`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit bb239e847f)
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).
Previously, procs whose `ppid` points to some OTHER live non-`pytest`
(in the use of `acli.ptree pytest`) process belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon) and were being falsely flagged as cross-test ghosts.
Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
filter rule + its rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit a6d4ac3aac)
Replace inline `trio.fail_after` + manual `signal.alarm` guard with the
`_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy
diag snapshot to disk on timeout.
Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync
CM, captures on `SIGALRM`), only armed under fork backends.
Extracts `_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes
while diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
`afk_alarm_w_trace`).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit bd07a95d80)
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
`track_orphaned_uds_per_test`,
`detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
MTF `xfail(strict=False)` with detailed race-window comment
explaining the BEG shape mismatch, wrap body in
`fail_after_w_trace` with per-backend timeout budget, bump
`@tractor_test(timeout=10)`, drop old multiprocessing depth
special-casing
- `test_multierror_fast_nursery`: wrap in
`fail_after_w_trace(30.0)`, accept `TooSlowError` in
`pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
`spawn_backend` param for `is_forking_spawner`, widen
`fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
`test_cancel_infinite_streamer`, `test_some_cancels_all`: add
`set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
covered by module-level `pytestmark`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 32955db02e)
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 01ce2857ea)
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).
Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8de684f5de)
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
populated by `_do_capture_snapshot()` on each successful dump;
add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
prints all captured snapshot dirs at end-of-session so paths
don't get buried in scrollback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit fb87c36263)
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
`tractor._child` procs NOT in the walk's `seen` set — surfaces
ghost subactor trees from prior test runs (cross-test launchpad
contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
into `_classify_walk()` closure, walk stray roots as additional
trees, emit stray-root summary in header. Also: `tractor._child`
procs reparented to init are now always classified as orphans
regardless of cgroup-slice (leaked subactor ≠ desktop-launched
app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
`--capture=sys` redirection so snapshot paths always land on the
real terminal
- `fail_after_w_trace()`: capture diag snapshot on
non-`TooSlowError` exceptions when the `fail_after` scope's
cancel had already fired (e.g. nursery wraps `Cancelled` into a
`BaseExceptionGroup` that escapes before `TooSlowError` can be
raised).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3a243a1fd4)
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.
Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
hung-state dumper, bindspace scanner, `dump_all()` snapshot
archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
(trio `fail_after` + auto-snapshot on `TooSlowError`),
`afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
`SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
`_testing.trace` — drops ~627 lines of inline impl. Add
`acli.dump_all` alias for full snapshot-bundle CLI access.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7509e313ff)
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner`
param, wrap `main()` in `fa_main()` with per-backend
`trio.fail_after` (4s fork / 1s trio) to cap cancel-cascade
teardown that compounds under forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param
for `is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so
KBI firing before `open_context` body doesn't `UnboundLocalError`,
add `pytest.fail` guard for the spawn-time IPC race case, arm
`signal.alarm` AFK-safety cap (10s) under fork backends
Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
`detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
place).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7ee0dc2e8f)
For forking spawner backends that is.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b10011a36e)
- `skipon_spawn_backend('subint')`: expand reason with specific
analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
blame-attribute UDS sock-file orphans left by SIGKILL cancel
cascades.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7d0a53d205)
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
`trio.fail_after(2)` + commented `capfd.disabled()` investigation
(pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on fork-spawner
hang prevention.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8aa07a7932)
Outer `signal.alarm` cap that fires even when trio's
`fail_after` is blocked by a shielded-await deadlock
(the bug-class-3 hang under MTF backends). Only armed
for fork-based spawners where the bug lives.
Deats,
- `_DIAG_CAP_S = fail_after_s + 5` — slightly larger than the
trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).
Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
TODO re `pytest.raises()`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 10db117864)
Split the old `live`/`orphans` sock classification
into three ppid-aware buckets: `live-active` (PID
alive, parent owns it), `orphaned-alive` (PID alive
but `ppid==1`, init-adopted — `acli.reap` candidate),
and `orphaned-dead` (PID gone, sock stale).
Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
PID, handles the tricky `(comm)` field (can contain spaces/parens) by
splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both
alive-orphan graceful cancel + dead-sock cleanup
in one shot; manual `rm` kept as fallback.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9bbb6f796b)
New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md`
documenting the spawn-time rc=2 race under rapid
same-name spawning against a forkserver + registrar
— the `wait_for_peer_or_proc_death` helper now surfaces
the death instead of parking forever on the handshake
wait.
Also,
- extract inline `xfail` into module-level
`_DOGGY_BOOT_RACE_XFAIL` marker.
- apply it to `n_dups=8` too (previously bare) bc
larger N widens the race window enough to fire
occasionally.
- link to tracking issue #456.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 92443dc4ef)
Forking spawner + UDS transport has different timing
vs `trio_proc` — streaming example completes faster
in some cases, slower in others depending on fork
overhead + sock setup.
Deats,
- add `expect_cancel` param to `cancel_after()`, raise
`ActorTooSlowError` when cancel scope fires unexpectedly instead of
silently returning `None`.
- `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit
`ActorTooSlowError` on `None` result instead of bare `assert results`.
- `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast"
(cancel doesn't fire bc streaming finishes before delay).
- add `is_forking_spawner`, `tpt_proto` fixture params throughout.
Also,
- `_testing/pytest.py`: widen `start_method` parametrize and
`is_forking_spawner` fixture to `scope='session'`.
- `"""` -> `'''` docstring style throughout.
- hoist `_non_linux` to module scope (was redefined locally in two
places).
- type hints, kwarg-style `partial()` calls.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d3cbc92751)
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>` — useful for projects like
`piker` that bind sibling sub-dirs alongside tractor's
default. Full paths still work as-is.
Also,
- rename "unparseable" section to "non-tractor" with
clearer desc (filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for
non-tractor socks so callers can manually resolve
listener-PID liveness
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 099104e0af)
Add module-level `pytestmark` applying per-test
`reap_subactors_per_test`, `track_orphaned_uds_per_test`, and
`detect_runaway_subactors_per_test` fixtures — registrar tests stress
discovery roundtrips that historically left orphaned UDS sock-files.
Deats,
- drop unused `say_hello()` fn, keep only `say_hello_use_wait`;
rename param `func` -> `ria_fn`.
- use `@tractor_test(timeout=7)` instead of separate
`@pytest.mark.timeout(7, method='thread')` decorator.
- add `with_timeout()` helper, wire into
`test_subactors_unregister_on_cancel_remote_daemon`.
- uncomment `_timeout_main()` in `test_stale_entry_is_deleted`, use
configurable `timeout` var + `debug_mode` guard for `tractor.pause()`
on cancel.
- `dump_on_hang(seconds=timeout*2)` instead of hardcoded `20`.
- fix typo "oustanding" -> "outstanding".
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit abd3950ba6)
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.
Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
`tractor._child` markers, falls back to `/proc/<pid>/comm` for
zombie-resilient detection (kernel preserves `comm` past exit
until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
— the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
`/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
`_is_tractor_subactor()` for intrinsic sub-actor ID instead of
cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
zombie-state sub-actors
Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
distinguish `system.slice` services vs `user.slice` desktop apps
from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
`user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
tractor importable from active venv
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 522b57570b)
New `tractor.devx._proctitle` mod sets each
sub-actor's `argv[0]` (and kernel `comm`) to
`tractor[<aid.reprol()>]` — e.g.
`tractor[doggy@1027301b]` — so `ps`/`top`/`htop`
and `acli.pytree`/reaper tooling can identify
actors at a glance without parsing full cmdlines.
Deats,
- `set_actor_proctitle()` wraps the `setproctitle`
pkg with `ImportError` guard; optional at runtime
but listed in `pyproject.toml` so default installs
benefit.
- called early in `_child._actor_child_main()` after
`Actor` construction, before `_trio_main()` entry.
- tests in `tests/devx/test_proctitle.py`: format
unit test, `/proc/{cmdline,comm}` integration
test, negative detection test.
Resolves#457
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d60245777e)
Extend `test_register_duplicate_name` w/ cancel-level log
breadcrumbs and `try/finally` for better diag on the cancel-cascade
hang.
Add `test_dup_name_cancel_cascade_escalates_to_hard_kill` as a
regression test for the TCP+MTF duplicate-name cancel-cascade
deadlock. Spawns N same-name actors, calls `an.cancel()`, and
asserts teardown completes within a `trio.fail_after()` budget that
scales w/ `n_dups`.
Deats,
- parametrize `n_dups` (2, 4, 8) to widen the race window for
concurrent `register_actor` RPCs.
- `n_dups=4` xfail'd — exposes a separate boot-race bug (doggy
`rc=2` under rapid same-name spawn), tracked in #456.
- post-teardown asserts all `Portal` chans disconnect, verifying
hard-kill escalation worked.
Relates to https://github.com/goodboy/tractor/issues/456
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit caebf60f4e)
Group all xontrib aliases under an `acli.` prefix
so xonsh prefix-completion treats them as a sub-cmd
group — `acli.<TAB>` lists the full set. No parent
`acli` cmd exists; the dot is purely naming.
Renames (incl `-` -> `_` in suffixes for shell-
identifier-friendliness):
- `pytree` -> `acli.pytree`
- `hung-dump` -> `acli.hung_dump`
- `bindspace-scan` -> `acli.bindspace_scan`
Add new `acli.reap` wrapping `scripts/tractor-reap`:
Deats,
- 3 opt-in phases via flags:
1. process reap — `find_orphans()` (default,
PPid=1 + cwd=repo + cmdline `python`) or
`find_descendants(--parent PID)`. SIGINT
first, SIGKILL after `--grace` (def 3.0s).
2. `/dev/shm` sweep (`--shm`/`--shm-only`) —
`find_orphaned_shm()` + `reap_shm()`. needed
bc `tractor` disables `mp.resource_tracker`.
3. UDS sock-file sweep (`--uds`/`--uds-only`) —
`find_orphaned_uds()` + `reap_uds()` for stale
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock`
entries. See #452.
- `--dry-run` lists matches without signalling/
unlinking; survivor pids or sweep errors flip
the alias rc to `1`.
- lazy-imports `tractor._testing._reap` after
`git rev-parse --show-toplevel` (with
`Path(__file__).parent.parent` fallback) so the
contrib is loadable before the venv is on
`sys.path`.
- `argparse.SystemExit` on `-h`/bad-args is
caught + returned as the alias rc instead of
killing xonsh.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit cec6cc2a56)
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:
graceful cancel-req -> bounded wait -> hard-kill
Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
of the legacy DEBUG-log + return-`False` path. Default stays `False`
for callers that handle their own escalation (e.g.
`_spawn.soft_kill()` polling `proc.poll()`).
- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
(SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
waiting on `proc.poll()`.
- replace `tn.start_soon(portal.cancel_actor)` in
`ActorNursery.cancel()` with the helper.
Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
opt-in).
ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.
Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 34f333a026)
Distinct from `trio.TooSlowError` so that existing
`except trio.TooSlowError:` blocks don't silently
mask actor-cancel timeouts — these must propagate
to let a supervisor escalate to
`proc.terminate()` per SC-discipline:
graceful cancel-req -> bounded wait -> hard-kill
Motivated by #subint_forkserver dup-name hang
where `Portal.cancel_actor()` silently swallowed
the timeout and the supervisor never escalated,
leaving a same-named sibling subactor parked
forever.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 38ffb875bd)
Pre-compute `mismatch_lines` str instead of `+`-concat
inside the f-string raise site; slightly easier to read
and avoids the `+ '\n\n'` continuation.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5cd06810db)
Refactor `pytest_load_initial_conftests()` to split
the fork-spawn × capture-mode check into two policies:
- CI (`CI` env-var set): `pytest.exit(rc=2)` on
mismatch — forces every matrix-row to declare
`--capture=sys` explicitly.
- local: `warnings.warn()` + continue — lets devs
experiment with `--capture=fd` to validate fixes.
Deats,
- drop `_cap_fd_set` global; add
`_CAPSYS_REQUIRED_SPAWNERS` frozenset for the
spawner-name lookup
- move inline comment wall → proper docstring w/
Background, Trade-off, Validation-policy sections
- `maybe_xfail_for_spawner()` now takes
`request: pytest.FixtureRequest` and reads
`request.config.option.capture` instead of the
`_cap_sys_passed_as_flag` global
- recognize `tee-sys` as fork-safe (only `fd`-level
capture deadlocks)
- `set_fork_aware_capture()` returns the actual
capture mode str from config, not a hardcoded
`'sys'`
- lift `import warnings` to module level (was duped
inside `pytest_configure`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 255c9c3a7c)
Extend `pytree` with two usability improvements:
- `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at
the top preserving contiguous parent-child shape (no
severity-grouping), so the full tree structure is visible without
cross-ref'ing between severity buckets.
- Cross-bucket parent annotation: when a row's parent (by ppid) lives in
a *different* severity bucket, suffix with `[parent: <pid> (in
`<bucket>`)]` so the `└─` marker resolves even when bucketing scatters
parent/child into separate sections.
Also,
- split arg parsing into flag vs positional args.
- add `pid_to_bucket` dict + `walk_order` list to back both features
- rename inner `ppid` shadow to `ppid_str` to avoid collision with the
outer `ppid` variable.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0f4e671862)
Raise `ValueError` from `open_root_actor()` when any
`registry_addrs` entry uses a transport proto not in
`enable_transports` — historically this caused a
silent indefinite hang during the registrar handshake
(the actor could never connect to register/discover).
Also,
- update `test_root_passes_tpt_to_sub` to detect a
proto mismatch between parametrized `tpt_proto_key`
and CLI `tpt_proto`, asserting the new guard raises
`ValueError` with expected msg content.
- replace old commented-out notes with a clearer
explanation of the mismatch foot-gun.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d036ef7d7f)
Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError`
guard — under concurrent pytest sessions the sock-file can already be
reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before
`_shutdown.set()`, deadlocking `wait_for_shutdown()` on
`actor.cancel()`.
Also,
- close each endpoint independently in the finally so one raise doesn't
strand the rest.
- always signal `_shutdown.set()` regardless of remaining ep count.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 2ee44a6fdd)