Compare commits

...

10 Commits

Author SHA1 Message Date
Gud Boi 79dda4cb4a Mk `test_no_runtime()` not require `pytest-trio` 2026-05-13 20:43:22 -04:00
Gud Boi bd07a95d80 Use trace CM helpers in `test_dynamic_pub_sub`
Replace inline `trio.fail_after` + manual `signal.alarm` guard with the
`_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy
diag snapshot to disk on timeout.

Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
  captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync
  CM, captures on `SIGALRM`), only armed under fork backends.
  Extracts `_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes
  while diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
  `afk_alarm_w_trace`).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 20:39:37 -04:00
Gud Boi 32955db02e Harden `test_cancellation` for fork-spawner backends
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
  conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
  `track_orphaned_uds_per_test`,
  `detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
  MTF `xfail(strict=False)` with detailed race-window comment
  explaining the BEG shape mismatch, wrap body in
  `fail_after_w_trace` with per-backend timeout budget, bump
  `@tractor_test(timeout=10)`, drop old multiprocessing depth
  special-casing
- `test_multierror_fast_nursery`: wrap in
  `fail_after_w_trace(30.0)`, accept `TooSlowError` in
  `pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
  `spawn_backend` param for `is_forking_spawner`, widen
  `fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
  `test_cancel_infinite_streamer`, `test_some_cancels_all`: add
  `set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
  covered by module-level `pytestmark`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 20:10:02 -04:00
Gud Boi 5372fd178a Add snapshot evidence to cancel-cascade MTF issue doc
Append "Snapshot evidence (2026-05-13)" section to
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`
documenting `fail_after_w_trace` diag capture results for
`test_nested_multierrors` under the MTF backend — reproduction cmd,
ptree analysis, observed hang signature, and updated triage plan.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 20:02:02 -04:00
Gud Boi 01ce2857ea Add init-adopted orphan reap to `reap_subactors_per_test`
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 19:59:36 -04:00
Gud Boi 8de684f5de Add subtree-walk to `reap()` for full actor-tree teardown
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).

Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 19:53:25 -04:00
Gud Boi fb87c36263 Add hang-snapshot session index to pytest summary
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
  populated by `_do_capture_snapshot()` on each successful dump;
  add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
  prints all captured snapshot dirs at end-of-session so paths
  don't get buried in scrollback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 19:00:18 -04:00
Gud Boi e329c3108c Bump to latest `pytest` release! 2026-05-13 18:47:19 -04:00
Gud Boi 3a243a1fd4 Add stray-proc scan + refine `_testing.trace` capture
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
  `tractor._child` procs NOT in the walk's `seen` set — surfaces
  ghost subactor trees from prior test runs (cross-test launchpad
  contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
  into `_classify_walk()` closure, walk stray roots as additional
  trees, emit stray-root summary in header. Also: `tractor._child`
  procs reparented to init are now always classified as orphans
  regardless of cgroup-slice (leaked subactor ≠ desktop-launched
  app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
  `--capture=sys` redirection so snapshot paths always land on the
  real terminal
- `fail_after_w_trace()`: capture diag snapshot on
  non-`TooSlowError` exceptions when the `fail_after` scope's
  cancel had already fired (e.g. nursery wraps `Cancelled` into a
  `BaseExceptionGroup` that escapes before `TooSlowError` can be
  raised).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 18:46:04 -04:00
Gud Boi 7509e313ff Mv core impl `tractor_diag.xsh` to `_testing.trace`
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.

Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
  hung-state dumper, bindspace scanner, `dump_all()` snapshot
  archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
  (trio `fail_after` + auto-snapshot on `TooSlowError`),
  `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
  `SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
  import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
  `_testing.trace` — drops ~627 lines of inline impl. Add
  `acli.dump_all` alias for full snapshot-bundle CLI access.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 16:47:17 -04:00
10 changed files with 1930 additions and 742 deletions

View File

@ -182,6 +182,118 @@ Target: 0 failures across 5 runs ⇒ ship. 12 failures
still rotating ⇒ apply (C). Same test failing twice still rotating ⇒ apply (C). Same test failing twice
⇒ escalate to (E). ⇒ escalate to (E).
## Snapshot evidence (2026-05-13)
After landing the `fail_after_w_trace` /
`afk_alarm_w_trace` capture-on-timeout helpers
(`tractor._testing.trace`), `test_nested_multierrors`
on the `main_thread_forkserver` backend produces
**reproducible diag snapshots** at
`$XDG_CACHE_HOME/tractor/hung-dumps/test_nested_multierrors_start_method_main_thread_forkserver__<iso-ts>/`.
### Reproduction
```bash
pytest \
-v --verbose --durations=10 \
--spawn-backend=main_thread_forkserver \
--tpt-proto=uds \
--capture=sys --show-capture=stderr -rxX \
tests/test_cancellation.py::test_nested_multierrors
```
The test is `xfail(strict=False)` for MTF — it RUNS
each invocation so snapshots accumulate, but doesn't
break `--lf` workflow.
### Consistent shape across runs
5+ snapshots taken back-to-back show the SAME pattern:
- **Timing:** ~10s wall-clock total. Inner
`fail_after_w_trace(10)` fires at exactly T=10s;
cascade's `nursery.__aexit__` takes ~0.6s more to
gather + propagate the resulting
`BaseExceptionGroup`. **Trio backend completes the
SAME test in <6s** so the MTF cascade is ~2x
slower at minimum.
- **`BaseExceptionGroup` shape:** mixed
`[RemoteActorError, Cancelled]`. The first
subactor's natural error-propagation (`assert 0`
raised → `RemoteActorError` portal-result)
arrives before T=10s; the OTHER subactor's
portal-wait is still in flight at T=10s, gets
cancelled by `fail_after_w_trace`'s scope-cancel
→ returns `Cancelled` instead.
- **Orphan-spawn skew:** snapshot's `orphans` bucket
(after the `_is_tractor_subactor` cgroup-slice
override fix) consistently shows 2-4 init-adopted
procs at `depth_3` and `depth_1` levels — these
are the leaves whose parent (`depth_2` spawner)
was killed mid-cascade but who hadn't yet seen
the cancel signal themselves.
- **UDS sock-leak:** 2-6 dead-orphan socks per run
(varies with cascade timing). The
`track_orphaned_uds_per_test` fixture reaps them
post-test → contamination is isolated per-invocation.
### Capture mechanism
`fail_after_w_trace` covers two firing paths:
1. **`trio.TooSlowError`** raised at scope-exit
(body returned cleanly past deadline) — direct
`except` handler captures.
2. **Scope-cancel + body raises non-`Cancelled` exc**
(e.g. `nursery.__aexit__` wraps timeout-induced
`Cancelled` into a `BaseExceptionGroup` that
escapes before `trio.fail_after`'s exit-check
could fire `TooSlowError`) — body-raise `except`
handler checks `scope.cancel_called` and
captures if True. This path catches the
`test_nested_multierrors` shape specifically (see
"BaseExceptionGroup shape" above).
The snapshot dir contains:
- `trace.txt``ptree` + `hung_state` (kernel
`wchan`/`stack` + `py-spy dump --locals` when
sudo cached), with `include_strays=True`
surfacing any cross-test ghost subactor trees in
the `orphans` bucket.
- `bindspace.txt` — UDS bindspace classification
(live-active / orphaned-alive / orphaned-dead).
- `meta.json``{pid, label, captured_at, sudo_cached}`.
The end-of-session `pytest_terminal_summary` hook
in `tractor._testing.pytest` lists every snapshot
dir from the run so you don't have to scroll back
through captured-stderr lines:
```
========================= tractor hang-snapshot index ==========================
N `fail_after_w_trace` / `afk_alarm_w_trace` snapshot(s) captured this session:
<test-id>
→ /home/.../.cache/tractor/hung-dumps/<label>__<ts>
```
### Caveats
The snapshot fires AFTER the body-raise (not at the
exact moment of scope-cancel), so the parent's
py-spy frames show `_do_capture_snapshot` itself
running, NOT the cancel-cascade hang frame. To see
the actual hang state, manual `acli.ptree` /
`acli.hung_dump` from a second terminal at T=10s
would be needed — **not currently possible**
because per-test reaper fixtures clean up ~0.6s
post-timeout. See follow-up TODO in
`tractor/_testing/trace.py` for a
`TRACTOR_TRACE_HOLD=1` env-var pause mode.
## See also ## See also
- [#452](https://github.com/goodboy/tractor/issues/452) — - [#452](https://github.com/goodboy/tractor/issues/452) —

View File

@ -83,7 +83,13 @@ testing = [
# test suite # test suite
# TODO: maybe some of these layout choices? # TODO: maybe some of these layout choices?
# https://docs.pytest.org/en/8.0.x/explanation/goodpractices.html#choosing-a-test-layout-import-rules # https://docs.pytest.org/en/8.0.x/explanation/goodpractices.html#choosing-a-test-layout-import-rules
"pytest>=8.3.5", # bumped 8.3.5 → 9.0 per upstream security advisory + our
# local-only reliance on the post-9.0 capture-machinery shape
# (the `sys.__stderr__`-bypass print in
# `tractor._testing.trace._do_capture_snapshot` works on 8.x
# too, but standardizing on 9.x here ensures `--show-capture`
# interactions stay predictable across dev installs).
"pytest>=9.0",
"pexpect>=4.9.0,<5", "pexpect>=4.9.0,<5",
# per-test wall-clock bound (used via # per-test wall-clock bound (used via
# `@pytest.mark.timeout(..., method='thread')` on the # `@pytest.mark.timeout(..., method='thread')` on the

View File

@ -10,6 +10,10 @@ from typing import Type
import pytest import pytest
import trio import trio
import tractor import tractor
from tractor._testing.trace import (
AfkAlarmWTraceFactory,
FailAfterWTraceFactory,
)
def is_win(): def is_win():
@ -150,6 +154,9 @@ def test_dynamic_pub_sub(
is_forking_spawner: bool, is_forking_spawner: bool,
set_fork_aware_capture, set_fork_aware_capture,
fail_after_w_trace: FailAfterWTraceFactory,
afk_alarm_w_trace: AfkAlarmWTraceFactory,
): ):
failed_to_raise_report: str = ( failed_to_raise_report: str = (
f'Never got a {expect_cancel_exc!r} ??' f'Never got a {expect_cancel_exc!r} ??'
@ -167,42 +174,36 @@ def test_dynamic_pub_sub(
# a per-spawn cost (forkserver round-trip + IPC peer-handshake) # a per-spawn cost (forkserver round-trip + IPC peer-handshake)
# that can stack up over `cpus - 1` sequential `n.run_in_actor()` # that can stack up over `cpus - 1` sequential `n.run_in_actor()`
# calls — especially on UDS under cross-pytest contention # calls — especially on UDS under cross-pytest contention
# (#451 / #452). Empirically a flat 15s flakes on # (#451 / #452). 4s was flaking right at the edge under fork
# `main_thread_forkserver` for many-cpu hosts (a single bad # backends — bumped to 8s with diag-snapshot-on-timeout via
# spawn-stack puts total run-time at ~15.5s, just over); # `fail_after_w_trace` so a borderline run still fails loud
# 30s gives plenty of headroom while still failing-loud on # but lands a ptree/wchan/py-spy dump in
# a real hang. # `$XDG_CACHE_HOME/tractor/hung-dumps/` for inspection.
# #
# XXX caveat: this is an *inner* `trio.fail_after` — its # XXX caveat: this is an *inner* trio cancel — its `Cancelled`
# `Cancelled` cannot reach a task parked in a shielded `await` # cannot reach a task parked in a shielded `await` (e.g. inside
# (e.g. inside actor-nursery teardown). When the in-band cancel # actor-nursery teardown). When the in-band cancel path is
# path is itself buggy (the bug-class-3 `raise KBI` swallow we're # itself buggy (the bug-class-3 `raise KBI` swallow we're
# currently chasing) this guard does NOT fire and the test sits # currently chasing) this guard does NOT fire and the test
# forever until external SIGINT. The `_DIAG_CAP_S` outer guard # sits forever until external SIGINT. The `afk_alarm_w_trace`
# below is the AFK-safety counterpart. # outer guard below is the AFK-safety counterpart (SIGALRM
# raises in the main thread regardless of trio scope state).
fail_after_s: int = ( fail_after_s: int = (
4 8
if is_forking_spawner if is_forking_spawner
else 12 else 20
) )
# outer guard: when the inner fail_after fails to fire because of
# a shielded-await deadlock, this cap *aborts the trio run via
# signal.alarm → KBI* so AFK runs don't sit for >20min on the
# bug-class-3 hang. Slightly larger than `fail_after_s` so the
# trio-native path always wins when it works.
_DIAG_CAP_S: int = fail_after_s + 5
async def main(): async def main():
# bug-class-3 breadcrumb: tag each level of the cancel path # bug-class-3 breadcrumb: tag each level of the cancel path
# so when the run hangs and we capture cancel-level logs, the # so when the run hangs and we capture cancel-level logs, the
# *last* breadcrumb that fired names the swallow point. # *last* breadcrumb that fired names the swallow point.
test_log.cancel('test_dynamic_pub_sub: enter main()') test_log.cancel('test_dynamic_pub_sub: enter main()')
try: try:
with trio.fail_after(fail_after_s): async with fail_after_w_trace(fail_after_s):
test_log.cancel( test_log.cancel(
f'test_dynamic_pub_sub: ' f'test_dynamic_pub_sub: '
f'enter `trio.fail_after({fail_after_s})` scope' f'enter `fail_after_w_trace({fail_after_s})` scope'
) )
try: try:
async with tractor.open_nursery( async with tractor.open_nursery(
@ -258,15 +259,7 @@ def test_dynamic_pub_sub(
'test_dynamic_pub_sub: leaving `main()`' 'test_dynamic_pub_sub: leaving `main()`'
) )
# outer signal-based guard — survives a shielded-await deadlock def _run_and_match():
# since `signal.alarm` raises in the main thread regardless of
# trio's scope state. ONLY armed under fork-based backends since
# the bug we're chasing is MTF-specific.
import signal
armed_alarm: bool = bool(is_forking_spawner)
if armed_alarm:
signal.alarm(_DIAG_CAP_S)
try:
try: try:
trio.run(main) trio.run(main)
pytest.fail(failed_to_raise_report) pytest.fail(failed_to_raise_report)
@ -287,11 +280,19 @@ def test_dynamic_pub_sub(
pytest.fail(failed_to_raise_report) pytest.fail(failed_to_raise_report)
test_log.exception('Got user-cancel exc AS EXPECTED') test_log.exception('Got user-cancel exc AS EXPECTED')
finally:
# always disarm so a passing test doesn't get killed # outer SIGALRM-based guard — survives a shielded-await
# post-trio.run by a stale alarm. # deadlock since `signal.alarm` raises in the main thread
if armed_alarm: # regardless of trio's scope state, AND captures a full diag
signal.alarm(0) # snapshot to `$XDG_CACHE_HOME/tractor/hung-dumps/` before
# re-raising. ONLY armed under fork-based backends since the
# bug we're chasing is MTF-specific. Cap = `fail_after_s + 5`
# so the trio-native path always wins when it works.
if is_forking_spawner:
with afk_alarm_w_trace(fail_after_s + 5):
_run_and_match()
else:
_run_and_match()
@tractor.context @tractor.context

View File

@ -7,6 +7,7 @@ import signal
import platform import platform
import time import time
from itertools import repeat from itertools import repeat
from typing import Type
import pytest import pytest
import trio import trio
@ -14,6 +15,7 @@ import tractor
from tractor._testing import ( from tractor._testing import (
tractor_test, tractor_test,
) )
from tractor._testing.trace import FailAfterWTraceFactory
from .conftest import no_windows from .conftest import no_windows
@ -21,14 +23,44 @@ _non_linux: bool = platform.system() != 'Linux'
_friggin_windows: bool = platform.system() == 'Windows' _friggin_windows: bool = platform.system() == 'Windows'
pytestmark = pytest.mark.skipon_spawn_backend( pytestmark = [
# Multi-actor cancel cascades under
# `--spawn-backend=subint` trip the abandoned-subint
# GIL-hostage class — a stuck subint can starve the
# parent's trio loop and block cancel-delivery.
# Apply the skip module-wide rather than per-test
# since every test here exercises the same cascade.
pytest.mark.skipon_spawn_backend(
'subint', 'subint',
reason=( reason=(
'XXX SUBINT HANGING TEST XXX\n' 'XXX SUBINT GIL-CONTENTION HANGING TEST XXX\n'
'See oustanding issue(s)\n' 'Cancel cascades under '
# TODO, put issue link! '`--spawn-backend=subint` trip the abandoned-subint '
'GIL-hostage class — see\n'
' - `ai/conc-anal/subint_sigint_starvation_issue.md` '
'(GIL-hostage, SIGINT-unresponsive)\n'
' - `ai/conc-anal/subint_cancel_delivery_hang_issue.md` '
'(sibling: parent parks on dead chan)\n'
' - https://github.com/goodboy/tractor/issues/379 '
'(subint umbrella)\n'
) )
) ),
pytest.mark.usefixtures(
'reap_subactors_per_test',
# NOTE, cancellation tests stress the SIGKILL
# `hard_kill` path which leaks UDS sock-files when
# the subactor's IPC server `finally:` cleanup
# doesn't run. Track per-test for blame attribution.
'track_orphaned_uds_per_test',
# NOTE, cancel-cascade timing races (see
# `test_nested_multierrors`) can also leave a
# subactor spinning at 100% CPU when its cancel
# signal got swallowed mid-handshake. Catches the
# runaway-loop class that doesn't leak UDS socks
# but burns the box.
'detect_runaway_subactors_per_test',
),
]
async def assert_err(delay=0): async def assert_err(delay=0):
@ -55,7 +87,11 @@ async def do_nuthin():
], ],
ids=['no_args', 'unexpected_args'], ids=['no_args', 'unexpected_args'],
) )
def test_remote_error(reg_addr, args_err): def test_remote_error(
reg_addr: tuple,
args_err: tuple[dict, Type[Exception]],
set_fork_aware_capture,
):
''' '''
Verify an error raised in a subactor that is propagated Verify an error raised in a subactor that is propagated
to the parent nursery, contains the underlying boxed builtin to the parent nursery, contains the underlying boxed builtin
@ -120,17 +156,10 @@ def test_remote_error(reg_addr, args_err):
assert exc.boxed_type == errtype assert exc.boxed_type == errtype
# @pytest.mark.skipon_spawn_backend(
# 'subint',
# reason=(
# 'XXX SUBINT HANGING TEST XXX\n'
# 'See oustanding issue(s)\n'
# # TODO, put issue link!
# )
# )
def test_multierror( def test_multierror(
reg_addr: tuple[str, int], reg_addr: tuple[str, int],
start_method: str, start_method: str, # parametrized
set_fork_aware_capture, #: Callable,
): ):
''' '''
Verify we raise a ``BaseExceptionGroup`` out of a nursery where Verify we raise a ``BaseExceptionGroup`` out of a nursery where
@ -175,6 +204,8 @@ def test_multierror_fast_nursery(
start_method: str, start_method: str,
num_subactors: int, num_subactors: int,
delay: float, delay: float,
set_fork_aware_capture,
fail_after_w_trace: FailAfterWTraceFactory,
): ):
''' '''
Verify we raise a ``BaseExceptionGroup`` out of a nursery where Verify we raise a ``BaseExceptionGroup`` out of a nursery where
@ -183,6 +214,14 @@ def test_multierror_fast_nursery(
''' '''
async def main(): async def main():
# budget = 2× natural trio-backend cascade time for
# 25 errorer subactors (~14s observed). on-timeout
# diag snapshot → if the cancel cascade hangs
# (observed under MTF backend with N>=14 errorer
# subactors) we get a fresh ptree/wchan/py-spy dump
# on disk INSTEAD of an opaque pytest timeout-kill.
# See `tractor/_testing/trace.py` for the helper.
async with fail_after_w_trace(30.0):
async with tractor.open_nursery( async with tractor.open_nursery(
registry_addrs=[reg_addr], registry_addrs=[reg_addr],
) as nursery: ) as nursery:
@ -195,9 +234,23 @@ def test_multierror_fast_nursery(
) )
# with pytest.raises(trio.MultiError) as exc_info: # with pytest.raises(trio.MultiError) as exc_info:
with pytest.raises(BaseExceptionGroup) as exc_info: # NOTE, `trio.TooSlowError` from `fail_after_w_trace`
# bubbles UN-wrapped if `open_nursery.__aexit__` never
# gets re-entered; wrapped inside a `BaseExceptionGroup`
# if it did. Accept both shapes so the matcher itself
# doesn't lie about *what* failed.
with pytest.raises(
(BaseExceptionGroup, trio.TooSlowError),
) as exc_info:
trio.run(main) trio.run(main)
if isinstance(exc_info.value, trio.TooSlowError):
pytest.fail(
f'cancel cascade hung past 12s '
f'(num_subactors={num_subactors}, delay={delay}); '
f'see stderr for `fail_after_w_trace` snapshot path'
)
assert exc_info.type == ExceptionGroup assert exc_info.type == ExceptionGroup
err = exc_info.value err = exc_info.value
exceptions = err.exceptions exceptions = err.exceptions
@ -277,6 +330,7 @@ async def stream_forever():
async def test_cancel_infinite_streamer( async def test_cancel_infinite_streamer(
reg_addr: tuple, reg_addr: tuple,
start_method: str, start_method: str,
set_fork_aware_capture,
): ):
# stream for at most 1 seconds # stream for at most 1 seconds
with ( with (
@ -300,14 +354,6 @@ async def test_cancel_infinite_streamer(
assert n.cancelled assert n.cancelled
# @pytest.mark.skipon_spawn_backend(
# 'subint',
# reason=(
# 'XXX SUBINT HANGING TEST XXX\n'
# 'See oustanding issue(s)\n'
# # TODO, put issue link!
# )
# )
@pytest.mark.parametrize( @pytest.mark.parametrize(
'num_actors_and_errs', 'num_actors_and_errs',
[ [
@ -345,6 +391,7 @@ async def test_some_cancels_all(
reg_addr: tuple, reg_addr: tuple,
start_method: str, start_method: str,
loglevel: str, loglevel: str,
set_fork_aware_capture, #: Callable,
): ):
''' '''
Verify a subset of failed subactors causes all others in Verify a subset of failed subactors causes all others in
@ -462,32 +509,119 @@ async def spawn_and_error(
# 10, # 10,
# method='thread', # method='thread',
# ) # )
@tractor_test @pytest.mark.parametrize(
'depth',
[1, 3],
ids='depth={}'.format,
)
@tractor_test(
# bumped from the 30s default to cover fork-based
# cancel-cascade flakes; 2 spawners × 2 errorers × depth 1+
# cascade through 6 portal-wait_for_result paths each
# paying `terminate_after=1.6s` + UDS sock-unlink under
# MTF/UDS contention can easily blow past 30s.
# Trio backend is fast and won't notice the extra budget.
# See `ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.
timeout=10,
)
async def test_nested_multierrors( async def test_nested_multierrors(
reg_addr: tuple, reg_addr: tuple,
loglevel: str, loglevel: str,
start_method: str, start_method: str,
set_fork_aware_capture,
fail_after_w_trace: FailAfterWTraceFactory,
request: pytest.FixtureRequest,
depth: int,
): ):
''' '''
Test that failed actor sets are wrapped in `BaseExceptionGroup`s. This Test that failed actor sets are wrapped in `BaseExceptionGroup`s.
test goes only 2 nurseries deep but we should eventually have tests
for arbitrary n-depth actor trees. Parametrized over recursion `depth {1, 3}`:
- `depth=1`: shallow tree (2 spawners × 2 errorers, 2
levels). Cascade completes well within budget on ALL
backends including MTF regression-safety green case.
- `depth=3`: deep tree (2 spawners × recursive depth-3
spawn-and-error). On `main_thread_forkserver` this
trips the cancel-cascade shape-mismatch bug class
(see `ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`)
xfailed below.
''' '''
if start_method == 'trio': # XXX: `multiprocessing.forkserver` can't handle nested
depth = 3 # spawning at any depth — hangs / broken-pipes. Pre-existing
subactor_breadth = 2 # backend limitation, NOT depth-specific.
else:
# XXX: multiprocessing can't seem to handle any more then 2 depth
# process trees for whatever reason.
# Any more process levels then this and we see bugs that cause
# hangs and broken pipes all over the place...
if start_method == 'forkserver': if start_method == 'forkserver':
pytest.skip("Forksever sux hard at nested spawning...") pytest.skip("Forksever sux hard at nested spawning...")
depth = 1 # means an additional actor tree of spawning (2 levels deep)
subactor_breadth = 2 subactor_breadth = 2
with trio.fail_after(120): # MTF backend trips a probabilistic timing race in the
# cancel-cascade — NOT depth-gated; depth amplifies the
# variance so depth=3 misses nearly every run while
# depth=1 misses occasionally. Both get the xfail mark
# (with `strict=False`) since the bug class can fire at
# either depth.
#
# The scenario in detail:
#
# T=0 spawn spawner_0 + spawner_1 in parallel
# T=t1 spawner_0's child errors →
# RemoteActorError reaches root nursery
# T=t1+ε root nursery starts cancelling
# spawner_1's portal-wait
# T=t2 spawner_1's child errors → tries to send
# RemoteActorError back
#
# if t2 < t1+ε: BEG = [RAE, RAE] ← clean (xpass)
# if t2 > t1+ε: BEG = [RAE, Cancelled] ← race tripped (xfail)
#
# i.e. the assertion below (`isinstance(_, RemoteActorError)`)
# fails iff cancel-delivery beats the other tree's natural
# error-propagation. Depth amplifies `t2-t1` variance
# (longer per-tree paths = more skew); under MTF the
# fork-spawn jitter + UDS-contention widens both `t1` and
# `t2` further.
#
# With `strict=False` the clean-cascade cases (most
# depth=1 runs, rare depth=3 runs) report as `xpassed`
# while the race-tripped cases report as `xfailed` —
# neither flakes `--lf`. When MTF cancel-cascade
# eventually speeds up enough to close the race even at
# depth=3, BOTH variants will reliably `xpass` and
# pytest will yell — our signal to drop the marker. See
# `ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.
if start_method == 'main_thread_forkserver':
request.node.add_marker(
pytest.mark.xfail(
strict=False,
reason=(
f'MTF cancel-cascade shape-mismatch at '
f'depth={depth} (Cancelled races '
f'RemoteActorError in BEG); see conc-anal/'
'cancel_cascade_too_slow_under_main_thread_forkserver_issue.md'
),
)
)
# 6s budget: in the non-hang case (and on the trio
# backend) the whole spawn + cancel-cascade should
# complete in well under that. On the borderline hang
# case the `fail_after_w_trace` fires `TooSlowError`
# AND captures a ptree/wchan/py-spy snapshot to
# `$XDG_CACHE_HOME/tractor/hung-dumps/` for offline
# inspection. See
# `ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.
match (start_method, depth):
case ('trio', _):
timeout = 6
case ('main_thread_forkserver', 1):
timeout = 16
case ('main_thread_forkserver', 3):
timeout = 30
async with fail_after_w_trace(timeout):
try: try:
async with tractor.open_nursery() as nursery: async with tractor.open_nursery() as nursery:
for i in range(subactor_breadth): for i in range(subactor_breadth):
@ -658,14 +792,6 @@ async def spawn_sub_with_sync_blocking_task():
print('exiting first subactor layer..\n') print('exiting first subactor layer..\n')
# @pytest.mark.skipon_spawn_backend(
# 'subint',
# reason=(
# 'XXX SUBINT HANGING TEST XXX\n'
# 'See oustanding issue(s)\n'
# # TODO, put issue link!
# )
# )
@pytest.mark.parametrize( @pytest.mark.parametrize(
'man_cancel_outer', 'man_cancel_outer',
[ [
@ -685,7 +811,7 @@ async def spawn_sub_with_sync_blocking_task():
def test_cancel_while_childs_child_in_sync_sleep( def test_cancel_while_childs_child_in_sync_sleep(
loglevel: str, loglevel: str,
start_method: str, start_method: str,
spawn_backend: str, is_forking_spawner: bool,
debug_mode: bool, debug_mode: bool,
reg_addr: tuple, reg_addr: tuple,
man_cancel_outer: bool, man_cancel_outer: bool,
@ -701,7 +827,10 @@ def test_cancel_while_childs_child_in_sync_sleep(
''' '''
if start_method == 'forkserver': if start_method == 'forkserver':
pytest.skip("Forksever sux hard at resuming from sync sleep...") pytest.skip(
"`multiprocessing`'s forkserver sux hard at "
"resuming from sync sleep..."
)
async def main(): async def main():
# #
@ -744,7 +873,11 @@ def test_cancel_while_childs_child_in_sync_sleep(
# delay = 2 # is AssertionError in eg AND no TooSlowError !? # delay = 2 # is AssertionError in eg AND no TooSlowError !?
# is AssertionError in eg AND no _cs cancellation. # is AssertionError in eg AND no _cs cancellation.
delay = ( delay = (
6 if _non_linux 6 if (
_non_linux
or
is_forking_spawner
)
else 4 else 4
) )

View File

@ -10,18 +10,22 @@ import tractor
from tractor._testing import tractor_test from tractor._testing import tractor_test
@pytest.mark.trio def test_no_runtime():
async def test_no_runtime(): '''
"""A registrar must be established before any nurseries A registrar must be established before any nurseries
can be created. can be created.
(In other words ``tractor.open_root_actor()`` must be (In other words ``tractor.open_root_actor()`` must be
engaged at some point?) engaged at some point?)
"""
with pytest.raises(RuntimeError) : '''
async def main():
async with tractor.find_actor('doggy'): async with tractor.find_actor('doggy'):
pass pass
with pytest.raises(tractor._exceptions.NoRuntime) :
trio.run(main)
@tractor_test @tractor_test
async def test_self_is_registered(reg_addr): async def test_self_is_registered(reg_addr):

View File

@ -463,11 +463,20 @@ def reap(
grace: float = 3.0, grace: float = 3.0,
poll: float = 0.25, poll: float = 0.25,
log=print, log=print,
include_descendants: bool = True,
) -> tuple[list[int], list[int]]: ) -> tuple[list[int], list[int]]:
''' '''
Deliver SIGINT to each pid, wait up to `grace` Deliver SIGINT to each pid (AND its subtree
seconds for them to exit, then SIGKILL any that descendants when `include_descendants=True`, the
survive. default), wait up to `grace` seconds for them to
exit, then SIGKILL any that survive.
The subtree-walk is what makes a single `acli.reap`
invocation tear down a *full* leaked actor-tree
rather than just its init-adopted top. Without it,
repeated calls are needed: each pass kills the
current `ppid==1` level, the level below becomes
init-adopted, next pass kills those, etc.
Returns `(signalled, survivors_killed)` so callers Returns `(signalled, survivors_killed)` so callers
can report / assert. can report / assert.
@ -480,8 +489,43 @@ def reap(
if not pids: if not pids:
return ([], []) return ([], [])
# Expand each pid into its full subtree (descendants
# included) so a multi-level leaked actor-tree gets
# torn down in a single pass. Falls back to the
# original `pids` list if psutil isn't installed.
pids_to_signal: list[int] = list(pids)
if include_descendants:
try:
import psutil
except ImportError:
psutil = None
if psutil is not None:
seen: set[int] = set(pids)
for root in list(pids):
try:
p = psutil.Process(root)
for c in p.children(recursive=True):
if c.pid not in seen:
seen.add(c.pid)
pids_to_signal.append(c.pid)
except (
psutil.NoSuchProcess,
psutil.AccessDenied,
):
# raced / unprivileged — skip silently;
# the orphan-root itself still gets the
# signal below.
continue
n_extra: int = len(pids_to_signal) - len(pids)
if n_extra:
log(
f'[tractor-reap] expanded {len(pids)} '
f'orphan-root(s) → {len(pids_to_signal)} '
f'incl. {n_extra} subtree-descendant(s)'
)
signalled: list[int] = [] signalled: list[int] = []
for pid in pids: for pid in pids_to_signal:
try: try:
os.kill(pid, signal.SIGINT) os.kill(pid, signal.SIGINT)
signalled.append(pid) signalled.append(pid)
@ -1111,6 +1155,19 @@ def reap_subactors_per_test() -> int:
(`_reap_orphaned_subactors`) only kicks in at session (`_reap_orphaned_subactors`) only kicks in at session
end which is too late to save the cascade. end which is too late to save the cascade.
Reaps both:
1. direct descendants of `pytest` (`PPid==pytest_pid`)
2. NEW init-adopted tractor procs (`PPid==1` AND
`_is_tractor_subactor`) that appeared between
pre-yield and post-yield these are the leaked
subactors whose mid-tier parent died during the
cascade, reparenting them to init.
Pre-yield snapshot of init-adopted tractor procs is
used to scope (2) to THIS test's leaks only — without
it we'd also reap orphans from concurrent unrelated
tractor uses on the box (piker, etc.).
Apply at module-level on the topically-problematic Apply at module-level on the topically-problematic
test files via: test files via:
@ -1130,7 +1187,16 @@ def reap_subactors_per_test() -> int:
''' '''
parent_pid: int = os.getpid() parent_pid: int = os.getpid()
# Snapshot pre-existing init-adopted tractor procs so
# we can scope post-test reap to NEW orphans only.
pre_orphans: set[int] = set(find_orphans())
yield parent_pid yield parent_pid
pids: list[int] = find_descendants(parent_pid) pids: list[int] = find_descendants(parent_pid)
new_orphans: list[int] = [
pid for pid in find_orphans()
if pid not in pre_orphans
]
if new_orphans:
pids.extend(new_orphans)
if pids: if pids:
reap(pids, grace=3.0) reap(pids, grace=3.0)

View File

@ -38,6 +38,20 @@ import tractor
from tractor.spawn._spawn import SpawnMethodKey from tractor.spawn._spawn import SpawnMethodKey
import trio import trio
# Re-export `_testing.trace`'s pytest fixtures so they're
# picked up by pytest's plugin-discovery (this module is
# loaded via `pytest_plugins` from `pyproject.toml`). The
# `noqa: F401` annotations make linters tolerate the
# unused-looking imports — they're load-bearing for pytest
# discovery. The fixtures share their `name=` kw with the
# underlying CM functions; the python-level identifiers
# below carry the `_fixture` suffix to avoid module-scope
# collision (see `_testing/trace.py` for details).
from .trace import ( # noqa: F401
afk_alarm_w_trace_fixture,
fail_after_w_trace_fixture,
)
# Sub-plugin: zombie-subactor + UDS sock-file + shm # Sub-plugin: zombie-subactor + UDS sock-file + shm
# reaping fixtures live in `tractor._testing._reap` # reaping fixtures live in `tractor._testing._reap`
# alongside the underlying detection/cleanup helpers. # alongside the underlying detection/cleanup helpers.
@ -751,3 +765,37 @@ def set_fork_aware_capture(
# request=request, # request=request,
# start_method=start_method, # start_method=start_method,
# ) # )
def pytest_terminal_summary(
terminalreporter,
exitstatus: int,
config: pytest.Config,
) -> None:
'''
End-of-session summary: list all
`fail_after_w_trace`/`afk_alarm_w_trace` snapshot dirs
captured during the run so the human doesn't have to scroll
back through captured-stderr lines to find dump paths.
Reads from `tractor._testing.trace._SNAPSHOT_INDEX` which is
populated by `_do_capture_snapshot()` on each successful
snapshot capture.
No-op when zero snapshots were captured (most sessions).
'''
from .trace import _SNAPSHOT_INDEX
if not _SNAPSHOT_INDEX:
return
tr = terminalreporter
tr.write_sep('=', 'tractor hang-snapshot index')
tr.write_line(
f'{len(_SNAPSHOT_INDEX)} `fail_after_w_trace` / '
f'`afk_alarm_w_trace` snapshot(s) captured this session:'
)
for label, path in _SNAPSHOT_INDEX:
tr.write_line(f' {label}')
tr.write_line(f'{path}')

File diff suppressed because it is too large Load Diff

11
uv.lock
View File

@ -559,17 +559,18 @@ wheels = [
[[package]] [[package]]
name = "pytest" name = "pytest"
version = "8.3.5" version = "9.0.3"
source = { registry = "https://pypi.org/simple" } source = { registry = "https://pypi.org/simple" }
dependencies = [ dependencies = [
{ name = "colorama", marker = "sys_platform == 'win32'" }, { name = "colorama", marker = "sys_platform == 'win32'" },
{ name = "iniconfig" }, { name = "iniconfig" },
{ name = "packaging" }, { name = "packaging" },
{ name = "pluggy" }, { name = "pluggy" },
{ name = "pygments" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/ae/3c/c9d525a414d506893f0cd8a8d0de7706446213181570cdbd766691164e40/pytest-8.3.5.tar.gz", hash = "sha256:f4efe70cc14e511565ac476b57c279e12a855b11f48f212af1080ef2263d3845", size = 1450891, upload-time = "2025-03-02T12:54:54.503Z" } sdist = { url = "https://files.pythonhosted.org/packages/7d/0d/549bd94f1a0a402dc8cf64563a117c0f3765662e2e668477624baeec44d5/pytest-9.0.3.tar.gz", hash = "sha256:b86ada508af81d19edeb213c681b1d48246c1a91d304c6c81a427674c17eb91c", size = 1572165, upload-time = "2026-04-07T17:16:18.027Z" }
wheels = [ wheels = [
{ url = "https://files.pythonhosted.org/packages/30/3d/64ad57c803f1fa1e963a7946b6e0fea4a70df53c1a7fed304586539c2bac/pytest-8.3.5-py3-none-any.whl", hash = "sha256:c69214aa47deac29fad6c2a4f590b9c4a9fdb16a403176fe154b79c0b4d4d820", size = 343634, upload-time = "2025-03-02T12:54:52.069Z" }, { url = "https://files.pythonhosted.org/packages/d4/24/a372aaf5c9b7208e7112038812994107bc65a84cd00e0354a88c2c77a617/pytest-9.0.3-py3-none-any.whl", hash = "sha256:2c5efc453d45394fdd706ade797c0a81091eccd1d6e4bccfcd476e2b8e0ab5d9", size = 375249, upload-time = "2026-04-07T17:16:16.13Z" },
] ]
[[package]] [[package]]
@ -792,7 +793,7 @@ dev = [
{ name = "prompt-toolkit", specifier = ">=3.0.50" }, { name = "prompt-toolkit", specifier = ">=3.0.50" },
{ name = "psutil", specifier = ">=7.0.0" }, { name = "psutil", specifier = ">=7.0.0" },
{ name = "pyperclip", specifier = ">=1.9.0" }, { name = "pyperclip", specifier = ">=1.9.0" },
{ name = "pytest", specifier = ">=8.3.5" }, { name = "pytest", specifier = ">=9.0" },
{ name = "pytest-timeout", specifier = ">=2.3" }, { name = "pytest-timeout", specifier = ">=2.3" },
{ name = "stackscope", specifier = ">=0.2.2,<0.3" }, { name = "stackscope", specifier = ">=0.2.2,<0.3" },
{ name = "typing-extensions", specifier = ">=4.14.1" }, { name = "typing-extensions", specifier = ">=4.14.1" },
@ -815,7 +816,7 @@ sync-pause = [{ name = "greenback", marker = "python_full_version == '3.13.*'",
testing = [ testing = [
{ name = "pexpect", specifier = ">=4.9.0,<5" }, { name = "pexpect", specifier = ">=4.9.0,<5" },
{ name = "psutil", specifier = ">=7.0.0" }, { name = "psutil", specifier = ">=7.0.0" },
{ name = "pytest", specifier = ">=8.3.5" }, { name = "pytest", specifier = ">=9.0" },
{ name = "pytest-timeout", specifier = ">=2.3" }, { name = "pytest-timeout", specifier = ">=2.3" },
] ]

View File

@ -17,6 +17,10 @@ Provides:
(e.g. `piker`, `tractor`); (e.g. `piker`, `tractor`);
path -> use as-is. path -> use as-is.
default: `$XDG_RUNTIME_DIR/tractor`. default: `$XDG_RUNTIME_DIR/tractor`.
- `acli.dump_all <pid> [--out-dir] full snapshot bundle —
[--label]` ptree + hung_dump + bindspace
written to a timestamped dir
for sharing / AI introspection.
- `acli.reap [opts]` SC-polite zombie-subactor - `acli.reap [opts]` SC-polite zombie-subactor
reaper + optional `/dev/shm/` reaper + optional `/dev/shm/`
+ UDS sock-file sweeps. + UDS sock-file sweeps.
@ -29,188 +33,30 @@ Or source directly:
source ./xontrib/tractor_diag.xsh source ./xontrib/tractor_diag.xsh
Pipe-to-paste idiom (xonsh): Pipe-to-paste idiom (xonsh):
hung-dump pytest |t /tmp/hung.log acli.hung_dump pytest |t /tmp/hung.log
The diagnostic core lives in `tractor._testing.trace` so it
can also be invoked from inside pytest tests (e.g. via
`fail_after_w_trace` / `afk_alarm_w_trace` capture-on-hang
helpers) — these aliases are just thin terminal wrappers.
Requires `psutil` for full functionality (`ptree` and the Requires `psutil` for full functionality (`ptree` and the
`hung-dump` tree-walk). Falls back to `pgrep -P` recursion `hung_dump` tree-walk). Falls back to `pgrep -P` recursion if
if missing. missing.
""" """
import os
import re
import subprocess as sp
from pathlib import Path from pathlib import Path
try: from tractor._testing.trace import (
import psutil dump_all as _dump_all,
except ImportError: dump_hung_state,
psutil = None dump_proc_tree,
print( resolve_pids,
'[tractor-diag] `psutil` missing — ' scan_bindspace,
'acli.ptree disabled, acli.hung_dump uses pgrep fallback. '
'`uv pip install psutil` for full functionality.'
)
# matches tractor's UDS sock naming: `<actor_name>@<pid>.sock`
_UDS_SOCK_RE = re.compile(
r'^(?P<name>.+)@(?P<pid>\d+)\.sock$'
) )
# --- helpers -------------------------------------------------- # --- ptree ----------------------------------------------------
def _resolve_pids(arg: str) -> list:
'''Resolve a numeric pid OR a `pgrep -f` pattern.'''
if arg.isdigit():
return [int(arg)]
try:
out = sp.check_output(
['pgrep', '-f', arg],
text=True,
)
except sp.CalledProcessError:
return []
return [int(p) for p in out.split() if p]
def _walk_tree_psutil(pid: int) -> list:
'''Flat list `[Process, *descendants]` via psutil.'''
try:
p = psutil.Process(pid)
except psutil.NoSuchProcess:
return []
return [p] + p.children(recursive=True)
def _walk_tree_with_depth(pid: int):
'''
Yield `(proc, depth)` pairs walking `pid`'s tree. `depth==0`
is the root; `depth==1` are direct children, etc. Used by
`ptree` to render parent/child relationships visually.
'''
try:
root = psutil.Process(pid)
except psutil.NoSuchProcess:
return
yield root, 0
stack: list = [(root, 0)]
seen: set = {pid}
while stack:
parent, d = stack.pop()
try:
kids = parent.children()
except psutil.NoSuchProcess:
continue
for k in kids:
if k.pid in seen:
continue
seen.add(k.pid)
yield k, d + 1
stack.append((k, d + 1))
def _which_cgroup_slice(pid: int) -> str|None:
'''
Return which top-level systemd cgroup slice `pid` is
rooted in, or `None` if it's not in either:
- `'system'`: under `/system.slice/...` — typically
`.service` units (long-lived daemons explicitly
enabled via `systemctl enable`, e.g.
`auto-cpufreq.service`, `dbus.service`,
`systemd-journald.service`).
- `'user'`: under `/user.slice/user-<uid>.slice/...`
— typically `.scope` units that systemd auto-wraps
around desktop-launched apps + login-session
procs (e.g. `app-firefox-<id>.scope`,
`session-<id>.scope`).
- `None`: NOT in either slice — pid 1 is NOT
managing this proc via cgroup. Combined with
`ppid==1`, this is the genuine "leaked / parent
died" orphan signal.
Both slice categories are by-design `ppid==1` (pid 1
is actively managing them) and should NOT be flagged
as concerning orphans, but distinguishing them is
useful: `system.slice` is "real services on this
box", `user.slice` is "stuff in your login session".
Returns `None` on any read error (proc gone, perm
denied, non-Linux, etc.) — callers should treat that
as "unknown, classify as plain orphan".
'''
try:
with open(f'/proc/{pid}/cgroup') as f:
cg: str = f.read()
except (
FileNotFoundError,
PermissionError,
ProcessLookupError,
OSError,
):
return None
if '/system.slice/' in cg:
return 'system'
if '/user.slice/' in cg:
return 'user'
return None
def _walk_tree_pgrep(pid: int) -> list:
'''psutil-less fallback — recursive `pgrep -P`.'''
out = [pid]
try:
kids = sp.check_output(
['pgrep', '-P', str(pid)],
text=True,
).split()
except sp.CalledProcessError:
return out
for k in kids:
out.extend(_walk_tree_pgrep(int(k)))
return out
def _ensure_sudo_cached() -> bool:
'''
Ensure `sudo` credentials are cached so subsequent
`sudo -n` calls succeed without prompting.
Returns True if cached (or successfully refreshed),
False if user cancelled or sudo is unavailable.
Tries `sudo -n true` first as a no-op probe; if that
fails, runs `sudo -v` which prompts interactively to
validate/refresh the credential timestamp.
'''
# probe — already cached?
cached = sp.run(
['sudo', '-n', 'true'],
capture_output=True,
).returncode == 0
if cached:
return True
print(
'[tractor-diag] needs `sudo` for /proc/<pid>/stack '
'and `py-spy dump`; caching creds via `sudo -v`...'
)
try:
rc = sp.run(['sudo', '-v']).returncode
except KeyboardInterrupt:
print(' cancelled — proceeding without sudo')
return False
except FileNotFoundError:
print(' sudo not on PATH — proceeding without sudo')
return False
return rc == 0
# --- ptree ---------------------------------------------------
def _ptree(args): def _ptree(args):
''' '''
@ -220,36 +66,9 @@ def _ptree(args):
usage: acli.ptree [--tree|-t] <pid|pgrep-pattern> [...] usage: acli.ptree [--tree|-t] <pid|pgrep-pattern> [...]
classification (per-proc, not per-tree): See `tractor._testing.trace.dump_proc_tree()` for the
bucket semantics + classification details.
- zombies: `status in (Z, X)` — defunct, parent
hasn't reaped (or kernel-marked dead).
- orphans: `ppid == 1` — original parent exited;
has been reparented to init. Includes
the *root* of an abandoned tree AND
any descendant that ended up reparented
to init mid-flight.
- live: real parent (`ppid > 1`), non-defunct.
Trees of orphan roots are still walked — their
descendants show as `live` if they themselves still
have a real (non-init) parent (the orphan root), but
the orphan root itself appears in `orphans`.
Cross-bucket parent annotation (always emitted):
when a row's parent (by ppid) lives in a *different*
severity bucket, the row is suffixed with
`[parent: <pid> (in `<bucket>`)]` so the visual
`└─` marker still resolves to a findable parent
even when bucketing scatters parent and child into
separate sections.
`--tree` / `-t` flag (opt-in):
additionally emit a flat walk-order `## tree`
section at the top — a contiguous parent-child
tree shape with no severity-grouping. Same procs,
no annotations needed because each parent appears
directly above its children.
''' '''
flag_tree: bool = False flag_tree: bool = False
pos_args: list = [] pos_args: list = []
@ -262,224 +81,19 @@ def _ptree(args):
if not pos_args: if not pos_args:
print('usage: acli.ptree [--tree|-t] <pid|pgrep-pattern> [...]') print('usage: acli.ptree [--tree|-t] <pid|pgrep-pattern> [...]')
return 1 return 1
if psutil is None:
print('ptree requires psutil; install via `uv pip install psutil`')
return 1
roots: list = [] roots: list = []
for a in pos_args: for a in pos_args:
roots.extend(_resolve_pids(a)) roots.extend(resolve_pids(a))
roots = sorted(set(roots)) roots = sorted(set(roots))
if not roots: if not roots:
print(f'(no procs match: {pos_args})') print(f'(no procs match: {pos_args})')
return 1 return 1
# statuses considered "defunct" — STATUS_ZOMBIE is the print(dump_proc_tree(roots, flag_tree=flag_tree), end='')
# common case (`Z`); STATUS_DEAD (`X`) is rarer but kernel-
# reported and equally not-coming-back.
defunct_statuses: set = {
psutil.STATUS_ZOMBIE,
getattr(psutil, 'STATUS_DEAD', 'dead'),
}
seen: set = set()
walk_order: list = [] # [(proc, depth)] preserved walk order
live: list = [] # [(proc, depth)]
orphans: list = []
# `ppid==1` AND rooted in `/system.slice/` cgroup —
# real systemd-managed services (e.g. `auto-cpufreq`,
# `NetworkManager`).
system_slice: list = []
# `ppid==1` AND rooted in `/user.slice/.../*.scope` —
# desktop-launched apps wrapped by systemd-user in
# transient `.scope` units (e.g. Firefox, browsers,
# editors started from a launcher).
user_slice: list = []
zombies: list = []
gone: list = []
# parent-bucket lookup populated post-classification so
# `_row()` can annotate cross-bucket parent refs.
pid_to_bucket: dict = {}
for r in roots:
for (p, depth) in _walk_tree_with_depth(r):
if p.pid in seen:
continue
seen.add(p.pid)
try:
status: str = p.status()
ppid: int = p.ppid()
except psutil.NoSuchProcess:
gone.append(p.pid)
continue
entry = (p, depth)
# severity order:
# zombie > orphan > system-slice > user-slice > live
# `ppid==1` splits into:
# - `system-slice` (rooted in `/system.slice/` —
# real services, by-design `ppid==1`)
# - `user-slice` (rooted in
# `/user.slice/.../*.scope` — desktop apps
# wrapped by systemd-user, by-design `ppid==1`)
# - `orphans` (everything else with `ppid==1` —
# genuinely concerning).
if status in defunct_statuses:
zombies.append(entry)
pid_to_bucket[p.pid] = 'zombies'
elif ppid == 1:
slice_kind: str|None = _which_cgroup_slice(p.pid)
if slice_kind == 'system':
system_slice.append(entry)
pid_to_bucket[p.pid] = 'system-slice'
elif slice_kind == 'user':
user_slice.append(entry)
pid_to_bucket[p.pid] = 'user-slice'
else:
orphans.append(entry)
pid_to_bucket[p.pid] = 'orphans'
else:
live.append(entry)
pid_to_bucket[p.pid] = 'live'
walk_order.append(entry)
total: int = (
len(live)
+ len(orphans)
+ len(system_slice)
+ len(user_slice)
+ len(zombies)
)
print(f'# ptree: {total} procs across roots {roots}')
hdr = ' ' + 'PID'.rjust(7) + ' ' + 'PPID'.rjust(7) + ' '
hdr += 'STATUS'.ljust(10) + ' CMD'
def _row(entry, bucket: str|None = None):
'''
Render `(proc, depth)` as an aligned row. Tree depth is
rendered as a `└─` marker on the CMD column so PID/PPID/
STATUS stay column-aligned.
When `bucket` is given AND the row's parent lives in a
*different* bucket, append a `[parent: <pid> (in `<b>`)]`
suffix so the `└─` marker can be resolved across the
severity-section split.
'''
p, depth = entry
tree_pfx = (' ' * depth) + ('└─ ' if depth > 0 else '')
# cross-bucket parent annotation; safe to compute up
# front because `p.ppid()` is cheap and rarely
# raises (parent pid is read from `/proc/<pid>/stat`,
# cached by psutil).
parent_anno: str = ''
if (
bucket is not None
and depth > 0
):
try:
parent_pid: int = p.ppid()
except psutil.NoSuchProcess:
parent_pid = 0
if parent_pid and parent_pid != 1:
parent_bucket: str|None = pid_to_bucket.get(parent_pid)
if (
parent_bucket is not None
and parent_bucket != bucket
):
parent_anno = (
f' [parent: {parent_pid} '
f'(in `{parent_bucket}`)]'
)
# NOTE: `psutil.ZombieProcess` is a *subclass* of
# `psutil.NoSuchProcess`, but the proc is NOT gone —
# it's a zombie whose `/proc/<pid>/cmdline` is empty/
# unreadable. Catch it FIRST so we still render a
# row (using fields that DO work on zombies: pid,
# ppid, status, name).
try:
cmd = ' '.join(p.cmdline())[:140] or '[' + p.name() + ']'
r = ' ' + str(p.pid).rjust(7)
r += ' ' + str(p.ppid()).rjust(7)
r += ' ' + p.status().ljust(10)
r += ' ' + tree_pfx + cmd + parent_anno
return r
except psutil.ZombieProcess:
try:
ppid_str = str(p.ppid())
name = p.name()
except psutil.NoSuchProcess:
ppid_str, name = '?', '?'
r = ' ' + str(p.pid).rjust(7)
r += ' ' + ppid_str.rjust(7)
r += ' ' + 'zombie'.ljust(10)
r += ' ' + tree_pfx + '[' + name + ' <defunct>]' + parent_anno
return r
except psutil.NoSuchProcess:
return ' ' + str(p.pid).rjust(7) + ' (gone mid-walk)'
def _section(
title: str,
procs: list,
hint: str = '',
bucket: str|None = None,
):
print(f'\n## {title} ({len(procs)})' + (f' — {hint}' if hint else ''))
if not procs:
print(' (none)')
return
print(hdr)
for p in procs:
print(_row(p, bucket=bucket))
# `--tree` opt-in: emit a flat walk-order section first
# so the parent-child tree shape is contiguous (no
# severity-grouping). No `bucket` arg → no cross-bucket
# annotation, since each parent appears directly above
# its children here.
if flag_tree:
_section(
'tree', walk_order,
'flat walk-order, parent-child preserved',
)
# severity-ordered: most concerning first. Each section
# passes its own `bucket` name so `_row()` can annotate
# rows whose parents live in a different section.
_section(
'zombies', zombies,
'status `Z`/`X`, parent has not reaped',
bucket='zombies',
)
_section(
'orphans', orphans,
'`ppid==1`, NOT in a `system.slice`/`user.slice` cgroup '
'(likely leaked / parent gone)',
bucket='orphans',
)
_section(
'system-slice', system_slice,
'`ppid==1`, rooted under `/system.slice/` '
'(real systemd-managed service — daemon, login '
'session manager, etc; not a leak)',
bucket='system-slice',
)
_section(
'user-slice', user_slice,
'`ppid==1`, rooted under `/user.slice/.../*.scope` '
'(desktop-launched app wrapped by systemd-user — '
'browser, editor, etc; not a leak)',
bucket='user-slice',
)
_section('live', live, bucket='live')
if gone:
print(f'\n## gone-during-walk ({len(gone)}): {gone}')
# --- hung-dump ------------------------------------------------ # --- hung-dump -----------------------------------------------
def _hung_dump(args): def _hung_dump(args):
''' '''
@ -489,248 +103,116 @@ def _hung_dump(args):
usage: acli.hung_dump <pid|pgrep-pattern> [...] usage: acli.hung_dump <pid|pgrep-pattern> [...]
note: `/proc/<pid>/stack` and `py-spy dump` typically note: `/proc/<pid>/stack` and `py-spy dump` typically
require CAP_SYS_PTRACE — invoked via `sudo -n`. run require CAP_SYS_PTRACE — invoked via `sudo -n`. If sudo
`sudo true` first to cache creds. isn't cached this alias prompts (via `sudo -v`); for the
non-interactive equivalent see
`tractor._testing.trace.dump_hung_state(allow_sudo_prompt=False)`.
''' '''
if not args: if not args:
print('usage: acli.hung_dump <pid|pgrep-pattern> [...]') print('usage: acli.hung_dump <pid|pgrep-pattern> [...]')
return 1 return 1
# cache sudo creds upfront so per-pid `sudo -n` calls
# for `cat /proc/<pid>/stack` and `py-spy dump` don't
# each prompt (or silently fail).
have_sudo: bool = _ensure_sudo_cached()
roots: list = [] roots: list = []
for a in args: for a in args:
roots.extend(_resolve_pids(a)) roots.extend(resolve_pids(a))
roots = sorted(set(roots)) roots = sorted(set(roots))
if not roots: if not roots:
print(f'(no procs match: {args})') print(f'(no procs match: {args})')
return 1 return 1
pids: list = []
seen: set = set()
for r in roots:
if psutil is not None:
walk = [p.pid for p in _walk_tree_psutil(r)]
else:
walk = _walk_tree_pgrep(r)
for pid in walk:
if pid not in seen:
seen.add(pid)
pids.append(pid)
print(f'# tree: {pids}')
print('\n## ps forest')
$[ps -o pid,ppid,pgid,stat,cmd -p @(','.join(map(str, pids)))]
for pid in pids:
print(f'\n## pid {pid}')
for f in ('wchan', 'stack'):
path = Path(f'/proc/{pid}/{f}')
try:
txt = path.read_text().rstrip()
print(f'-- /proc/{pid}/{f} --\n{txt}')
except PermissionError:
if not have_sudo:
print( print(
f'-- /proc/{pid}/{f}: ' dump_hung_state(roots, allow_sudo_prompt=True),
'PermissionError (no sudo) --' end='',
) )
continue
try:
txt = sp.check_output(
['sudo', '-n', 'cat', str(path)],
text=True,
stderr=sp.DEVNULL,
).rstrip()
print(f'-- /proc/{pid}/{f} (sudo) --\n{txt}')
except sp.CalledProcessError:
print(
f'-- /proc/{pid}/{f}: '
'sudo cred expired? rerun --'
)
except FileNotFoundError:
print(f'-- /proc/{pid}/{f}: proc gone --')
print(f'-- py-spy {pid} --')
if not have_sudo:
print(' (skipped — no sudo)')
continue
try:
$[sudo -n py-spy dump --pid @(pid) --locals]
except Exception as e:
print(f' (py-spy failed: {e})')
# --- bindspace-scan ------------------------------------------- # --- bindspace-scan ------------------------------------------
def _bindspace_scan(args): def _bindspace_scan(args):
''' '''
Scan a tractor UDS bindspace dir for orphan sock files Scan a tractor UDS bindspace dir for orphan sock files.
(those whose embedded `<pid>` no longer corresponds to
a live process).
usage: acli.bindspace_scan [<name>|<dir>] usage: acli.bindspace_scan [<name>|<dir>]
- no arg -> `$XDG_RUNTIME_DIR/tractor` See `tractor._testing.trace.scan_bindspace()` for full arg
(or `/run/user/<uid>/tractor`) semantics + output-bucket details.
- bare `<name>` -> `$XDG_RUNTIME_DIR/<name>`,
for projects like `piker` that bind
their own sibling sub-dir alongside
tractor's default
- path (abs or
containing `/`) -> use as-is
''' '''
runtime: str = os.environ.get( arg: str | None = args[0] if args else None
'XDG_RUNTIME_DIR', print(scan_bindspace(arg), end='')
f'/run/user/{os.getuid()}',
)
if args:
arg: str = args[0]
if (
arg.startswith('/')
or
'/' in arg
):
bs_dir = Path(arg)
else:
# bare name -> `$XDG_RUNTIME_DIR/<name>` so
# callers can say `acli.bindspace_scan piker`
bs_dir = Path(runtime) / arg
else:
bs_dir = Path(runtime) / 'tractor'
if not bs_dir.exists():
print(f'(no bindspace at {bs_dir})') # --- dump-all (snapshot bundle) ------------------------------
def _dump_all_alias(args):
'''
Capture a full diag snapshot bundle for a hung proc-tree
into a timestamped directory for offline / AI inspection.
usage: acli.dump_all <pid|pgrep-pat>
[--label <label>]
[--out-dir <path>]
Writes:
<out_dir>/<label>__<ts>/{trace.txt, bindspace.txt, meta.json}
Defaults:
--label = `manual`
--out-dir = `$XDG_CACHE_HOME/tractor/hung-dumps/`
(fallback `~/.cache/tractor/hung-dumps/`)
'''
import argparse
parser = argparse.ArgumentParser(
prog='acli.dump_all',
description=_dump_all_alias.__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
'target',
help='pid or pgrep -f pattern',
)
parser.add_argument(
'--label', '-l',
default='manual',
help='snapshot dir label prefix (default: `manual`)',
)
parser.add_argument(
'--out-dir', '-o',
type=Path,
default=None,
help='snapshot root dir (default: '
'$XDG_CACHE_HOME/tractor/hung-dumps/)',
)
try:
ns = parser.parse_args(args)
except SystemExit as se:
return int(se.code) if se.code is not None else 0
pids: list = resolve_pids(ns.target)
if not pids:
print(f'(no procs match: {ns.target})')
return 1 return 1
socks = sorted(bs_dir.glob('*.sock')) # snapshot scoped to ONE root — pick the first matched
print(f'## bindspace {bs_dir} ({len(socks)} sock file(s))') # pid. Multi-root snapshots can be done by invoking
# `acli.dump_all <pid>` per root.
live_active: list = [] # PID alive AND ppid != 1 root_pid: int = pids[0]
live_orphaned: list = [] # PID alive AND ppid == 1 (init-adopted) if len(pids) > 1:
dead_orphans: list = [] # PID gone, sock stale
bogus: list = []
def _ppid(pid: int) -> int | None:
'''
Read `/proc/<pid>/stat` -> ppid. Returns None on race
(proc died between `os.kill(pid, 0)` succeeding and this
read), permission errors, or non-linux.
'''
try:
with open(f'/proc/{pid}/stat') as f:
# field [3] of `man 5 proc` `/proc/<pid>/stat`
# NB: field [1] is `(comm)` which can contain
# spaces and parens — split from the *last*
# `)` to avoid that bullshit.
stat: str = f.read()
after_comm: str = stat.rsplit(')', 1)[1].strip()
return int(after_comm.split()[1]) # state(0) ppid(1)
except (FileNotFoundError, PermissionError, ProcessLookupError, OSError):
return None
for s in socks:
m = _UDS_SOCK_RE.match(s.name)
if not m:
bogus.append(s)
continue
pid = int(m['pid'])
name = m['name']
try:
os.kill(pid, 0)
except ProcessLookupError:
dead_orphans.append((s, pid, name))
continue
except PermissionError:
# exists but owned by another user — treat as live-active
# (we can't read its /proc/<pid>/stat to check ppid)
live_active.append((s, pid, name, None))
continue
# PID is alive in our euid view; classify by ppid
ppid: int | None = _ppid(pid)
if ppid == 1:
# adopted by init -> the original parent reaped
# without cleaning up this sub. Same class as
# what `acli.reap` detects.
live_orphaned.append((s, pid, name, ppid))
else:
live_active.append((s, pid, name, ppid))
print(f'\n## live-active ({len(live_active)}) — PID alive, parent still own it')
if not live_active:
print(' (none)')
for s, pid, name, ppid in live_active:
row = ' ' + str(pid).rjust(7)
row += ' ' + name.ljust(32)
row += ' ' + s.name
if ppid is not None:
row += f' (ppid={ppid})'
print(row)
print( print(
f'\n## orphaned-alive ({len(live_orphaned)}) ' f'[acli.dump_all] {len(pids)} pids matched '
f'— PID alive but `ppid==1`, parent reaped; ' f'{ns.target!r}; snapshotting tree from {root_pid} '
f'`acli.reap` candidate' f'(re-run per-pid for others: {pids[1:]})'
) )
if not live_orphaned:
print(' (none)')
for s, pid, name, ppid in live_orphaned:
row = ' ' + str(pid).rjust(7)
row += ' ' + name.ljust(32)
row += ' ' + s.name + ' (adopted by init)'
print(row)
print(f'\n## orphaned-dead ({len(dead_orphans)}) — PID gone, sock stale') dump_dir = _dump_all(
if not dead_orphans: root_pid,
print(' (none)') out_dir=ns.out_dir,
for s, pid, name in dead_orphans: label=ns.label,
row = ' ' + str(pid).rjust(7) allow_sudo_prompt=True, # CLI: ok to prompt
row += ' ' + name.ljust(32)
row += ' ' + s.name + ' (no live proc)'
print(row)
if bogus:
print(
f'\n## non-tractor ({len(bogus)}) '
f'— filename lacks `@<pid>` suffix, '
f'cannot determine liveness intrinsically'
) )
for s in bogus: print(f'[acli.dump_all] snapshot written to: {dump_dir}')
print(f' {s.name}')
# show a copy-pastable `ss` cmd per sock so the
# caller can resolve listener-PID externally
# (e.g. for piker's `chart.sock` / `pikerd.sock`
# style flat names). `ss -lpx 'src = <path>'`
# prints `users:(("<proc>",pid=<N>,fd=<M>))` for
# the listening side; empty output -> nobody's
# listening -> safe to unlink.
print(
'\nto check liveness manually '
'(needs `iproute2`/`ss`):'
)
for s in bogus:
print(f" ss -lpx 'src = {s}'")
if dead_orphans or live_orphaned:
print(
'\nto sweep BOTH orphaned-alive subs (graceful '
'SIGINT -> SIGKILL) AND dead-orphan socks in one shot:'
)
print(' acli.reap --uds')
if dead_orphans:
unlink_cmd = ' '.join(str(o[0]) for o in dead_orphans)
print(
'\n(or to unlink dead-orphan socks manually, '
"skipping `acli.reap`'s graceful-cancel ladder:)"
)
print(f' rm {unlink_cmd}')
# --- acli.reap ------------------------------------------------ # --- acli.reap ------------------------------------------------
@ -916,6 +398,7 @@ _TCLI_ALIASES: dict = {
'acli.ptree': _ptree, 'acli.ptree': _ptree,
'acli.hung_dump': _hung_dump, 'acli.hung_dump': _hung_dump,
'acli.bindspace_scan': _bindspace_scan, 'acli.bindspace_scan': _bindspace_scan,
'acli.dump_all': _dump_all_alias,
'acli.reap': _tractor_reap, 'acli.reap': _tractor_reap,
} }