From 5372fd178aaa93846843eebfdcd73fa4eb1b642d Mon Sep 17 00:00:00 2001 From: goodboy Date: Wed, 13 May 2026 20:02:02 -0400 Subject: [PATCH] Add snapshot evidence to cancel-cascade MTF issue doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Append "Snapshot evidence (2026-05-13)" section to `cancel_cascade_too_slow_under_main_thread_forkserver_issue.md` documenting `fail_after_w_trace` diag capture results for `test_nested_multierrors` under the MTF backend — reproduction cmd, ptree analysis, observed hang signature, and updated triage plan. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code --- ...slow_under_main_thread_forkserver_issue.md | 112 ++++++++++++++++++ 1 file changed, 112 insertions(+) diff --git a/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md b/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md index 780cbb67..7c3ae5b0 100644 --- a/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md +++ b/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md @@ -182,6 +182,118 @@ Target: 0 failures across 5 runs ⇒ ship. 1–2 failures still rotating ⇒ apply (C). Same test failing twice ⇒ escalate to (E). +## Snapshot evidence (2026-05-13) + +After landing the `fail_after_w_trace` / +`afk_alarm_w_trace` capture-on-timeout helpers +(`tractor._testing.trace`), `test_nested_multierrors` +on the `main_thread_forkserver` backend produces +**reproducible diag snapshots** at +`$XDG_CACHE_HOME/tractor/hung-dumps/test_nested_multierrors_start_method_main_thread_forkserver__/`. + +### Reproduction + +```bash +pytest \ + -v --verbose --durations=10 \ + --spawn-backend=main_thread_forkserver \ + --tpt-proto=uds \ + --capture=sys --show-capture=stderr -rxX \ + tests/test_cancellation.py::test_nested_multierrors +``` + +The test is `xfail(strict=False)` for MTF — it RUNS +each invocation so snapshots accumulate, but doesn't +break `--lf` workflow. + +### Consistent shape across runs + +5+ snapshots taken back-to-back show the SAME pattern: + +- **Timing:** ~10s wall-clock total. Inner + `fail_after_w_trace(10)` fires at exactly T=10s; + cascade's `nursery.__aexit__` takes ~0.6s more to + gather + propagate the resulting + `BaseExceptionGroup`. **Trio backend completes the + SAME test in <6s** — so the MTF cascade is ~2x + slower at minimum. + +- **`BaseExceptionGroup` shape:** mixed + `[RemoteActorError, Cancelled]`. The first + subactor's natural error-propagation (`assert 0` + raised → `RemoteActorError` portal-result) + arrives before T=10s; the OTHER subactor's + portal-wait is still in flight at T=10s, gets + cancelled by `fail_after_w_trace`'s scope-cancel + → returns `Cancelled` instead. + +- **Orphan-spawn skew:** snapshot's `orphans` bucket + (after the `_is_tractor_subactor` cgroup-slice + override fix) consistently shows 2-4 init-adopted + procs at `depth_3` and `depth_1` levels — these + are the leaves whose parent (`depth_2` spawner) + was killed mid-cascade but who hadn't yet seen + the cancel signal themselves. + +- **UDS sock-leak:** 2-6 dead-orphan socks per run + (varies with cascade timing). The + `track_orphaned_uds_per_test` fixture reaps them + post-test → contamination is isolated per-invocation. + +### Capture mechanism + +`fail_after_w_trace` covers two firing paths: + +1. **`trio.TooSlowError`** raised at scope-exit + (body returned cleanly past deadline) — direct + `except` handler captures. + +2. **Scope-cancel + body raises non-`Cancelled` exc** + (e.g. `nursery.__aexit__` wraps timeout-induced + `Cancelled` into a `BaseExceptionGroup` that + escapes before `trio.fail_after`'s exit-check + could fire `TooSlowError`) — body-raise `except` + handler checks `scope.cancel_called` and + captures if True. This path catches the + `test_nested_multierrors` shape specifically (see + "BaseExceptionGroup shape" above). + +The snapshot dir contains: +- `trace.txt` — `ptree` + `hung_state` (kernel + `wchan`/`stack` + `py-spy dump --locals` when + sudo cached), with `include_strays=True` + surfacing any cross-test ghost subactor trees in + the `orphans` bucket. +- `bindspace.txt` — UDS bindspace classification + (live-active / orphaned-alive / orphaned-dead). +- `meta.json` — `{pid, label, captured_at, sudo_cached}`. + +The end-of-session `pytest_terminal_summary` hook +in `tractor._testing.pytest` lists every snapshot +dir from the run so you don't have to scroll back +through captured-stderr lines: + +``` +========================= tractor hang-snapshot index ========================== +N `fail_after_w_trace` / `afk_alarm_w_trace` snapshot(s) captured this session: + + → /home/.../.cache/tractor/hung-dumps/