Add snapshot evidence to cancel-cascade MTF issue doc

Append "Snapshot evidence (2026-05-13)" section to `cancel_cascade_too_slow_under_main_thread_forkserver_issue.md` documenting `fail_after_w_trace` diag capture results for `test_nested_multierrors` under the MTF backend — reproduction cmd, ptree analysis, observed hang signature, and updated triage plan. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-13 20:02:02 -04:00 · 2026-05-13 20:02:02 -04:00 · 5372fd178a
parent 01ce2857ea
commit 5372fd178a
1 changed files with 112 additions and 0 deletions
--- a/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md
+++ b/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md
@ -182,6 +182,118 @@ Target: 0 failures across 5 runs ⇒ ship. 1–2 failures
 still rotating ⇒ apply (C). Same test failing twice
 ⇒ escalate to (E).

+## Snapshot evidence (2026-05-13)
+
+After landing the `fail_after_w_trace` /
+`afk_alarm_w_trace` capture-on-timeout helpers
+(`tractor._testing.trace`), `test_nested_multierrors`
+on the `main_thread_forkserver` backend produces
+**reproducible diag snapshots** at
+`$XDG_CACHE_HOME/tractor/hung-dumps/test_nested_multierrors_start_method_main_thread_forkserver__<iso-ts>/`.
+
+### Reproduction
+
+```bash
+pytest \
+  -v --verbose --durations=10 \
+  --spawn-backend=main_thread_forkserver \
+  --tpt-proto=uds \
+  --capture=sys --show-capture=stderr -rxX \
+  tests/test_cancellation.py::test_nested_multierrors
+```
+
+The test is `xfail(strict=False)` for MTF — it RUNS
+each invocation so snapshots accumulate, but doesn't
+break `--lf` workflow.
+
+### Consistent shape across runs
+
+5+ snapshots taken back-to-back show the SAME pattern:
+
+- **Timing:** ~10s wall-clock total. Inner
+  `fail_after_w_trace(10)` fires at exactly T=10s;
+  cascade's `nursery.__aexit__` takes ~0.6s more to
+  gather + propagate the resulting
+  `BaseExceptionGroup`. **Trio backend completes the
+  SAME test in <6s** — so the MTF cascade is ~2x
+  slower at minimum.
+
+- **`BaseExceptionGroup` shape:** mixed
+  `[RemoteActorError, Cancelled]`. The first
+  subactor's natural error-propagation (`assert 0`
+  raised → `RemoteActorError` portal-result)
+  arrives before T=10s; the OTHER subactor's
+  portal-wait is still in flight at T=10s, gets
+  cancelled by `fail_after_w_trace`'s scope-cancel
+  → returns `Cancelled` instead.
+
+- **Orphan-spawn skew:** snapshot's `orphans` bucket
+  (after the `_is_tractor_subactor` cgroup-slice
+  override fix) consistently shows 2-4 init-adopted
+  procs at `depth_3` and `depth_1` levels — these
+  are the leaves whose parent (`depth_2` spawner)
+  was killed mid-cascade but who hadn't yet seen
+  the cancel signal themselves.
+
+- **UDS sock-leak:** 2-6 dead-orphan socks per run
+  (varies with cascade timing). The
+  `track_orphaned_uds_per_test` fixture reaps them
+  post-test → contamination is isolated per-invocation.
+
+### Capture mechanism
+
+`fail_after_w_trace` covers two firing paths:
+
+1. **`trio.TooSlowError`** raised at scope-exit
+   (body returned cleanly past deadline) — direct
+   `except` handler captures.
+
+2. **Scope-cancel + body raises non-`Cancelled` exc**
+   (e.g. `nursery.__aexit__` wraps timeout-induced
+   `Cancelled` into a `BaseExceptionGroup` that
+   escapes before `trio.fail_after`'s exit-check
+   could fire `TooSlowError`) — body-raise `except`
+   handler checks `scope.cancel_called` and
+   captures if True. This path catches the
+   `test_nested_multierrors` shape specifically (see
+   "BaseExceptionGroup shape" above).
+
+The snapshot dir contains:
+- `trace.txt` — `ptree` + `hung_state` (kernel
+  `wchan`/`stack` + `py-spy dump --locals` when
+  sudo cached), with `include_strays=True`
+  surfacing any cross-test ghost subactor trees in
+  the `orphans` bucket.
+- `bindspace.txt` — UDS bindspace classification
+  (live-active / orphaned-alive / orphaned-dead).
+- `meta.json` — `{pid, label, captured_at, sudo_cached}`.
+
+The end-of-session `pytest_terminal_summary` hook
+in `tractor._testing.pytest` lists every snapshot
+dir from the run so you don't have to scroll back
+through captured-stderr lines:
+
+```
+========================= tractor hang-snapshot index ==========================
+N `fail_after_w_trace` / `afk_alarm_w_trace` snapshot(s) captured this session:
+  <test-id>
+    → /home/.../.cache/tractor/hung-dumps/<label>__<ts>
+```
+
+### Caveats
+
+The snapshot fires AFTER the body-raise (not at the
+exact moment of scope-cancel), so the parent's
+py-spy frames show `_do_capture_snapshot` itself
+running, NOT the cancel-cascade hang frame. To see
+the actual hang state, manual `acli.ptree` /
+`acli.hung_dump` from a second terminal at T=10s
+would be needed — **not currently possible**
+because per-test reaper fixtures clean up ~0.6s
+post-timeout. See follow-up TODO in
+`tractor/_testing/trace.py` for a
+`TRACTOR_TRACE_HOLD=1` env-var pause mode.
+
 ## See also

 - [#452](https://github.com/goodboy/tractor/issues/452) —