Add snapshot evidence to cancel-cascade MTF issue doc

Append "Snapshot evidence (2026-05-13)" section to
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`
documenting `fail_after_w_trace` diag capture results for
`test_nested_multierrors` under the MTF backend — reproduction cmd,
ptree analysis, observed hang signature, and updated triage plan.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-05-13 20:02:02 -04:00
parent 01ce2857ea
commit 5372fd178a
1 changed files with 112 additions and 0 deletions

View File

@ -182,6 +182,118 @@ Target: 0 failures across 5 runs ⇒ ship. 12 failures
still rotating ⇒ apply (C). Same test failing twice
⇒ escalate to (E).
## Snapshot evidence (2026-05-13)
After landing the `fail_after_w_trace` /
`afk_alarm_w_trace` capture-on-timeout helpers
(`tractor._testing.trace`), `test_nested_multierrors`
on the `main_thread_forkserver` backend produces
**reproducible diag snapshots** at
`$XDG_CACHE_HOME/tractor/hung-dumps/test_nested_multierrors_start_method_main_thread_forkserver__<iso-ts>/`.
### Reproduction
```bash
pytest \
-v --verbose --durations=10 \
--spawn-backend=main_thread_forkserver \
--tpt-proto=uds \
--capture=sys --show-capture=stderr -rxX \
tests/test_cancellation.py::test_nested_multierrors
```
The test is `xfail(strict=False)` for MTF — it RUNS
each invocation so snapshots accumulate, but doesn't
break `--lf` workflow.
### Consistent shape across runs
5+ snapshots taken back-to-back show the SAME pattern:
- **Timing:** ~10s wall-clock total. Inner
`fail_after_w_trace(10)` fires at exactly T=10s;
cascade's `nursery.__aexit__` takes ~0.6s more to
gather + propagate the resulting
`BaseExceptionGroup`. **Trio backend completes the
SAME test in <6s** so the MTF cascade is ~2x
slower at minimum.
- **`BaseExceptionGroup` shape:** mixed
`[RemoteActorError, Cancelled]`. The first
subactor's natural error-propagation (`assert 0`
raised → `RemoteActorError` portal-result)
arrives before T=10s; the OTHER subactor's
portal-wait is still in flight at T=10s, gets
cancelled by `fail_after_w_trace`'s scope-cancel
→ returns `Cancelled` instead.
- **Orphan-spawn skew:** snapshot's `orphans` bucket
(after the `_is_tractor_subactor` cgroup-slice
override fix) consistently shows 2-4 init-adopted
procs at `depth_3` and `depth_1` levels — these
are the leaves whose parent (`depth_2` spawner)
was killed mid-cascade but who hadn't yet seen
the cancel signal themselves.
- **UDS sock-leak:** 2-6 dead-orphan socks per run
(varies with cascade timing). The
`track_orphaned_uds_per_test` fixture reaps them
post-test → contamination is isolated per-invocation.
### Capture mechanism
`fail_after_w_trace` covers two firing paths:
1. **`trio.TooSlowError`** raised at scope-exit
(body returned cleanly past deadline) — direct
`except` handler captures.
2. **Scope-cancel + body raises non-`Cancelled` exc**
(e.g. `nursery.__aexit__` wraps timeout-induced
`Cancelled` into a `BaseExceptionGroup` that
escapes before `trio.fail_after`'s exit-check
could fire `TooSlowError`) — body-raise `except`
handler checks `scope.cancel_called` and
captures if True. This path catches the
`test_nested_multierrors` shape specifically (see
"BaseExceptionGroup shape" above).
The snapshot dir contains:
- `trace.txt``ptree` + `hung_state` (kernel
`wchan`/`stack` + `py-spy dump --locals` when
sudo cached), with `include_strays=True`
surfacing any cross-test ghost subactor trees in
the `orphans` bucket.
- `bindspace.txt` — UDS bindspace classification
(live-active / orphaned-alive / orphaned-dead).
- `meta.json``{pid, label, captured_at, sudo_cached}`.
The end-of-session `pytest_terminal_summary` hook
in `tractor._testing.pytest` lists every snapshot
dir from the run so you don't have to scroll back
through captured-stderr lines:
```
========================= tractor hang-snapshot index ==========================
N `fail_after_w_trace` / `afk_alarm_w_trace` snapshot(s) captured this session:
<test-id>
→ /home/.../.cache/tractor/hung-dumps/<label>__<ts>
```
### Caveats
The snapshot fires AFTER the body-raise (not at the
exact moment of scope-cancel), so the parent's
py-spy frames show `_do_capture_snapshot` itself
running, NOT the cancel-cascade hang frame. To see
the actual hang state, manual `acli.ptree` /
`acli.hung_dump` from a second terminal at T=10s
would be needed — **not currently possible**
because per-test reaper fixtures clean up ~0.6s
post-timeout. See follow-up TODO in
`tractor/_testing/trace.py` for a
`TRACTOR_TRACE_HOLD=1` env-var pause mode.
## See also
- [#452](https://github.com/goodboy/tractor/issues/452) —