Add snapshot evidence to cancel-cascade MTF issue doc
Append "Snapshot evidence (2026-05-13)" section to `cancel_cascade_too_slow_under_main_thread_forkserver_issue.md` documenting `fail_after_w_trace` diag capture results for `test_nested_multierrors` under the MTF backend — reproduction cmd, ptree analysis, observed hang signature, and updated triage plan. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-codesubint_forkserver_backend
parent
01ce2857ea
commit
5372fd178a
|
|
@ -182,6 +182,118 @@ Target: 0 failures across 5 runs ⇒ ship. 1–2 failures
|
||||||
still rotating ⇒ apply (C). Same test failing twice
|
still rotating ⇒ apply (C). Same test failing twice
|
||||||
⇒ escalate to (E).
|
⇒ escalate to (E).
|
||||||
|
|
||||||
|
## Snapshot evidence (2026-05-13)
|
||||||
|
|
||||||
|
After landing the `fail_after_w_trace` /
|
||||||
|
`afk_alarm_w_trace` capture-on-timeout helpers
|
||||||
|
(`tractor._testing.trace`), `test_nested_multierrors`
|
||||||
|
on the `main_thread_forkserver` backend produces
|
||||||
|
**reproducible diag snapshots** at
|
||||||
|
`$XDG_CACHE_HOME/tractor/hung-dumps/test_nested_multierrors_start_method_main_thread_forkserver__<iso-ts>/`.
|
||||||
|
|
||||||
|
### Reproduction
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pytest \
|
||||||
|
-v --verbose --durations=10 \
|
||||||
|
--spawn-backend=main_thread_forkserver \
|
||||||
|
--tpt-proto=uds \
|
||||||
|
--capture=sys --show-capture=stderr -rxX \
|
||||||
|
tests/test_cancellation.py::test_nested_multierrors
|
||||||
|
```
|
||||||
|
|
||||||
|
The test is `xfail(strict=False)` for MTF — it RUNS
|
||||||
|
each invocation so snapshots accumulate, but doesn't
|
||||||
|
break `--lf` workflow.
|
||||||
|
|
||||||
|
### Consistent shape across runs
|
||||||
|
|
||||||
|
5+ snapshots taken back-to-back show the SAME pattern:
|
||||||
|
|
||||||
|
- **Timing:** ~10s wall-clock total. Inner
|
||||||
|
`fail_after_w_trace(10)` fires at exactly T=10s;
|
||||||
|
cascade's `nursery.__aexit__` takes ~0.6s more to
|
||||||
|
gather + propagate the resulting
|
||||||
|
`BaseExceptionGroup`. **Trio backend completes the
|
||||||
|
SAME test in <6s** — so the MTF cascade is ~2x
|
||||||
|
slower at minimum.
|
||||||
|
|
||||||
|
- **`BaseExceptionGroup` shape:** mixed
|
||||||
|
`[RemoteActorError, Cancelled]`. The first
|
||||||
|
subactor's natural error-propagation (`assert 0`
|
||||||
|
raised → `RemoteActorError` portal-result)
|
||||||
|
arrives before T=10s; the OTHER subactor's
|
||||||
|
portal-wait is still in flight at T=10s, gets
|
||||||
|
cancelled by `fail_after_w_trace`'s scope-cancel
|
||||||
|
→ returns `Cancelled` instead.
|
||||||
|
|
||||||
|
- **Orphan-spawn skew:** snapshot's `orphans` bucket
|
||||||
|
(after the `_is_tractor_subactor` cgroup-slice
|
||||||
|
override fix) consistently shows 2-4 init-adopted
|
||||||
|
procs at `depth_3` and `depth_1` levels — these
|
||||||
|
are the leaves whose parent (`depth_2` spawner)
|
||||||
|
was killed mid-cascade but who hadn't yet seen
|
||||||
|
the cancel signal themselves.
|
||||||
|
|
||||||
|
- **UDS sock-leak:** 2-6 dead-orphan socks per run
|
||||||
|
(varies with cascade timing). The
|
||||||
|
`track_orphaned_uds_per_test` fixture reaps them
|
||||||
|
post-test → contamination is isolated per-invocation.
|
||||||
|
|
||||||
|
### Capture mechanism
|
||||||
|
|
||||||
|
`fail_after_w_trace` covers two firing paths:
|
||||||
|
|
||||||
|
1. **`trio.TooSlowError`** raised at scope-exit
|
||||||
|
(body returned cleanly past deadline) — direct
|
||||||
|
`except` handler captures.
|
||||||
|
|
||||||
|
2. **Scope-cancel + body raises non-`Cancelled` exc**
|
||||||
|
(e.g. `nursery.__aexit__` wraps timeout-induced
|
||||||
|
`Cancelled` into a `BaseExceptionGroup` that
|
||||||
|
escapes before `trio.fail_after`'s exit-check
|
||||||
|
could fire `TooSlowError`) — body-raise `except`
|
||||||
|
handler checks `scope.cancel_called` and
|
||||||
|
captures if True. This path catches the
|
||||||
|
`test_nested_multierrors` shape specifically (see
|
||||||
|
"BaseExceptionGroup shape" above).
|
||||||
|
|
||||||
|
The snapshot dir contains:
|
||||||
|
- `trace.txt` — `ptree` + `hung_state` (kernel
|
||||||
|
`wchan`/`stack` + `py-spy dump --locals` when
|
||||||
|
sudo cached), with `include_strays=True`
|
||||||
|
surfacing any cross-test ghost subactor trees in
|
||||||
|
the `orphans` bucket.
|
||||||
|
- `bindspace.txt` — UDS bindspace classification
|
||||||
|
(live-active / orphaned-alive / orphaned-dead).
|
||||||
|
- `meta.json` — `{pid, label, captured_at, sudo_cached}`.
|
||||||
|
|
||||||
|
The end-of-session `pytest_terminal_summary` hook
|
||||||
|
in `tractor._testing.pytest` lists every snapshot
|
||||||
|
dir from the run so you don't have to scroll back
|
||||||
|
through captured-stderr lines:
|
||||||
|
|
||||||
|
```
|
||||||
|
========================= tractor hang-snapshot index ==========================
|
||||||
|
N `fail_after_w_trace` / `afk_alarm_w_trace` snapshot(s) captured this session:
|
||||||
|
<test-id>
|
||||||
|
→ /home/.../.cache/tractor/hung-dumps/<label>__<ts>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Caveats
|
||||||
|
|
||||||
|
The snapshot fires AFTER the body-raise (not at the
|
||||||
|
exact moment of scope-cancel), so the parent's
|
||||||
|
py-spy frames show `_do_capture_snapshot` itself
|
||||||
|
running, NOT the cancel-cascade hang frame. To see
|
||||||
|
the actual hang state, manual `acli.ptree` /
|
||||||
|
`acli.hung_dump` from a second terminal at T=10s
|
||||||
|
would be needed — **not currently possible**
|
||||||
|
because per-test reaper fixtures clean up ~0.6s
|
||||||
|
post-timeout. See follow-up TODO in
|
||||||
|
`tractor/_testing/trace.py` for a
|
||||||
|
`TRACTOR_TRACE_HOLD=1` env-var pause mode.
|
||||||
|
|
||||||
## See also
|
## See also
|
||||||
|
|
||||||
- [#452](https://github.com/goodboy/tractor/issues/452) —
|
- [#452](https://github.com/goodboy/tractor/issues/452) —
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue