Add cancel-cascade `TooSlowError` flake analysis

Document the ~0.3% rotating `trio.TooSlowError`
flake under `--spawn-backend=main_thread_forkserver`
full-suite runs. Root cause: `hard_kill`'s per-sub
1.6s graceful timeout compounding across N subactors
in a cancel cascade, plus cumulative autouse-reaper
teardown overhead.

Covers symptom, observed flaking tests, root-cause
family, ranked mitigations (cap bump -> CPU-count-
aware cap -> `pytest-rerunfailures` -> `hard_kill`
tuning -> targeted profiling), and a verification
protocol.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
# Cancel-cascade `trio.TooSlowError` flakes under `main_thread_forkserver`
## Symptom
Running the full test suite under
```bash
./py313/bin/python -m pytest tests/ \
--tpt-proto=tcp \
--spawn-backend=main_thread_forkserver
```
surfaces a single, **rotating** `trio.TooSlowError`
failure each run. The failure isn't deterministic on
test identity — different test each run — but it
ALWAYS looks like:
```
FAILED tests/<file>::test_<name> - trio.TooSlowError
==== 1 failed, 373 passed, 17 skipped, 11-12 xfailed,
0-1 xpassed, ~550 warnings in ~5-6min ====
```
Pass rate: **~99.7%** (373 of 374 non-skip tests).
Wall-clock per full run: ~5-6 min.
## Tests observed flaking so far
Each row was the SOLE failure in a separate run:
| run # | test |
|---|---|
| 1 | `tests/test_advanced_streaming.py::test_dynamic_pub_sub[KeyboardInterrupt]` |
| 2 | `tests/test_infected_asyncio.py::test_context_spawns_aio_task_that_errors[parent_actor_cancels_child=False]` |
Both share the same shape:
- **Cancel cascade** of N subactors back to a parent root actor.
- N ≥ `multiprocessing.cpu_count()` for `test_dynamic_pub_sub`
(it spawns `cpus - 1` consumers + publisher + dynamic-consumer).
- N ≈ 2 for `test_context_spawns_aio_task_that_errors`
but each subactor is `infect_asyncio=True`, so each
cancel involves the trio↔asyncio guest-run unwind
which is structurally heavier than pure-trio.
- Test wraps the cascade in `trio.fail_after(N seconds)`
and the cap fires before the cascade completes.
The exact failing test rotates because each test is
independently close to the cap; whichever happens to
be unlucky in scheduling/CPU-contention on a given run
is the one that times out.
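The common shape, as a minimal sketch (actor names, counts and the explicit `an.cancel()` call are illustrative stand-ins for the real test bodies, assuming tractor's `open_nursery()`/`start_actor()` API):
```python
import trio
import tractor

async def cancel_cascade_shape(n_subs: int, fail_after_s: float = 30):
    # the spawn + cancel-cascade is wrapped in a hard wall-clock cap;
    # if teardown of all `n_subs` doesn't complete in time this raises
    # the observed `trio.TooSlowError`.
    with trio.fail_after(fail_after_s):
        async with tractor.open_nursery() as an:
            for i in range(n_subs):
                await an.start_actor(f'sub_{i}')
            # cancel every subactor back to the root; each one pays
            # `hard_kill`'s graceful-timeout ladder on the way down.
            await an.cancel()
```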
## Root-cause family
`hard_kill` (`tractor/spawn/_spawn.py:hard_kill`) runs
the SC-graceful teardown ladder per subactor:
1. `Portal.cancel_actor()` — graceful IPC cancel-req.
2. Wait `terminate_after=1.6s` for sub to exit.
3. If still alive: `proc.kill()` (SIGKILL).
4. (NEW) `_unlink_uds_bind_addrs()` — post-mortem
sock-file cleanup for UDS leaks (issue #452 fix).
For a cascade of N subactors, each pays steps 1-4. If
graceful-cancel doesn't complete within 1.6s for ANY
sub, that sub eats a full 1.6s of `move_on_after` plus
the `proc.wait()` post-SIGKILL.
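Condensed, that per-subactor ladder looks roughly like the sketch below (argument names are illustrative and the UDS-cleanup step is shown as a comment only; the real code lives in the spawn machinery):
```python
import trio

async def hard_kill_sketch(portal, proc, terminate_after: float = 1.6):
    # 1. graceful IPC cancel-request to the subactor
    await portal.cancel_actor()

    # 2. wait up to `terminate_after` for a clean exit; this is the
    #    per-sub budget that compounds across a cancel cascade.
    with trio.move_on_after(terminate_after):
        await proc.wait()
        return  # exited gracefully, no SIGKILL needed

    # 3. still alive after the timeout -> SIGKILL, then reap
    proc.kill()
    await proc.wait()

    # 4. post-mortem UDS sock-file cleanup (the issue #452 fix),
    #    i.e. the `_unlink_uds_bind_addrs()` step described above.
```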
Worst case under fork backend with N=cpus subs:
- N × 1.6s = 16s+ on a 10-core box just for the
graceful timeout phase
- Plus per-spawn fork-IPC handshake cost compounds
during teardown (each sub's IPC cleanup goes through
the same forkserver coordinator)
- Plus the new autouse fixtures
(`_track_orphaned_uds_per_test`,
`_detect_runaway_subactors_per_test`,
`_reap_orphaned_subactors`) all run at test
teardown, adding small (tens of ms) but cumulative
overhead
Current cap: 30s (`fail_after_s = 30 if
is_forking_spawner else 12`). Empirically fits the
median run but the tail breaks ~0.3% of the time.
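A back-of-envelope check of how close that cap sits to the tail on a 10-core box (figures pulled from the bullets above; the fixture-overhead number is a rough assumption):
```python
n_subs             = 10    # ~cpu_count() actors in the worst cascade
graceful_wait_s    = 1.6   # hard_kill's per-sub `terminate_after`
fixture_overhead_s = 0.1   # rough autouse-reaper teardown cost per test

worst_case_s = n_subs * graceful_wait_s + fixture_overhead_s
print(f'{worst_case_s:.1f}s of the 30s cap')  # ~16.1s before any IPC-unwind cost
```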
## NOT regressing
To confirm this is a flake and not a regression:
- Pre-`WakeupSocketpair`-patch baseline: tests
HUNG INDEFINITELY (busy-loop never released).
- Post-patch: pass-or-fail-fast, ~99.7% pass, the
occasional cap-hit fails in bounded time (<60s for
the offending test).
- Same test PASSES under `--spawn-backend=trio`
(no fork, no hard-kill compounding).
So the suite is dramatically better than before; the
remaining flake is a known-tolerable steady-state.
## Possible mitigations (ranked)
### A. Bump the cap further
Cheapest. Change the per-test `fail_after_s` from 30
to e.g. 60 for fork backends. Pros: trivial. Cons:
masks any genuine slowness regression we'd want to
catch.
### B. CPU-count-aware cap
For tests whose N scales with `cpu_count()`, scale
the cap too:
```python
from multiprocessing import cpu_count

fail_after_s = (
    max(30, cpu_count() * 3)  # 3s/actor floor
    if is_forking_spawner     # existing backend flag in the fixture
    else 12
)
```
Pros: scales with the actual cancel-cascade work.
Cons: still an arbitrary multiplier.
### C. `pytest-rerunfailures` for these tests only
Mark the known-flaky tests with
`@pytest.mark.flaky(reruns=1)` (needs
`pytest-rerunfailures` dep). A single retry absorbs the
genuine ~0.3% transient flakes.
Pros: no cap change, surfaces persistent failures
loudly. Cons: adds a dep, retries can mask real bugs
if used widely.
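A sketch of the (C) marking, assuming the `pytest-rerunfailures` plugin is added as a test dep (the test body below is a stand-in, not one of the real parametrized tests):
```python
import pytest

@pytest.mark.flaky(reruns=1)  # one silent retry before reporting a failure
def test_cancel_cascade_stand_in():
    ...
```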
### D. Reduce `hard_kill`'s `terminate_after`
Drop from 1.6s → 0.8s. Cuts the worst-case cascade
time roughly in half. Risks: fewer subs get a chance
to run their cleanup before SIGKILL → more orphaned
state for the autouse reapers to handle (ironically,
adds back overhead elsewhere).
### E. Profile + targeted fix
Add `log.devx()` markers in `hard_kill` to time each
phase. Identify if any subactor is consistently
hitting the 1.6s cap (vs. exiting in <0.1s). If so,
that sub has a teardown bug worth fixing at source.
Pros: actually fixes the underlying slowness. Cons:
real investigation work, deferred from this round.
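A sketch of the (E) instrumentation, assuming `log.devx()` is available wherever `hard_kill` gets its logger (phase names and arguments are illustrative):
```python
import trio

async def timed_hard_kill(portal, proc, log, terminate_after: float = 1.6):
    t0 = trio.current_time()
    await portal.cancel_actor()
    log.devx(f'cancel-req sent after {trio.current_time() - t0:.3f}s')

    with trio.move_on_after(terminate_after) as cs:
        await proc.wait()
    log.devx(
        f'graceful-exit wait ended at {trio.current_time() - t0:.3f}s, '
        f'hit the {terminate_after}s cap: {cs.cancelled_caught}'
    )

    if cs.cancelled_caught:
        proc.kill()
        await proc.wait()
        log.devx(f'SIGKILL reap done at {trio.current_time() - t0:.3f}s total')
```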
## Recommendation
Land this issue-doc as the tracker. Apply **(B)** as
a small follow-up — cheap and proportional. If it
still flakes, escalate to **(E)** with a `log.devx()`
profile-pass.
**(C)** is a backstop if **(B)** doesn't quite get there
and we need green CI faster than **(E)** can deliver.
## Verification protocol
After applying any mitigation:
```bash
# Run the suite N times back-to-back, count failures.
# A persistent failure on the SAME test == real bug.
# Failures rotating across tests == still cap-related.
for i in $(seq 1 5); do
./py313/bin/python -m pytest tests/ \
--tpt-proto=tcp \
--spawn-backend=main_thread_forkserver \
-q 2>&1 | tail -2
done
```
Target: 0 failures across 5 runs ⇒ ship. 1-2 failures
still rotating ⇒ apply (C). Same test failing twice
⇒ escalate to (E).
## See also
- [#452](https://github.com/goodboy/tractor/issues/452) —
UDS sock-file leak (related — `hard_kill`'s
cleanup phase contributes to cascade time)
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
— the upstream-trio fix that turned this from a
100% hang into a 0.3% flake
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
— the asyncio variant which contributes to one of
the rotating failures
- `tractor/spawn/_spawn.py::hard_kill` — the SIGKILL
cascade source
- `tractor/_testing/_reap.py::_track_orphaned_uds_per_test`,
`_detect_runaway_subactors_per_test`,
`_reap_orphaned_subactors` — autouse cleanup
fixtures whose cumulative teardown overhead
contributes to the cascade time