Add cancel-cascade `TooSlowError` flake analysis
Document the ~0.3% rotating `trio.TooSlowError` flake under
`--spawn-backend=main_thread_forkserver` full-suite runs. Root cause:
`hard_kill`'s per-sub 1.6s graceful timeout compounding across N subactors
in a cancel cascade, plus cumulative autouse-reaper teardown overhead.
Covers symptom, observed flaking tests, root-cause family, ranked
mitigations (cap bump -> CPU-count-aware cap -> `pytest-rerunfailures` ->
`hard_kill` tuning -> targeted profiling), and a verification protocol.

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
# Cancel-cascade `trio.TooSlowError` flakes under `main_thread_forkserver`

## Symptom

Running the full test suite under

```bash
./py313/bin/python -m pytest tests/ \
    --tpt-proto=tcp \
    --spawn-backend=main_thread_forkserver
```

surfaces a single, **rotating** `trio.TooSlowError`
failure each run. The failure isn't deterministic on
test identity — different test each run — but it
ALWAYS looks like:

```
FAILED tests/<file>::test_<name> - trio.TooSlowError
==== 1 failed, 373 passed, 17 skipped, 11–12 xfailed,
0–1 xpassed, ~550 warnings in ~6min ====
```

Pass rate: **~99.7%** (373 of 374 non-skip tests).
Wall-clock per full run: 5–6 min.
## Tests observed flaking so far

Each row was the SOLE failure in a separate run:

| run # | test |
|---|---|
| 1 | `tests/test_advanced_streaming.py::test_dynamic_pub_sub[KeyboardInterrupt]` |
| 2 | `tests/test_infected_asyncio.py::test_context_spawns_aio_task_that_errors[parent_actor_cancels_child=False]` |

Both share the same shape:

- **Cancel cascade** of N subactors back to a parent root actor.
- N ≥ `multiprocessing.cpu_count()` for `test_dynamic_pub_sub`
  (it spawns `cpus - 1` consumers + publisher + dynamic-consumer).
- N ≈ 2 for `test_context_spawns_aio_task_that_errors` —
  but each subactor is `infect_asyncio=True`, so each
  cancel involves the trio↔asyncio guest-run unwind
  which is structurally heavier than pure-trio.
- Test wraps the cascade in `trio.fail_after(N seconds)`
  and the cap fires before the cascade completes.

The exact failing test rotates because each test is
independently close to the cap; whichever happens to
be unlucky in scheduling/CPU-contention on a given run
is the one that times out.
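
For orientation, a minimal sketch of that shared shape (illustrative
only; the nursery body stands in for whatever each test actually
spawns, it is not either test's real code):

```python
import trio
import tractor

async def cascade_under_cap(fail_after_s: float = 30) -> None:
    # the whole spawn + teardown sits under one absolute cap; if the
    # cancel cascade of N subactors overruns it, trio.TooSlowError pops
    with trio.fail_after(fail_after_s):
        async with tractor.open_nursery() as an:
            # ... spawn N subactors via `an` and drive the test body here ...
            ...
        # leaving the nursery block runs the per-sub teardown whose
        # `hard_kill` cost is analyzed below
```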

## Root-cause family

`hard_kill` (`tractor/spawn/_spawn.py:hard_kill`) runs
the SC-graceful teardown ladder per subactor:

1. `Portal.cancel_actor()` — graceful IPC cancel-req.
2. Wait `terminate_after=1.6s` for sub to exit.
3. If still alive: `proc.kill()` (SIGKILL).
4. (NEW) `_unlink_uds_bind_addrs()` — post-mortem
   sock-file cleanup for UDS leaks (issue #452 fix).
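
The timing structure of that ladder, as a hedged per-subactor sketch
(not the actual `hard_kill` body; `portal` and `proc` are placeholders
for the real handles):

```python
import trio

async def kill_one_sub(portal, proc, terminate_after: float = 1.6) -> None:
    # steps 1-2: graceful IPC cancel-req bounded by the per-sub window
    with trio.move_on_after(terminate_after) as cs:
        await portal.cancel_actor()
        await proc.wait()
    if cs.cancelled_caught:
        # step 3: the sub outlived its window; SIGKILL and reap it
        proc.kill()
        await proc.wait()
    # step 4: post-mortem UDS sock-file cleanup would follow here
```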

For a cascade of N subactors, each pays steps 1–4. If
graceful-cancel doesn't complete within 1.6s for ANY
sub, that sub eats a full 1.6s of `move_on_after` plus
the `proc.wait()` post-SIGKILL.

Worst case under fork backend with N=cpus subs:

- N × 1.6s = 16s+ on a 10-core box just for the
  graceful timeout phase
- Plus per-spawn fork-IPC handshake cost compounds
  during teardown (each sub's IPC cleanup goes through
  the same forkserver coordinator)
- Plus the new autouse fixtures
  (`_track_orphaned_uds_per_test`,
  `_detect_runaway_subactors_per_test`,
  `_reap_orphaned_subactors`) all run at test
  teardown, adding small (10s of ms) but cumulative
  overhead
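
As a back-of-envelope check on those numbers (the per-test reaper
overhead figure is an assumed ~50ms, not a measurement):

```python
from multiprocessing import cpu_count

n_subs = cpu_count()      # e.g. 10 on the box described above
graceful_window = 1.6     # hard_kill's per-sub terminate_after
reaper_overhead = 0.05    # assumed ~50ms of autouse-fixture teardown per test

# worst case: every sub blows its graceful window and must be SIGKILLed
worst_case_s = n_subs * graceful_window + reaper_overhead
print(f'{worst_case_s:.2f}s of teardown vs. a 30s per-test cap')  # ~16s at 10 cores
```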

Current cap: 30s (`fail_after_s = 30 if
is_forking_spawner else 12`). Empirically it fits the
median run, but the tail breaks ~0.3% of the time.

## NOT regressing

To confirm this is a flake and not a regression:

- Pre-`WakeupSocketpair`-patch baseline: tests
  HUNG INDEFINITELY (busy-loop never released).
- Post-patch: pass-or-fail-fast, ~99.7% pass, the
  occasional cap-hit fails in bounded time (<60s for
  the offending test).
- Same test PASSES under `--spawn-backend=trio`
  (no fork, no hard-kill compounding).

So the suite is dramatically better than before; the
remaining flake is a known-tolerable steady-state.

## Possible mitigations (ranked)

### A. Bump the cap further

Cheapest. Change the per-test `fail_after_s` from 30
to e.g. 60 for fork backends. Pros: trivial. Cons:
masks any genuine slowness regression we'd want to
catch.

### B. CPU-count-aware cap

For tests whose N scales with `cpu_count()`, scale
the cap too:

```python
from multiprocessing import cpu_count

fail_after_s = (
    max(30, cpu_count() * 3)  # 3s/actor floor
    if is_forking_spawner
    else 12
)
```

Pros: scales with the actual cancel-cascade work.
Cons: still an arbitrary multiplier.

### C. `pytest-rerunfailures` for these tests only

Mark the known-flaky tests with
`@pytest.mark.flaky(reruns=1)` (needs the
`pytest-rerunfailures` dep). A single retry absorbs
the genuine ~0.3% transient flakes.

Pros: no cap change, surfaces persistent failures
loudly. Cons: adds a dep, and retries can mask real
bugs if used widely.
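
A minimal usage sketch, assuming the plugin is installed (the test
name here is illustrative, not one of the actual flaking tests):

```python
import pytest

@pytest.mark.flaky(reruns=1)  # provided by pytest-rerunfailures
def test_cancel_cascade_under_forkserver():
    # a single rerun absorbs the ~0.3% cap-hit; a test that fails
    # twice in a row still reports as a genuine failure
    ...
```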

### D. Reduce `hard_kill`'s `terminate_after`

Drop from 1.6s → 0.8s. Cuts the worst-case cascade
time roughly in half. Risks: fewer subs get a chance
to run their cleanup before SIGKILL → more orphaned
state for the autouse reapers to handle (ironically,
adding back overhead elsewhere).

### E. Profile + targeted fix

Add `log.devx()` markers in `hard_kill` to time each
phase. Identify whether any subactor is consistently
hitting the 1.6s cap (vs. exiting in <0.1s). If so,
that sub has a teardown bug worth fixing at source.
Pros: actually fixes the underlying slowness. Cons:
real investigation work, deferred from this round.
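
A sketch of what that probe could look like (assuming `log` is the
module's tractor logger; `.devx()` is used only because this doc names
it, substitute `.debug()` if in doubt):

```python
from contextlib import contextmanager
import time

@contextmanager
def phase_timer(log, label: str):
    # wrap one hard_kill phase and emit its wall-clock duration
    start = time.perf_counter()
    try:
        yield
    finally:
        log.devx(f'{label} took {time.perf_counter() - start:.3f}s')

# e.g. around the graceful-cancel phase inside the per-sub teardown:
#
#     with phase_timer(log, 'graceful-cancel'):
#         await portal.cancel_actor()
```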

## Recommendation

Land this issue-doc as the tracker. Apply **(B)** as
a small follow-up — cheap and proportional. If it
still flakes, escalate to **(E)** with a `log.devx()`
profile-pass.

**(C)** is a backstop if **(B)** doesn't quite get there
and we need green CI faster than **(E)** can deliver.

## Verification protocol

After applying any mitigation:

```bash
# Run the suite N times back-to-back, count failures.
# A persistent failure on the SAME test == real bug.
# Failures rotating across tests == still cap-related.

for i in $(seq 1 5); do
    ./py313/bin/python -m pytest tests/ \
        --tpt-proto=tcp \
        --spawn-backend=main_thread_forkserver \
        -q 2>&1 | tail -2
done
```

Target: 0 failures across 5 runs ⇒ ship. 1–2 failures
still rotating ⇒ apply (C). Same test failing twice
⇒ escalate to (E).

## See also

- [#452](https://github.com/goodboy/tractor/issues/452) —
  UDS sock-file leak (related — `hard_kill`'s
  cleanup phase contributes to cascade time)
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
  — the upstream-trio fix that turned this from a
  100% hang into a 0.3% flake
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
  — the asyncio variant which contributes to one of
  the rotating failures
- `tractor/spawn/_spawn.py::hard_kill` — the SIGKILL
  cascade source
- `tractor/_testing/_reap.py::_track_orphaned_uds_per_test`,
  `_detect_runaway_subactors_per_test`,
  `_reap_orphaned_subactors` — autouse cleanup
  fixtures whose cumulative teardown overhead
  contributes to the cascade time