From 60ce7130163b870918bae18d9900556bed28aba2 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Mon, 4 May 2026 13:56:51 -0400
Subject: [PATCH] Add cancel-cascade `TooSlowError` flake analysis

Document the ~0.3% rotating `trio.TooSlowError` flake
under `--spawn-backend=main_thread_forkserver`
full-suite runs.

Root cause: `hard_kill`'s per-sub 1.6s graceful
timeout compounding across N subactors in a cancel
cascade, plus cumulative autouse-reaper teardown
overhead.

Covers symptom, observed flaking tests, root-cause
family, ranked mitigations (cap bump -> CPU-count-aware
cap -> `pytest-rerunfailures` -> `hard_kill` tuning ->
targeted profiling), and a verification protocol.

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...slow_under_main_thread_forkserver_issue.md | 202 ++++++++++++++++++
 1 file changed, 202 insertions(+)
 create mode 100644 ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md

diff --git a/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md b/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md
new file mode 100644
index 00000000..780cbb67
--- /dev/null
+++ b/ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md
@@ -0,0 +1,202 @@
# Cancel-cascade `trio.TooSlowError` flakes under `main_thread_forkserver`

## Symptom

Running the full test suite under

```bash
./py313/bin/python -m pytest tests/ \
  --tpt-proto=tcp \
  --spawn-backend=main_thread_forkserver
```

surfaces a single, **rotating** `trio.TooSlowError`
failure each run. The failure isn't deterministic on
test identity — a different test fails each run — but
it ALWAYS looks like:

```
FAILED tests/<file>::test_<name> - trio.TooSlowError
==== 1 failed, 373 passed, 17 skipped, 11–12 xfailed,
 0–1 xpassed, ~550 warnings in ~6min ====
```

Pass rate: **~99.7%** (373 of 374 non-skipped tests).
Wall-clock per full run: 5–6 min.

## Tests observed flaking so far

Each row was the SOLE failure in a separate run:

| run # | test |
|---|---|
| 1 | `tests/test_advanced_streaming.py::test_dynamic_pub_sub[KeyboardInterrupt]` |
| 2 | `tests/test_infected_asyncio.py::test_context_spawns_aio_task_that_errors[parent_actor_cancels_child=False]` |

Both share the same shape:

- **Cancel cascade** of N subactors back to a parent
  root actor.
- N ≥ `multiprocessing.cpu_count()` for
  `test_dynamic_pub_sub` (it spawns `cpus - 1`
  consumers + publisher + dynamic-consumer).
- N ≈ 2 for `test_context_spawns_aio_task_that_errors` —
  but each subactor is `infect_asyncio=True`, so each
  cancel involves the trio↔asyncio guest-run unwind,
  which is structurally heavier than pure-trio.
- Each test wraps the cascade in
  `trio.fail_after(fail_after_s)` and the cap fires
  before the cascade completes.

The exact failing test rotates because each test is
independently close to the cap; whichever happens to
be unlucky in scheduling/CPU contention on a given run
is the one that times out.

## Root-cause family

`hard_kill` (`tractor/spawn/_spawn.py:hard_kill`) runs
the SC-graceful teardown ladder per subactor:

1. `Portal.cancel_actor()` — graceful IPC cancel-req.
2. Wait `terminate_after=1.6s` for the sub to exit.
3. If still alive: `proc.kill()` (SIGKILL).
4. (NEW) `_unlink_uds_bind_addrs()` — post-mortem
   sock-file cleanup for UDS leaks (issue #452 fix).
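For orientation, here's a minimal sketch of that
ladder as plain trio code. This is NOT the actual
`tractor/spawn/_spawn.py` source: `portal` and the
`unlink_uds_bind_addrs()` stub are stand-ins for the
real objects/helpers.

```python
import trio

def unlink_uds_bind_addrs(proc) -> None:
    # stand-in for the post-mortem sock-file cleanup
    # (the issue #452 fix); intentionally a no-op here
    pass

async def hard_kill_sketch(
    portal,                # IPC handle to the subactor (stand-in)
    proc: trio.Process,    # the sub's OS-process handle
    terminate_after: float = 1.6,
) -> None:
    # 1. graceful IPC cancel-request
    await portal.cancel_actor()

    # 2. grace window: wait up to `terminate_after` for a clean
    #    exit; this is the per-sub cost that compounds across a
    #    cascade (worst case: N subs x 1.6s each)
    with trio.move_on_after(terminate_after) as cs:
        await proc.wait()

    # 3. still alive after the window -> SIGKILL, then reap
    if cs.cancelled_caught:
        proc.kill()
        await proc.wait()

    # 4. post-mortem UDS sock-file cleanup
    unlink_uds_bind_addrs(proc)
```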
For a cascade of N subactors, each pays steps 1–4. If
the graceful cancel doesn't complete within 1.6s for
ANY sub, that sub eats the full 1.6s `move_on_after`
window plus the post-SIGKILL `proc.wait()`.

Worst case under the fork backend with N=cpus subs:

- N × 1.6s = 16s+ on a 10-core box just for the
  graceful-timeout phase.
- Plus per-spawn fork-IPC handshake cost compounds
  during teardown (each sub's IPC cleanup goes through
  the same forkserver coordinator).
- Plus the new autouse fixtures
  (`_track_orphaned_uds_per_test`,
  `_detect_runaway_subactors_per_test`,
  `_reap_orphaned_subactors`) all run at test
  teardown, adding small (tens of ms) but cumulative
  overhead.

Current cap: 30s (`fail_after_s = 30 if
is_forking_spawner else 12`). Empirically that fits
the median run but the tail breaks ~0.3% of the time.

## NOT a regression

To confirm this is a flake and not a regression:

- Pre-`WakeupSocketpair`-patch baseline: tests
  HUNG INDEFINITELY (busy-loop never released).
- Post-patch: pass-or-fail-fast; ~99.7% pass, and the
  occasional cap-hit fails in bounded time (<60s for
  the offending test).
- The same tests PASS under `--spawn-backend=trio`
  (no fork, no hard-kill compounding).

So the suite is dramatically better than before; the
remaining flake is a known, tolerable steady state.

## Possible mitigations (ranked)

### A. Bump the cap further

Cheapest. Change the per-test `fail_after_s` from 30
to e.g. 60 for fork backends. Pros: trivial. Cons:
masks any genuine slowness regression we'd want to
catch.

### B. CPU-count-aware cap

For tests whose N scales with `cpu_count()`, scale
the cap too:

```python
from multiprocessing import cpu_count

fail_after_s = (
    max(30, cpu_count() * 3)  # 3s per actor, 30s floor
    if is_forking_spawner
    else 12
)
```

Pros: scales with the actual cancel-cascade work.
Cons: the multiplier is still arbitrary.

### C. `pytest-rerunfailures` for these tests only

Mark the known-flaky tests with
`@pytest.mark.flaky(reruns=1)` (needs the
`pytest-rerunfailures` dep). A single retry absorbs
the genuinely transient ~0.3% flakes.

Pros: no cap change, surfaces persistent failures
loudly. Cons: adds a dep, and retries can mask real
bugs if used widely.
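A minimal sketch of the opt-in marking (the decorator
is the plugin's documented API; the test body is
elided):

```python
# requires the `pytest-rerunfailures` plugin
import pytest

@pytest.mark.flaky(reruns=1)  # one retry; persistent failures still fail
def test_dynamic_pub_sub():
    # ... existing test body unchanged ...
    ...
```

Scoped to the two tests in the table above, never
suite-wide, so any newly-flaky test still fails
loudly on first occurrence.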
### D. Reduce `hard_kill`'s `terminate_after`

Drop from 1.6s → 0.8s. Cuts the worst-case cascade
time roughly in half. Risks: fewer subs get a chance
to run their cleanup before SIGKILL → more orphaned
state for the autouse reapers to handle (ironically
adding back overhead elsewhere).

### E. Profile + targeted fix

Add `log.devx()` markers in `hard_kill` to time each
phase (a phase-timing sketch appears at the end of
this doc). Identify whether any subactor is
consistently hitting the 1.6s cap (vs. exiting in
<0.1s). If so, that sub has a teardown bug worth
fixing at source. Pros: actually fixes the underlying
slowness. Cons: real investigation work, deferred
from this round.

## Recommendation

Land this issue-doc as the tracker. Apply **(B)** as
a small follow-up — cheap and proportional. If it
still flakes, escalate to **(E)** with a `log.devx()`
profile-pass.

**(C)** is a backstop if **(B)** doesn't quite get
there and we need green CI faster than **(E)** can
deliver.

## Verification protocol

After applying any mitigation:

```bash
# Run the suite N times back-to-back, count failures.
# A persistent failure on the SAME test == real bug.
# Failures rotating across tests == still cap-related.

for i in $(seq 1 5); do
  echo "== run $i =="
  ./py313/bin/python -m pytest tests/ \
    --tpt-proto=tcp \
    --spawn-backend=main_thread_forkserver \
    -q 2>&1 | tail -2
done
```

Target: 0 failures across 5 runs ⇒ ship. 1–2 failures
still rotating ⇒ apply **(C)**. Same test failing
twice ⇒ escalate to **(E)**.

## See also

- [#452](https://github.com/goodboy/tractor/issues/452) —
  UDS sock-file leak (related — `hard_kill`'s
  cleanup phase contributes to cascade time)
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
  — the upstream-trio fix that turned this from a
  100% hang into a 0.3% flake
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
  — the asyncio variant which contributes to one of
  the rotating failures
- `tractor/spawn/_spawn.py::hard_kill` — the SIGKILL
  cascade source
- `tractor/_testing/_reap.py::_track_orphaned_uds_per_test`,
  `_detect_runaway_subactors_per_test`,
  `_reap_orphaned_subactors` — autouse cleanup
  fixtures whose cumulative teardown overhead
  contributes to the cascade time
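## Appendix: phase-timing sketch for (E)

A minimal sketch of the instrumentation proposed in
**(E)**, assuming `log.devx()` accepts a plain
message string; `PhaseTimer` is a hypothetical
helper, not existing tractor code.

```python
import time

class PhaseTimer:
    # log how long one `hard_kill` teardown phase took
    def __init__(self, log, phase: str):
        self._log = log
        self._phase = phase

    def __enter__(self):
        self._t0 = time.monotonic()
        return self

    def __exit__(self, *exc):
        self._log.devx(
            f'hard_kill: phase {self._phase!r} took '
            f'{time.monotonic() - self._t0:.3f}s'
        )

# intended use inside `hard_kill`, one scope per ladder step:
#
#   with PhaseTimer(log, 'graceful-cancel'):
#       await portal.cancel_actor()
#
#   with PhaseTimer(log, 'grace-wait'):
#       with trio.move_on_after(terminate_after):
#           await proc.wait()
```

A sub that consistently logs ~1.6s for 'grace-wait'
is the one worth fixing at source.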