tractor/tests/trionics/test_patches.py

100 lines
2.9 KiB
Python
Raw Normal View History

Add `tractor.trionics.patches` subpkg + first fix With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which can busy-loop due to lack of handling `EOF`. New `tractor.trionics.patches` subpkg housing defensive monkey-patches for upstream `trio` bugs we've encountered while running `tractor` — particularly as of recent, fork-survival edge cases that haven't been filed/fixed upstream yet. Each patch is idempotent, version-gated via `is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the upstream release whose adoption allows deletion. Subpkg layout + per-patch contract documented in `tractor/trionics/patches/README.md` — `apply()` / `is_needed()` / `repro()` API, registry pattern via `_PATCHES` in `__init__.py`, single-call entry point `apply_all()`. First patch, `_wakeup_socketpair`: - `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN). - under `fork()`-spawning backends the COW-inherited socketpair fds & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair` instance whose write-end is closed, and `drain()` then **spins forever in C with no Python checkpoints**, - this obviously burns 100% CPU and no signal delivery. Standalone repro: from trio._core._wakeup_socketpair import WakeupSocketpair ws = WakeupSocketpair() ws.write_sock.close() ws.drain() # spins forever Patch is one-line — break the drain loop on b'' EOF. Manifested as two distinct test failures: - `tests/test_multi_program.py::test_register_duplicate_name` hung at 100% CPU on the busy-loop directly (fork child's worker thread) - `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`, both threads parked in `epoll_wait`, no TCP connect-back to parent ever happened. Same patch fixes both. Restored 99.7% pass rate on full suite under `--spawn-backend=main_thread_forkserver` (was hanging indefinitely before). Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE any trio runtime init. Harmless on non-fork backends. Conc-anal write-ups, including strace + py-spy evidence: - `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md` - `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md` Regression tests in `tests/trionics/test_patches.py`: each test asserts (a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b) the patch fixes it with a SIGALRM wall-clock cap so a regression hangs loud instead of silently. TODO: - [ ] file the upstream `python-trio/trio` issue + PR. - [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue body's evidence section. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 16:03:57 +00:00
'''
Regression tests for `tractor.trionics.patches`
defensive monkey-patches on upstream `trio` bugs.
Each test asserts:
1. The bug exists (or is gone skip cleanly if
upstream shipped the fix and our `is_needed()` now
returns `False`).
2. Our patch fixes it (post-`apply()` the `repro()`
returns cleanly within a tight wall-clock cap).
Wall-clock caps are critical here the bugs we patch
are tight-loops or deadlocks, so a regression would
HANG the test runner unless we hard-cap each
`repro()` call.
'''
import signal
import pytest
from tractor.trionics import patches
from tractor.trionics.patches import _wakeup_socketpair as wsp
@pytest.fixture(autouse=True)
def _alarm_cleanup():
'''
Ensure no leftover SIGALRM survives a test failure
or unexpected return.
'''
yield
signal.alarm(0)
def test_wakeup_socketpair_drain_eof_patch_works():
'''
Without the patch, `WakeupSocketpair.drain()` on a
socketpair whose write-end has been closed spins
forever. With the patch applied, it returns
cleanly within milliseconds.
Wall-clock cap: 2s. If the patch regresses, SIGALRM
fires and the test hard-fails with a clear signal
instead of hanging CI indefinitely.
'''
if not wsp.is_needed():
pytest.skip(
'upstream trio shipped the fix — '
'patch no longer needed for trio '
'(see `is_needed()` for version gate)'
)
# Apply the patch.
applied: bool = wsp.apply()
# First call MUST return True; idempotent guard
# prevents False on subsequent calls within the
# same process.
assert applied is True or applied is False # idempotent
# Cap wall-clock at 2s; SIGALRM raises in main
# thread which interrupts the C-level recv loop
# IF the patch regresses (since `signal.alarm`
# uses Python's signal-wakeup-fd which the patch
# itself relies on... but `repro()` runs OUTSIDE
# a trio.run, so it's plain stdlib semantics here
# — alarm WILL fire during `recv` syscall).
signal.alarm(2)
wsp.repro()
signal.alarm(0)
def test_apply_all_idempotent():
'''
Calling `apply_all()` twice should not double-
apply: second call's dict has all-False values
(every patch reports "already applied").
'''
first: dict[str, bool] = patches.apply_all()
second: dict[str, bool] = patches.apply_all()
# Second call: every patch reports skipped.
assert all(v is False for v in second.values()), (
f'apply_all() not idempotent: {second}'
)
# First call: at least one patch was applied
# (or all are no-ops because `is_needed()` is
# False everywhere — the all-fixed-upstream future
# state which is also valid).
assert isinstance(first, dict)
for name, applied in first.items():
assert isinstance(applied, bool), (
f'patch {name!r} returned non-bool: {applied!r}'
)