Add `tractor.trionics.patches` subpkg + first fix
Includes a first patch for `trio`'s `WakeupSocketpair.drain()`, which
can busy-loop because it never handles `EOF`.
New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`,
most recently fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.
Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.
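A minimal sketch of how such a registry could hang together, assuming only what is stated above (`apply()` / `is_needed()` per patch, a `_PATCHES` registry, an `apply_all()` entry point returning a `dict[str, bool]`). The `Patch` container, `_make_patch()` helper, `_applied` set and the patch names are hypothetical stand-ins for illustration, not the subpkg's actual code:

```python
from typing import Callable, NamedTuple


class Patch(NamedTuple):
    # Hypothetical container; the real subpkg keeps one module per patch.
    is_needed: Callable[[], bool]
    apply: Callable[[], bool]


# Names of patches already applied in this process (hypothetical guard state).
_applied: set[str] = set()


def _make_patch(name: str, needed: bool) -> Patch:
    def is_needed() -> bool:
        # Stand-in for a real version gate against the installed `trio`.
        return needed

    def apply() -> bool:
        # Idempotent: return True only when THIS call did the patching.
        if name in _applied or not is_needed():
            return False
        _applied.add(name)  # stand-in for the actual monkey-patch
        return True

    return Patch(is_needed, apply)


# Registry mapping patch name -> patch (names are illustrative).
_PATCHES: dict[str, Patch] = {
    'wakeup_socketpair': _make_patch('wakeup_socketpair', needed=True),
    'fixed_upstream': _make_patch('fixed_upstream', needed=False),
}


def apply_all() -> dict[str, bool]:
    # Single-call entry point: apply every registered patch once.
    return {name: patch.apply() for name, patch in _PATCHES.items()}


first = apply_all()
second = apply_all()
```

Under these assumptions the first call reports `True` only for patches that were both needed and newly applied, and a second call reports `False` for every entry.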
First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
& `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
instance whose write-end is closed, and `drain()` then **spins forever
in C with no Python checkpoints**.
- this burns 100% CPU and blocks signal delivery.
Standalone repro:
    from trio._core._wakeup_socketpair import WakeupSocketpair
    ws = WakeupSocketpair()
    ws.write_sock.close()
    ws.drain()  # spins forever
The patch is a one-liner: break the drain loop on `b''` (EOF).
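To illustrate the fix outside of `trio`, here is a minimal sketch using only stdlib sockets. The `drain()` helper, its `break_on_eof` flag and the `max_iters` safety cap are hypothetical names for this illustration, not `trio` or `tractor` APIs; the loop shape only mirrors the behaviour described above:

```python
import socket


def drain(
    sock: socket.socket,
    break_on_eof: bool,
    max_iters: int = 1000,
) -> int:
    '''
    Drain a non-blocking socket and return how many `recv()` calls ran.

    `max_iters` caps the loop so the unpatched variant can be
    demonstrated without actually spinning forever.

    '''
    calls = 0
    try:
        while calls < max_iters:
            calls += 1
            data = sock.recv(2**16)
            if break_on_eof and not data:
                # Peer closed: `recv()` will keep returning b'' forever.
                break
    except BlockingIOError:
        # Buffer empty: the only exit the unpatched loop has.
        pass
    return calls


reader, writer = socket.socketpair()
reader.setblocking(False)
writer.close()  # reader now sees EOF on every `recv()`

unpatched_calls = drain(reader, break_on_eof=False)  # runs until the cap
patched_calls = drain(reader, break_on_eof=True)  # exits on first b''
reader.close()
```

Without the EOF check the loop only stops because of the artificial cap; with it, a single `recv()` is enough.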
Manifested as two distinct test failures:
- `tests/test_multi_program.py::test_register_duplicate_name` hung at
100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
both threads parked in `epoll_wait`, no TCP connect-back to parent
ever happened.
Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).
Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.
Conc-anal write-ups, including strace + py-spy evidence:
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it, under a SIGALRM wall-clock cap so that a regression
fails loudly instead of hanging silently.
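One way such a wall-clock cap could be sketched is shown here. Unlike a bare `signal.alarm()` call, this variant installs its own handler, because with no handler installed the default action for SIGALRM is to terminate the process. The `wall_clock_cap()` helper, `WallClockExceeded` and `spin_forever()` are hypothetical names, and the sketch assumes a POSIX platform and the main thread:

```python
import signal
from contextlib import contextmanager


class WallClockExceeded(Exception):
    '''Raised when the guarded block runs past its cap.'''


@contextmanager
def wall_clock_cap(seconds: int):
    def _on_alarm(signum, frame):
        raise WallClockExceeded(f'exceeded {seconds}s wall-clock cap')

    # Install our handler, remembering the previous one for restoration.
    prev = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # always cancel any pending alarm
        signal.signal(signal.SIGALRM, prev)


def spin_forever() -> None:
    # Stand-in for a regressed `repro()` that never returns.
    while True:
        pass


timed_out = False
try:
    with wall_clock_cap(1):
        spin_forever()
except WallClockExceeded:
    timed_out = True
```

The handler turns a would-be hang into an ordinary exception, so the test runner can report a failure rather than dying or stalling.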
TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the issue
body's evidence section.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0ef549fadb6b95d717457301f3470305dee1f01a)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
2026-06-10 00:23:26 +00:00
'''
Regression tests for `tractor.trionics.patches`:
defensive monkey-patches on upstream `trio` bugs.

Each test asserts:

1. The bug exists (or is gone; skip cleanly if
   upstream shipped the fix and our `is_needed()` now
   returns `False`).
2. Our patch fixes it (post-`apply()` the `repro()`
   returns cleanly within a tight wall-clock cap).

Wall-clock caps are critical here: the bugs we patch
are tight-loops or deadlocks, so a regression would
HANG the test runner unless we hard-cap each
`repro()` call.

'''
import signal

import pytest

from tractor.trionics import patches
from tractor.trionics.patches import _wakeup_socketpair as wsp


@pytest.fixture(autouse=True)
def _alarm_cleanup():
    '''
    Ensure no leftover SIGALRM survives a test failure
    or unexpected return.

    '''
    yield
    signal.alarm(0)


def test_wakeup_socketpair_drain_eof_patch_works():
    '''
    Without the patch, `WakeupSocketpair.drain()` on a
    socketpair whose write-end has been closed spins
    forever. With the patch applied, it returns
    cleanly within milliseconds.

    Wall-clock cap: 2s. If the patch regresses, SIGALRM
    fires and the test hard-fails with a clear signal
    instead of hanging CI indefinitely.

    '''
    if not wsp.is_needed():
        pytest.skip(
            'upstream trio shipped the fix; '
            'patch no longer needed for trio '
            '(see `is_needed()` for version gate)'
        )

    # Apply the patch.
    applied: bool = wsp.apply()
    # `apply()` returns True only on the first call in a
    # process; the idempotent guard makes later calls
    # return False. Test ordering decides which value we
    # see here, so only the type is checked.
    assert isinstance(applied, bool)

    # Cap wall-clock at 2s. `repro()` runs OUTSIDE a
    # `trio.run`, so plain stdlib signal semantics
    # apply here: if the patch regresses, SIGALRM
    # fires in the main thread and interrupts the
    # otherwise endless `recv` loop.
    signal.alarm(2)
    wsp.repro()
    signal.alarm(0)


def test_apply_all_idempotent():
    '''
    Calling `apply_all()` twice should not
    double-apply: the second call's dict has all-False
    values (every patch reports "already applied").

    '''
    first: dict[str, bool] = patches.apply_all()
    second: dict[str, bool] = patches.apply_all()

    # Second call: every patch reports skipped.
    assert all(v is False for v in second.values()), (
        f'apply_all() not idempotent: {second}'
    )

    # First call: at least one patch was applied
    # (or all are no-ops because `is_needed()` is
    # False everywhere, the all-fixed-upstream future
    # state, which is also valid).
    assert isinstance(first, dict)
    for name, applied in first.items():
        assert isinstance(applied, bool), (
            f'patch {name!r} returned non-bool: {applied!r}'
        )