Commit 0ef549fadb by Gud Boi

Add `tractor.trionics.patches` subpkg + first fix
Ships with an initial patch fixing `trio`'s `WakeupSocketpair.drain()`,
which can busy-loop because it never handles `EOF`.

New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— most recently, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.

Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.

First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
  ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
  & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
  instance whose write-end is closed, and `drain()` then **spins forever
  in C with no Python checkpoints**,
- the result: 100% CPU burn and no signal delivery.

Standalone repro:

    from trio._core._wakeup_socketpair import WakeupSocketpair
    ws = WakeupSocketpair()
    ws.write_sock.close()
    ws.drain()  # spins forever

Patch is one-line — break the drain loop on b'' EOF.
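As a rough model of that fix, a standalone sketch using a plain
socketpair rather than trio's actual `drain()` source (the loop shape
mirrors the description above; names here are illustrative):

```python
import socket

def drain(sock: socket.socket) -> None:
    # Drain pending wakeup bytes from a non-blocking socket.
    try:
        while True:
            if sock.recv(2**16) == b"":
                break  # EOF: peer closed (FIN); the one-line fix
    except BlockingIOError:
        pass  # nothing left to read; the ONLY exit pre-fix

# The previously pathological case: write end already closed.
read_end, write_end = socket.socketpair()
read_end.setblocking(False)
write_end.close()   # recv() on read_end now returns b'' forever
drain(read_end)     # returns promptly instead of busy-looping
read_end.close()
```

Without the `b''` check, the closed-peer case never raises
`BlockingIOError`, so the loop spins with no exit condition.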

Manifested as two distinct test failures:

- `tests/test_multi_program.py::test_register_duplicate_name` hung at
  100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
  deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
  both threads parked in `epoll_wait`, no TCP connect-back to parent
  ever happened.

The same patch fixes both, restoring a 99.7% pass rate on the full
suite under `--spawn-backend=main_thread_forkserver` (which previously
hung indefinitely).

Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.

Conc-anal write-ups, including strace + py-spy evidence:

- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`

Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) that the bug exists pre-patch (or is already fixed upstream, in
which case it skips cleanly), and (b) that the patch fixes it, under a
SIGALRM wall-clock cap so a regression fails loudly instead of hanging
silently.

TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the
      issue body's evidence section.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-04 12:18:03 -04:00