Commit Graph

2672 Commits (255c9c3a7c4cb3fefda677bf90b8527a981d9524)

Author SHA1 Message Date
Gud Boi 57dae0e4a6 Split forkserver backend into variant 1/2 mods
The `subint_forkserver` name was always aspirational —
today's impl forks from a regular main-interp worker
thread and the child runs trio on its own main interp;
NO subinterp anywhere in parent or child. Splitting the
backend into two clearly-named variants drops the lie:

- **variant 1** — `main_thread_forkserver` (the working
  impl). New `SpawnMethodKey` literal + `_methods`
  dispatch entry + `_runtime.Actor._from_parent()`
  match-arm. The spawn-coro `subint_forkserver_proc`
  moves to `_main_thread_forkserver` and is renamed
  `main_thread_forkserver_proc()`.

- **variant 2** — `subint_forkserver` (future, reserved).
  Module shrinks to a placeholder describing the
  variant-2 design (subint-isolated child runtime, gated
  on jcrist/msgspec#1026 + PEP 684). Today the legacy
  `'subint_forkserver'` key aliases to
  `main_thread_forkserver_proc` so existing
  `--spawn-backend=subint_forkserver` invocations keep
  working; flipped to a `NotImplementedError` stub in a
  follow-up.

Deats,
- `Actor._from_parent()` spawn-method gate now accepts
  both `'main_thread_forkserver'` and
  `'subint_forkserver'` (both go through the
  IPC-`SpawnSpec` path).
- the variant-1 spawn-coro stamps its own `SpawnSpec` /
  log lines with `spawn_method='main_thread_forkserver'`
  so subactor renders reflect the actual mechanism.
- docstring reorg: trio×fork hazard breakdown, POSIX
  fork-survival semantics, in-process-vs-stdlib
  forkserver design notes, and the TODO/cleanup section
  all move from `_subint_forkserver` to
  `_main_thread_forkserver` (lives with the working
  code). `_subint_forkserver` keeps a tight forward-
  looking doc that motivates the reserved key.
- `run_subint_in_worker_thread()` stays in
  `_subint_forkserver` as the companion primitive — it's
  the subint counterpart to `fork_from_worker_thread()`
  and will plug into the future variant-2 spawn-coro.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:28:11 -04:00
Gud Boi 99dade0fb3 Extract fork primitives into `_main_thread_forkserver`
Move the truly-generic main-interp-worker-thread fork primitives
(`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`,
`wait_child`, `_format_child_exit`) out of `_subint_forkserver.py` into
a sibling `_main_thread_forkserver.py` module so the primitive layer is
honestly named — none of these helpers touch a subint, they just fork
from a main-interp worker thread.

`_subint_forkserver.py` keeps its public surface intact via re-export so
any existing `from tractor.spawn._subint_forkserver import ...` callsite
still resolves.

Net: zero behavior change, preps the way for the upcoming spawn-method
key split where `main_thread_forkserver` ships as the working backend
and `subint_forkserver` becomes reserved for the future
subint-isolated-child variant (gated on jcrist/msgspec#1026).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:04:26 -04:00
Gud Boi 4b5176e2c3 Doc future-subint payoffs for `_subint_forkserver`
Adds a "Future arch — what subints would buy us" section to
the module docstring, complementing the prior commit's
current-state rationale. Code is unchanged.

Frames the `subint` prefix as family-naming today (no actual
subinterp is created yet), then lays out the three concrete
wins that land once jcrist/msgspec#1026 unblocks PEP 684
isolated-mode subints:

- Cheaper forks — moving the parent's `trio.run()` into a
  subint shrinks the main-interp COW image the child inherits.
  The main interp becomes the literal forkserver: an
  intentionally-empty execution ctx whose only job is to call
  `os.fork()` cleanly.

- True parallelism — per-interp GIL means the forkserver
  thread on main and the trio thread on subint actually run in
  parallel. Spawn latency stops stalling the trio loop.

- Multi-actor-per-process — the architectural payoff. With
  per-interp-GIL subints, one process can host main + N
  subint-resident actor `trio.run()`s, and `os.fork()` reverts
  to the last-resort spawn (only when OS-level isolation is
  actually needed). Joins the story with the in-thread
  `_subint.py` backend: `subint` → in-process spawn,
  `subint_forkserver` → cross-process when a real OS boundary
  is required.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 18:20:10 -04:00
Gud Boi 3ab99d557a Doc `_subint_forkserver` design + fork semantics
Major expansion of the module docstring. Code is
unchanged; this lands the architectural reasoning that
was previously implicit, plus the POSIX/trio fork
mechanics the design relies on.

New sections:
- "Design rationale" — answers two implicit questions:
  (1) why a forkserver pattern at all (vs. forking
  directly from a trio task), (2) why in-process (vs.
  stdlib `mp.forkserver`'s sidecar process). Documents
  the three costs the in-process design avoids
  (sidecar lifecycle, per-spawn IPC, cold-start child)
  and the tradeoffs we accept in exchange (3.14-only,
  heavier than `to_thread.run_sync`).
- "Implementation status" — clarifies what's actually
  landed today vs. the envisioned arch: parent's
  `trio.run()` still lives on main interp (subint-
  hosted root gated on jcrist/msgspec#1026). Names
  why the "subint" prefix is correct anyway — same PR
  series as `_subint.py` / `_subint_fork.py`.
- "What survives the fork? — POSIX semantics" — POSIX
  preserves only the calling thread, so the
  `trio.run()` thread is gone in the child. Includes
  a small parent/child thread-survival table and
  covers the four artifact classes that DO cross the
  fork boundary (inherited fds, COW memory, Python
  thread state, user-level locks) and how each is
  handled.
- "FYI: how this dodges the `trio.run()` × `fork()`
  hazards" — itemizes each class of trio process-
  global state (wakeup-fd, `epoll`/`kqueue`,
  threadpool, cancel scopes / nurseries, `atexit`,
  foreign-language I/O) and explains how the
  forkserver-thread design avoids each.

Also,
- bump the gated msgspec issue link from
  `jcrist/msgspec#563` to `jcrist/msgspec#1026` (the
  PEP 684 isolated-mode tracker).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 18:16:50 -04:00
Gud Boi 54561959e6 Log subint bootstrap excs + cancel-leak state
Two diagnostic gaps in `tractor.spawn._subint.subint_proc()` that hid
otherwise-silent failures, plus tracking-issue links on the two open
`subint_forkserver` follow-ups.

Deats,
- bootstrap-exc visibility: wrap the call to
  `_interpreters.exec(interp_id, bootstrap)` with
  `try/except BaseException` + `log.exception(...)`.
  * Without it, an `ImportError` / `SyntaxError` raised inside the
    dedicated driver thread goes only to Python's default thread
    excepthook — invisible to the parent, which then waits forever on
    `subint_exited.wait()`.
  * `?TODO` notes `anyio`'s `to_interpreter._interp_call` +
    `(retval, is_exception)` pattern as the next step for re-raising;
    skipped now bc it must coordinate with the `trio.Cancelled` paths
    around the existing `.wait()` calls.

- cancel-leak disambiguation: when the driver thread doesn't exit within
  `_HARD_KILL_TIMEOUT`, also log `_interpreters.is_running(interp_id)`
  as `subint_still_running=...` so the operator can tell "thread leaked,
  subint already done" apart from "thread alive bc subint is wedged".
  * pattern borrowed from `trio-parallel`'s `_sint.SintWorker.is_alive()`.

- `?TODO` near the `bootstrap` literal: future switch to
  `_interpreters.set___main___attrs()` — same API `anyio`
  uses in `to_interpreter._Worker.call()` — for passing
  non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables,
  etc).
  * add cross-refs tracking issue `#379`.

Also,
- `Tracked at: [#449]` link on
  `subint_forkserver_test_cancellation_leak_issue.md`.
- `Tracked at: [#450]` link on
  `subint_forkserver_thread_constraints_on_pep684_issue.md`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 15:57:55 -04:00
Gud Boi 66f1941f46 Wire `reg_addr` into `test_context_stream_semantics`
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.

Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 13:52:28 -04:00
Gud Boi 9b05f659b3 Wire `test_dynamic_pub_sub` to standard fixtures
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:

- pass `registry_addrs=[reg_addr]` + `debug_mode` into
  `tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
  `test_log.exception('Timed out AS EXPECTED')` so the
  expected timeout is logged explicitly instead of
  swallowed.

Also,
- drop whitespace-only blank lines around the
  `subs` param of `consumer()` and `ctx` param of
  `one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
  docstring to multi-line form.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 12:59:00 -04:00
Gud Boi 65fcfbf224 Bump `test_stale_entry_is_deleted`'s timeout to 30
Seems that when run in-suite it delays more then the so-measured "happy
path" timing; better to have no suite-global interruption then asserting
a fast single test's run.
2026-04-27 11:46:45 -04:00
Gud Boi 4f12d69b41 Add `--shm` orphan sweep to `tractor-reap`
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.

Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
  + `reap_shm()` helpers. Match criteria: regular file
  under `/dev/shm`, owned by current uid, AND no live
  proc has it open (mmap'd or fd-held). In-use
  enumeration via `psutil.Process.memory_maps()` +
  `.open_files()` — xplatform, kernel-canonical (same
  answer `lsof` would give), no reliance on
  tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
  `NotImplementedError` outside Linux/FreeBSD bc macOS
  POSIX shm has no fs-visible path (`shm_open` only)
  and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
  process reap) and `--shm-only` (skip process phase)
  flags. `-n` dry-runs both phases. Exit code is `1`
  if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
  the `testing` dep group; lazy-imported in `_reap.py`
  so the process-reap path stays import-clean without
  it.

Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
  (new section 10c) — covers match criteria + the
  preservation guarantee for unrelated apps.
- flip mitigation status in
  `subint_forkserver_mp_shared_memory_issue.md` from
  "could extend `tractor-reap`" to "implemented", with
  a note that callers should still UUID-pin shm keys to
  avoid cross-session collisions.

Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 11:35:33 -04:00
Gud Boi aa3e230926 Fix `SharedMemory` under `subint_forkserver`
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.

Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
  `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
  now run unconditionally:
  * monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
    `ManTracker` subclass (empty `register` / `unregister`
    / `ensure_running`).
  * return `partial(SharedMemory, track=False)` for the per-allocation
    opt-out.
  * belt + suspenders: even if something dodges the wrapper, the
    singleton can't talk to the inherited (broken) parent fd.

- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
  skip of the unlink-callback; install a `try_unlink()` wrapper that
  swallows `FileNotFoundError` (sibling-already-cleaned race in
  shared-key setups). Without `mp.resource_tracker` doing it for us, we
  own the unlink — `actor. lifetime_stack` is the right place since
  tractor already controls actor lifecycle.

- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
  module-level skip- list (tests pass now). Inline comment cross-refs
  the two `_mp_bs` / `_shm` workarounds.

- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
  rewrite — flips status from "open / unresolvable in tractor" to
  "resolved, kept as decision record". Adds Resolution section, "Why
  this is the right call" rationale (mp tracker is widely criticized;
  tractor already owns lifecycle), trade-offs (crash-leaked segments,
  lost mp leak warning), verification (7 passed under both
  `subint_forkserver` and `trio` backends), and upstream issue links

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 10:51:28 -04:00
Gud Boi c99d475d03 Document `SharedMemory` × `subint_forkserver` incompat
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).

Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
  descriptions + link to the analysis doc

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-26 20:13:24 -04:00
Gud Boi 6d76b60404 Add `tractor-reap` CLI + document auto-reap
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:

- orphan-mode (default): finds PPid==1 procs
  with cwd matching repo root + `python` in
  cmdline.
- descendant-mode (`--parent <pid>`): scoped
  sweep under a still-live supervisor.

SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).

Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-26 18:04:40 -04:00
Gud Boi eae478f3d5 Add `_testing._reap` + auto-reap fixture
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.

Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
  `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
  - `find_descendants(parent_pid)` for the in-session case
    (PPid-direct-match while pytest is still alive).
  - `find_orphans(repo_root)` for the CLI / post- mortem case (`PPid==1`
    reparented to init + `cwd` filter to repo root + `python` cmdline
    filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
  all, poll up to `grace` for exit, SIGKILL any survivors. Returns
  `(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
  `tractor/_testing/pytest.py` — after `yield`, runs
  `find_descendants(os.getpid())` + `reap(...)` so each pytest session
  leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
  commit) for the pytest-died-mid-session case where the in-session
  fixture didn't get to run.

Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
  module-top in `pytest.py` (was inline-imported inside
  `pytest_generate_tests`), and reuse it in
  `pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
  mark arg is a valid spawn-method literal — catches typos at collection
  time.
- inline `# ?TODO` flags running these through the `try_set_backend`
  checker for stronger validation.

Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-25 00:05:58 -04:00
Gud Boi 44bdb1697c Tighten orphan-SIGINT xfail to `strict=True`
Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from
flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in
full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT
hang as the documented cause. The previous framing ("env pollution") let
the test silently pass when ordering happened to favor it; the new
framing forces an XPASS-as-FAIL the moment the underlying gap is
actually closed, so we can drop the mark intentionally instead of
accidentally.

Reason text + leading `# Known-gap test —` comment both point at
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
for the full diagnosis.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-24 22:48:35 -04:00
Gud Boi 2ca0f41e61 Skip `test_loglevel_propagated_to_subactor` on subint forkserver too 2026-04-24 21:47:46 -04:00
Gud Boi b350aa09ee Wire `reg_addr` through infected-asyncio tests
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.

Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
  calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
  - add `reg_addr`, `debug_mode`, `start_method`
    fixture params
  - `delay` now reads the `debug_mode` param directly
    instead of calling `tractor.debug_mode()` (fires
    slightly earlier in the test lifecycle)
  - sanity assert `if debug_mode: assert
    tractor.debug_mode()` after nursery open
  - new print showing SIGINT target
    (`send_sigint_to` + resolved pid)
  - catch `trio.TooSlowError` around
    `ctx.wait_for_result()` and conditionally
    `pytest.xfail` when `send_sigint_to == 'child'
    and start_method == 'subint_forkserver'` — the
    known orphan-SIGINT limitation tracked in
    `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-24 20:26:25 -04:00
Gud Boi d6e70e9de4 Import-or-skip `.devx.` tests requiring `greenback`
Which is for sure true on py3.14+ rn since `greenlet` didn't want to
build for us (yet).
2026-04-24 17:39:13 -04:00
Gud Boi 4c133ab541 Default `pytest` to use `--capture=sys`
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).

Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
  `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
  capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
  `sys.stderr`).
- KEPT: `pytest -s` debugging behavior.

This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.

Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
  rationale comment cross-ref'ing the post-mortem doc

- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
  from `test_nested_ multierrors` — no longer needed.
  * file-level `pytestmark` covers any residual.

- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
  mark loosened from `strict=True` to `strict=False` + reason rewritten.
  * it passes in isolation but is session-env-pollution sensitive
    (leftover subactor PIDs competing for ports / inheriting harness
    FDs).
  * tolerate both outcomes until suite isolation improves.

- `test_shm`: extend the existing
  `skipon_spawn_backend('subint', ...)` to also skip
  `'subint_forkserver'`.
  * Different root cause from the cancel-cascade class:
    `multiprocessing.SharedMemory`'s `resource_tracker` + internals
    assume fresh- process state, don't survive fork-without-exec cleanly

- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
  (unrelated to forkserver; just a flaky-under-load bump).

- `tractor.spawn._subint_forkserver`: inline comment-only future-work
  marker right before `_actor_child_main()` describing the planned
  conditional stdout/stderr-to-`/dev/null` redirect for cases where
  `--capture=sys` isn't enough (no code change — the redirect logic
  itself is deferred).

EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum- invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-24 14:17:23 -04:00
Gud Boi 4106ba73ea Codify capture-pipe hang lesson in skills
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.

Deats,
- `.claude/skills/conc-anal/SKILL.md`:
  - new "Unbounded waits in cleanup paths"
    section — rule: bound every `await X.wait()`
    in cleanup paths with `trio.move_on_after()`
    unless the setter is unconditionally
    reachable. Recent example:
    `ipc_server.wait_for_no_more_peers()` in
    `async_main`'s finally (was unbounded,
    deadlocked when any peer handler stuck)
  - new "The capture-pipe-fill hang pattern"
    section — mechanism, grep-pointers to the
    existing `conftest.py` guards (`tests/conftest
    .py:258`, `:316`), cross-ref to the full
    post-mortem doc, and the grep-note: "if a
    multi-subproc tractor test hangs, `pytest -s`
    first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
  "Section 9: The pytest-capture hang pattern
  (CHECK THIS FIRST)" with symptom / cause /
  pre-existing guards to grep / three-step debug
  recipe (try `-s`, lower loglevel, redirect
  stdout/stderr) / signature of this bug vs. a
  real code hang / historical reference

Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 23:22:40 -04:00
Gud Boi eceed29d4a Pin forkserver hang to pytest `--capture=fd`
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.

Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.

Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.

Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.

Deats,
- `subint_forkserver_test_cancellation_leak_issue
  .md`: new section "Update — VERY late: pytest
  capture pipe IS the final gate" w/ DIAG timeline
  showing `trio.run` fully returns, diagnosis of
  pipe-fill mechanism, retrospective on the
  earlier wrong ruling-out, and fix direction
  (redirect subactor stdout/stderr to `/dev/null`
  in fork-child prelude, conditional on
  pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
  rewritten to describe the capture-pipe gate
  specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
  orphan-SIGINT test regresses back to xfail.
  Previously passed after the FD-hygiene fix,
  but the new `wait_for_no_more_peers(
  move_on_after=3.0)` bound in `async_main`'s
  teardown added up to 3s latency, pushing
  orphan-subactor exit past the test's 10s poll
  window. Real fix: faster orphan-side teardown
  OR extend poll window to 15s

No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 23:18:14 -04:00
Gud Boi e312a68d8a Bound peer-clear wait in `async_main` finally
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 22:34:49 -04:00
Gud Boi 4d0555435b Narrow forkserver hang to `async_main` outer tn
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` / `child_
target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.

Fresh-run results,
- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` → `child_target
  RETURNED rc=0` → `os._exit(0)`. These are tree
  LEAVES (errorers) plus their direct parents
  (depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_
  main)`**: hit "about to trio.run" but never
  see "trio.run RETURNED NORMALLY". These are
  root + top-level spawners + one intermediate

The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:

    trio.run never returns
      → _trio_main finally never runs
        → _worker never reaches os._exit(rc)
          → process never dies
            → parent's _ForkedProc.wait() blocks
              → parent's nursery hangs
                → parent's async_main hangs
                  → (recurse up)

The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
   in `root_tn` — but we cancel it via
   `_parent_chan_cs.cancel()` in `Actor.cancel()`,
   which only runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — which
   doesn't happen. So the shield isn't broken in
   this path.
2. `actor_nursery._join_procs.wait()` or similar
   inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
   exit — but pidfd_open watch didn't fire (race
   between `pidfd_open` and the child exiting?).

Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.

Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 21:36:19 -04:00
Gud Boi ab86f7613d Refine `subint_forkserver` cancel-cascade diag
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.

Ruled out,
- **capture-pipe fill** (`-s` flag changes test):
  retested explicitly — `test_nested_multierrors`
  hangs identically with and without `-s`. The
  earlier observation was likely a competing
  pytest process I had running in another session
  holding registry state
- **stuck peer-chan recv that cancel can't
  break**: pivot from the prior pass. With
  `handle_stream_from_peer` instrumented at ENTER
  / `except trio.Cancelled:` / finally: 40
  ENTERs, ZERO `trio.Cancelled` hits. Cancel never
  reaches those tasks at all — the recvs are
  fine, nothing is telling them to stop

Actual deadlock shape: multi-level mutual wait.

    root              blocks on spawner.wait()
      spawner         blocks on grandchild.wait()
        grandchild    blocks on errorer.wait()
          errorer     Actor.cancel() ran, but proc
                      never exits

`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.

Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
   shielded loop's recv is stuck in a way cancel
   still can't break
2. `async_main`'s post-cancel unwind has other
   tasks in `root_tn` awaiting something that
   never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
   `_child_target` never returns

Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
   confirm whether `child_target()` returns /
   `os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
   which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
   equivalent level to spot divergence

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 21:23:11 -04:00
Gud Boi 458a35cf09 Surface silent failures in `_subint_forkserver`
Three places that previously swallowed exceptions silently now log via
`log.exception()` so they surface in the runtime log when something
weird happens — easier to track down sneaky failures in the
fork-from-worker-thread / subint-bootstrap primitives.

Deats,
- `_close_inherited_fds()`: post-fork child's per-fd `os.close()`
  swallow now logs the fd that failed to close. The comment notes the
  expected failure modes (already-closed-via-listdir-race,
  otherwise-unclosable) — both still fine to ignore semantically, but
  worth flagging in the log.
- `fork_from_worker_thread()` parent-side timeout branch: the
  `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close
  failure separately before raising the `worker thread didn't return`
  RuntimeError.
- `run_subint_in_worker_thread._drive()`: when
  `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`,
  log the full call signature (interp_id + bootstrap) along with the
  captured exception, before stashing into `err` for the outer caller.

Behavior unchanged — only adds observability.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 7cd47ef7fb Doc ruled-out fix + capture-pipe aside
Two new sections in
`subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-
channel-loop hang:

1. **"Attempted fix (DID NOT work) — hypothesis
   (3)"**: tried sync-closing peer channels' raw
   socket fds from `_serve_ipc_eps`'s finally block
   (iterate `server._peers`, `_chan._transport.
   stream.socket.close()`). Theory was that sync
   close would propagate as `EBADF` /
   `ClosedResourceError` into the stuck
   `recv_some()` and unblock it. Result: identical
   hang. Either trio holds an internal fd
   reference that survives external close, or the
   stuck recv isn't even the root blocker. Either
   way: ruled out, experiment reverted, skip-mark
   restored.
2. **"Aside: `-s` flag changes behavior for peer-
   intensive tests"**: noticed
   `test_context_stream_semantics.py` under
   `subint_forkserver` hangs with default
   `--capture=fd` but passes with `-s`
   (`--capture=no`). Working hypothesis: subactors
   inherit pytest's capture pipe (fds 1,2 — which
   `_close_inherited_fds` deliberately preserves);
   verbose subactor logging fills the buffer,
   writes block, deadlock. Fix direction (if
   confirmed): redirect subactor stdout/stderr to
   `/dev/null` or a file in `_actor_child_main`.
   Not a blocker on the main investigation;
   deserves its own mini-tracker.

Both sections are diagnosis-only — no code changes
in this commit.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 76d12060aa Claude-perms: ensure /commit-msg files can be written! 2026-04-23 18:48:34 -04:00
Gud Boi 506617c695 Skip-mark + narrow `subint_forkserver` cancel hang
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:

1. Skip-mark the test via
   `@pytest.mark.skipon_spawn_backend('subint_forkserver',
   reason=...)` so it stops blocking the test
   matrix while the remaining bug is being chased.
   The reason string cross-refs the conc-anal doc
   for full context.

2. Update the conc-anal doc
   (`subint_forkserver_test_cancellation_leak_issue.md`) with the
   empirical state after the three nested- cancel fix commits
   (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
   shield break) landed, narrowing the remaining hang from "everything
   broken" to "peer-channel loops don't exit on `service_tn` cancel".

Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
  `_parent_chan_cs.cancel()` wiring from `57935804`
  works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
  channel handlers in `handle_stream_from_peer`
  (inbound connections handled by
  `stream_handler_tn`, which IS `service_tn` in the
  current config).
- after `_parent_chan_cs.cancel()` fires, NEW
  shielded loops appear on the session reg_addr
  port — probably discovery-layer reconnection;
  doesn't block teardown but indicates the cascade
  has more moving parts than expected.

The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 8ac3dfeb85 Break parent-chan shield during teardown
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi c20b05e181 Use `pidfd` for cancellable `_ForkedProc.wait`
Two coordinated improvements to the `subint_forkserver` backend:

1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
   abandon_on_cancel=False)` in `_ForkedProc.wait()`
   with `trio.lowlevel.wait_readable(pidfd)`. The
   prior version blocked a trio cache thread on a
   sync syscall — outer cancel scopes couldn't
   unwedge it when something downstream got stuck.
   Same pattern `trio.Process.wait()` and
   `proc_waiter` (the mp backend) already use.

2. Drop the `@pytest.mark.xfail(strict=True)` from
   `test_orphaned_subactor_sigint_cleanup_DRAFT` —
   the test now PASSES after 0cd0b633 (fork-child
   FD scrub). Same root cause as the nested-cancel
   hang: inherited IPC/trio FDs were poisoning the
   child's event loop. Closing them lets SIGINT
   propagation work as designed.

Deats,
- `_ForkedProc.__init__` opens a pidfd via
  `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
  then non-blocking `waitpid(WNOHANG)` to collect
  the exit status (correct since the pidfd signal
  IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
  where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
  semantics) + `__del__` belt-and-braces for
  unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
  `# NOTE` comment explaining the historical
  context + cross-ref to the conc-anal doc; test
  remains in place as a regression guard

The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 9993db0193 Scrub inherited FDs in fork-child prelude
Implements fix-direction (1)/blunt-close-all-FDs from
b71705bd (`subint_forkserver` nested-cancel hang
diag), targeting the multi-level cancel-cascade
deadlock in
`test_nested_multierrors[subint_forkserver]`.

The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach,
but going blunt is actually the right call: after
`os.fork()`, the child immediately enters
`_actor_child_main()` which opens its OWN IPC
sockets / wakeup-fd / epoll-fd / etc. — none of the
parent's FDs are needed. Closing everything except
stdio is safe AND defends against future
listener/IPC additions to the parent inheriting
silently into children.

Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int`
  helper. Linux fast-path enumerates `/proc/self/fd`;
  POSIX fallback uses `RLIMIT_NOFILE` range. Matches
  the stdlib `subprocess._posixsubprocess.close_fds`
  strategy. Returns close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s
  post-fork child prelude — runs immediately after
  the pid-pipe `os.close(rfd/wfd)`, before the user
  `child_target` callable executes
- docstring cross-refs the diagnosis doc + spells
  out the FD-inheritance-cascade mechanism and why
  the close-all approach is safe for our spawn shape

Validation pending: re-run `test_nested_multierrors[subint_forkserver]`
to confirm the deadlock is gone.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 35da808905 Refine `subint_forkserver` nested-cancel hang diagnosis
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:

1. **5-zombie leak holding `:1616`** — turned out to
   be a self-inflicted cleanup bug: `pkill`-ing a bg
   pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
   the SC graceful cancel cascade entirely. Codified
   the real fix — SIGINT-first ladder w/ bounded
   wait before SIGKILL — in e5e2afb5 (`run-tests`
   SKILL) and
   `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
   hangs indefinitely** — the actual backend bug,
   and it's a deadlock not a leak.

Deats,
- new diagnosis: all 5 procs are kernel-`S` in
  `do_epoll_wait`; pytest-main's trio-cache workers
  are in `os.waitpid` waiting for children that are
  themselves waiting on IPC that never arrives —
  graceful `Portal.cancel_actor` cascade never
  reaches its targets
- tree-structure evidence: asymmetric depth across
  two identical `run_in_actor` calls — child 1
  (3 threads) spawns both its grandchildren; child 2
  (1 thread) never completes its first nursery
  `run_in_actor`. Smells like a race on fork-
  inherited state landing differently per spawn
  ordering
- new hypothesis: `os.fork()` from a subactor
  inherits the ROOT parent's IPC listener FDs
  transitively. Grandchildren end up with three
  overlapping FD sets (own + direct-parent + root),
  so IPC routing becomes ambiguous. Predicts bug
  scales with fork depth — matches reality: single-
  level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
  reaches hard-kill path), `:1616` contention (fixed
  by `reg_addr` fixture wiring), GIL starvation
  (each subactor has its own OS process+GIL),
  child-side KBI absorption (`_trio_main` only
  catches KBI at `trio.run()` callsite, reached
  only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
  `closerange()`, (2) `FD_CLOEXEC` + audit,
  (3) targeted FD cleanup via `actor.ipc_server`
  handle, (4) `os.posix_spawn` w/ `file_actions`.
  Vote: (3) — surgical, doesn't break the "no exec"
  design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
  2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
  level-spawn tests under the backend via
  `@pytest.mark.skipon_spawn_backend(...)` until
  fix lands

Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 70d58c4bd2 Use SIGINT-first ladder in `run-tests` cleanup
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).

Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
  sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
  `pgrep` to poll for exit. Loop breaks early on
  clean exit
- step 3: `pkill -9` only if graceful timed out, w/
  a logged escalation msg so it's obvious when SC
  teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
  `:1616`-holding zombie that doesn't match the
  cmdline pattern (find PID via `ss -tlnp`, then
  `kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
  unchanged

Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 1af2121057 Wire `reg_addr` through leaky cancel tests
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi e3f4f5a387 Add `subint_forkserver` test-cancellation leak doc
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi d093c31979 Add zombie-actor check to `run-tests` skill
Fork-based backends (esp. `subint_forkserver`) can
leak child actor processes on cancelled / SIGINT'd
test runs; the zombies keep the tractor default
registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`)
bound, so every subsequent session can't bind and
50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document
the pre-flight + post-cancel check as a mandatory
step 4.

Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a
  bound TCP registry listener — the authoritative
  check since :1616 is unique to our runtime
- `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*
  _actor_child_main|subint-forkserv` for leftover
  actor/forkserver procs — scoped deliberately so we
  don't false-flag legit long-running tractor-
  embedding apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep
  using the same `$(pwd)/py*` pattern, UDS `rm -f`,
  re-verify) plus a fallback for when a zombie holds
  :1616 but doesn't match the pattern: `ss -tlnp` →
  kill by PID
- explicit false-positive warning calling out the
  `piker` case (`~/repos/piker/py*/bin/python3 -m
  tractor._child ...`) so a bare `pgrep` doesn't lead
  to nuking unrelated apps

Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID
from a prior session, without collateral damage to
other tractor-embedding projects on the same box.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 1e357dcf08 Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg
Also, some slight touchups in `.spawn._subint`.
2026-04-23 18:48:34 -04:00
Gud Boi e31eb8d7c9 Label forkserver child as `subint_forkserver`
Follow-up to 72d1b901 (was prev commit adding `debug_mode` for
`subint_forkserver`): that commit wired the runtime-side
`subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the
`subint_forkserver_proc` child-target was still passing
`spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines
would report the subactor as plain `'trio'` instead of the actual
parent-side spawn mechanism. Flip the label to `'subint_forkserver'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 8bcbe730bf Enable `debug_mode` for `subint_forkserver`
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.

Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
  listing the spawn backends whose subactor runtime is trio-native
  (`'trio'`, `'subint_forkserver'`). Both the enable-site
  (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
  key.
  off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
  reports the full compatible-set + the rejected method instead of the
  stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
  `'subint_forkserver'` to the existing `('trio', 'subint')` tuple
  — fork child-side runtime receives the same SpawnSpec IPC handshake as
  the others.
- `subint_forkserver_proc` child-target now passes
  `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
  `Actor.pformat()` / log lines reflect the actual parent-side spawn
  mechanism rather than masquerading as plain `trio`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 5e85f184e0 Drop unneeded f-str prefixes 2026-04-23 18:48:34 -04:00
Gud Boi f5f37b69e6 Shorten some timeouts in `subint_forkserver` suites 2026-04-23 18:48:34 -04:00
Gud Boi a72deef709 Refine `subint_forkserver` orphan-SIGINT diagnosis
Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:

- `threading.current_thread() is threading.main_thread()`
  IS True post-fork — CPython re-designates the
  fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
  subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread

But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.

Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
  (+385 LOC): full writeup — TL;DR, symptom reproducer,
  the "intentional cancel path" the bug defeats,
  diagnostic evidence (`faulthandler` output +
  `getsignal` probe), ruled-out hypotheses
  (non-main-thread issue, wakeup-fd inheritance,
  KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
  `reason` + test docstring rewritten to match the
  refined understanding — old wording blamed the
  non-main-thread path, new wording points at the
  `epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
  `child_sigint='trio'` bullet updated: now notes trio's
  handler is already correctly installed, so the flag may
  end up a no-op / doc-only mode once the real root cause
  is fixed

Closing the gap aligns with existing design intent (make
the already-designed "KBI-as-OS-cancel" behavior actually
fire), not a new feature.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi dcd5c1ff40 Scaffold `child_sigint` modes for forkserver
Add configuration surface for future child-side SIGINT
plumbing in `subint_forkserver_proc` without wiring up the
actual trio-native SIGINT bridge — lifting one entry-guard
clause will flip the `'trio'` branch live once the
underlying fork-prelude plumbing is implemented.

Deats,
- new `ChildSigintMode = Literal['ipc', 'trio']` type +
  `_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default.
  Docstring block enumerates both:
  - `'ipc'` (default, currently the only implemented mode):
    no child-side SIGINT handler — `trio.run()` is on the
    fork-inherited non-main thread where
    `signal.set_wakeup_fd()` is main-thread-only, so
    cancellation flows exclusively via the parent's
    `Portal.cancel_actor()` IPC path. Known gap: orphan
    children don't respond to SIGINT
    (`test_orphaned_subactor_sigint_cleanup_DRAFT`)
  - `'trio'` (scaffolded only): manual SIGINT → trio-cancel
    bridge in the fork-child prelude so external Ctrl-C
    reaches stuck grandchildren even w/ a dead parent
- `subint_forkserver_proc` pulls `child_sigint` out of
  `proc_kwargs` (matches how `trio_proc` threads config to
  `open_process`, keeps `start_actor(proc_kwargs=...)` as
  the ergonomic entry point); validates membership + raises
  `NotImplementedError` for `'trio'` at the backend-entry
  guard
- `_child_target` grows a `match child_sigint:` arm that
  slots in the future `'trio'` impl without restructuring
  — today only the `'ipc'` case is reachable
- module docstring "Still-open work" list grows a bullet
  pointing at this config + the xfail'd orphan-SIGINT test

No behavioral change on the default path — `'ipc'` is the
existing flow. Scaffolding only.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 76605d5609 Add DRAFT `subint_forkserver` orphan-SIGINT test
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.

Deats,
- harness runs the failure-mode sequence out-of-process:
  1. harness subprocess runs a fresh Python script
     that calls `try_set_start_method('subint_forkserver')`
     then opens a root actor + one `sleep_forever` subactor
  2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
     off harness `stdout` to confirm IPC handshake
     completed
  3. `SIGKILL` the parent, `proc.wait()` to reap the
     zombie (otherwise `os.kill(pid, 0)` keeps reporting
     it alive)
  4. assert the child survived the parent-reap (i.e. was
     actually orphaned, not reaped too) before moving on
  5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
     every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
  bytes-buffer to carry partial lines across calls,
  `_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
  orphan-reparenting semantics don't generalize to
  other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
  doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
  pids + `proc.kill()` + bounded `proc.wait()` to avoid
  leaking orphans across the session

Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 7804a9feac Refactor `_runtime_vars` into pure get/set API
Post-fork `_runtime_vars` reset in `subint_forkserver_proc`
was previously done via direct mutation of
`_state._runtime_vars` from an external module + an inline
default dict duplicating the `_state.py`-internal defaults.
Split the access surface into a pure getter + explicit
setter so the reset call site becomes a one-liner
composition.

Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
  `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
  live `_runtime_vars` is now initialised from
  `dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
  kwarg. When True, returns a fresh copy of
  `_RUNTIME_VARS_DEFAULTS` instead of the live dict —
  still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
  atomic replacement of the live dict's contents via
  `.clear()` + `.update()`, so existing references to the
  same dict object remain valid. Accepts either the
  historical dict form or the `RuntimeVars` struct

Deats `tractor/spawn/_subint_forkserver.py`,
- collapse the prior ad-hoc `.update({...})` block into
  `set_runtime_vars(get_runtime_vars(clear_values=True))`
- drop the `_state._current_actor = None` line —
  `_trio_main` unconditionally overwrites it downstream,
  so no explicit reset needed (noted in the XXX comment)

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 63ab7c986b Reset post-fork `_state` in forkserver child
`os.fork()` inherits the parent's entire memory image,
including `tractor.runtime._state` globals that encode
"this process is the root actor" — `_runtime_vars`'s
`_is_root=True`, pre-populated `_root_mailbox` +
`_registry_addrs`, and the parent's `_current_actor`
singleton.

A fresh `exec`-based child starts with those globals at
their module-level defaults (all falsey/empty). The
forkserver child needs to match that shape BEFORE calling
`_actor_child_main()`, otherwise `Actor.__init__()` takes
the `is_root_process() == True` branch and pre-populates
`self.enable_modules`, which then trips
`assert not self.enable_modules` at the top of
`Actor._from_parent()` on the subsequent parent→child
`SpawnSpec` handshake.

Fix: at the start of `_child_target`, null
`_state._current_actor` and overwrite `_runtime_vars` with
a cold-root blank (`_is_root=False`, empty mailbox/addrs,
`_debug_mode=False`) before `_actor_child_main()` runs.

Found-via: `test_subint_forkserver_spawn_basic` hitting
the `enable_modules` assert on child-side runtime boot.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 26914fde75 Wire `subint_forkserver` as first-class backend
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.

Deats,
- new `subint_forkserver_proc()` spawn target in
  `_subint_forkserver`:
  - mirrors `trio_proc()`'s supervision model — real OS
    subprocess so `Portal.cancel_actor()` + `soft_kill()`
    on graceful teardown, `os.kill(SIGKILL)` on hard-reap
    (no `_interpreters.destroy()` race to fuss over bc the
    child lives in its own process)
  - only real diff from `trio_proc` is the spawn mechanism:
    fork from a main-interp worker thread via
    `fork_from_worker_thread()` (off-loaded to trio's
    thread pool) instead of `trio.lowlevel.open_process()`
  - child-side `_child_target` closure runs
    `tractor._child._actor_child_main()` with
    `spawn_method='trio'` — the child is a regular trio
    actor, "subint_forkserver" names how the parent
    spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
  shim around a raw OS pid: `.poll()` via
  `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
  cache thread, `.kill()` via `SIGKILL`, `.returncode`
  cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
  are `None` (fork-w/o-exec inherits parent FDs; we don't
  marshal them) which matches `soft_kill()`'s `is not None`
  guards

Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.

Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi cf2e71d87f Add `subint_forkserver` PEP 684 audit-plan doc
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.

Deats,
- three reasons enumerated for the current constraints:
  - class-A GIL-starvation — **fixed** by isolated mode:
    subints don't share main's GIL so abandoned-thread
    contention disappears
  - destroy race / tstate-recycling from `subint_proc` —
    **unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
    cross-mode, so isolated doesn't obviously fix it;
    needs empirical retest on py3.14 + isolated API
  - fork-from-main-interp-tstate (the CPython-level
    `_PyInterpreterState_DeleteExceptMain` gate) — the
    narrow reason for using a dedicated thread; **probably
    fixed** IF the destroy-race also resolves (bc trio's
    cache threads never drove subints → clean main-interp
    tstate)
- TL;DR table of which constraints unwind under each
  resolution branch
- four-step audit plan for when `msgspec#563` lands:
  - flip `_subint` to isolated mode
  - empirical destroy-race retest
  - audit `_subint_forkserver.py` — drop `non_trio`
    qualifier / maybe inline primitives
  - doc fallout — close the three `subint_*_issue.md`
    siblings w/ post-mortem notes

Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 25e400d526 Add trio-parent tests for `_subint_forkserver`
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.

Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
  for the GIL-hostage safety reason doc'd in
  `ai/conc-anal/subint_sigint_starvation_issue.md`:
  - `test_fork_from_worker_thread_via_trio` — parent-side
    plumbing baseline. `trio.run()` off-loads forkserver
    prims via `trio.to_thread.run_sync()` + asserts the
    child reaps cleanly
  - `test_fork_and_run_trio_in_child` — end-to-end: forked
    child calls `run_subint_in_worker_thread()` with a
    bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
  `dump_on_hang()` for post-mortem if the outer
  `pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
  drive the primitives directly rather than going through
  tractor's spawn-method registry (which the forkserver
  isn't plugged into yet)

Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.

Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 82332fbceb Lift fork prims into `_subint_forkserver` mod
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.

Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
  - `fork_from_worker_thread(child_target, thread_name)` —
    spawn a main-interp `threading.Thread`, call `os.fork()`
    from it, shuttle the child pid back to main via a pipe
  - `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
    create a fresh subint + drive `_interpreters.exec()` on
    a dedicated worker thread running the `bootstrap` str
    (typically imports `trio`, defines an async entry, calls
    `trio.run()`)
  - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
    pass/fail classification reusable from harness AND the
    eventual real spawn path
- feature-gated py3.14+ via the public
  `concurrent.interpreters` presence check; matches the gate
  in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
  (cross-refs `_subint_fork` stub + the two `conc-anal/`
  docs) and status: EXPERIMENTAL, not yet registered in
  `_spawn._methods`

Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).

Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi de4f470b6c Add CPython-level `subint_fork` workaround smoketest
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.

Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.

Deats,
- four scenarios via `--scenario`:
  - `control_subint_thread_fork` — the KNOWN-BROKEN case as a
    harness sanity; if the child DOESN'T abort, our analysis
    is wrong
  - `main_thread_fork` — baseline sanity, must always succeed
  - `worker_thread_fork` — architectural assertion: regular
    `threading.Thread` attached to main interp calls
    `os.fork()`; child should survive post-fork cleanup
  - `full_architecture` — end-to-end: fork from a main-interp
    worker thread, then in child create a subint driving a
    worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
  "child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
  `os.waitpid()` of the parent + per-scenario status prints to
  observe the child's fate

Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).

Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00