Commit Graph

2652 Commits (72a0465c5218c2c06b50841fcf890eb7d139d30d)

Author SHA1 Message Date
Gud Boi 72a0465c52 Default `--ll` to `None` in test harness
Only override `tractor.log._default_loglevel` when
the flag is explicitly passed — lets per-spawn and
per-example `loglevel` kwargs take effect instead
of being clobbered by the hard-coded `'ERROR'`
default.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-01 00:18:18 -04:00
Gud Boi 9431a81d37 Update debug examples + harden `test_debugger`
Pass explicit `loglevel` to `spawn()` calls in
`test_debugger` tests — required for pexpect
pattern matching now that examples no longer
hard-code log levels.

Also,
- make `expect()` return the decoded `before` str.
- add `start_method` param + fork-backend timeout
  slack (+4s) in nested-error test.
- clean up debug examples: drop unused loglevels,
  rename `n` -> `an`, fix docstrings, add TODO
  comments for tpt parametrize via osenv.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-05-01 00:13:22 -04:00
Gud Boi fc2e298a29 Update `sync_bp` + tighten `test_pause_from_sync`
Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.

Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
  no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
  / `|_<Thread` / `|_('subactor'` prefixes to match
  actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`
  `tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 20:54:50 -04:00
Gud Boi 48523358cf Add `use_stackscope` runtime var for subactor init
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.

Deats,
- add `use_stackscope: bool` to `RuntimeVars`
  struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
  successful `stackscope` import, asserts unset
  on `ImportError`
- nest stackscope init under `_debug_mode` gate
  in `Actor.async_main`, check rtvar alongside
  env var
- defer `maybe_init_greenback` import to its own
  `use_greenback` branch

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 20:50:07 -04:00
Gud Boi e2b790a70d Fix `SIGUSR1` tree-dump ordering in `_stackscope`
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.

Also,
- gate file/tty sink writes behind `write_file` +
  `write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
  new sequential parent -> relay-log -> sub ordering.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:35:55 -04:00
Gud Boi 61d4525137 Add `pytest_load_initial_conftests()` for `--capture=`
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.

Also,
- load `tractor._testing.pytest` via `-p` in ini
  (bc bootstrapping hooks must register before
  conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
  tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
  per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
  and pop env var on disable.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:29:51 -04:00
Gud Boi 0996a83655 Add `--uds`/`--uds-only` flags to `tractor-reap`
Wire up `find_orphaned_uds()` + `reap_uds()` from
`_reap` as a new phase-3 UDS sweep in the CLI
script. Opt-in via `--uds` (run after proc reap +
shm) or `--uds-only` (skip other phases).

Also,
- consolidate skip-proc-reap logic into a single
  `skip_proc_reap` bool covering both `--shm-only`
  and `--uds-only`
- extend header docstring + usage examples

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:26:15 -04:00
Gud Boi 1cdc7fb302 Add UDS orphan-sweep helpers + reap fixtures to `_reap`
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:

- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
  `reap_uds()` — detect `<name>@<pid>.sock` files under
  `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
  the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
  lingering subactors, wait, SIGKILL survivors, then sweep orphaned
  UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
  snapshot sock-file dir before/after each test, warn + reap new
  orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
  with known-leaky teardown.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 19:21:02 -04:00
Gud Boi 486249d74f Allow per-call `start_method`/`loglevel` overrides
In `tests/devx/conftest.py::spawn`, refactor the
fixture-internal closures so consumer tests can pass
explicit `start_method`/`loglevel` to each `_spawn()`
invocation rather than only inheriting the fixture-
scoped parametrize values.

Deats,
- promote `set_spawn_method()` and `set_loglevel()`
  to take their respective values as fn params (vs
  closing over the fixture-scope vars).
- give `_spawn()` `start_method=start_method` and
  `loglevel: str|None = None` kwargs so callers
  override one-off without re-parametrizing the
  suite. NOTE: this drops the implicit fixture-
  scoped `loglevel` forward — `_spawn()` callers
  now must pass `loglevel=...` explicitly.
- TODO: figure out how `--ll <level>` should map to
  the default (currently `None` → uses env-var or
  tractor default).
- add a docstring to `_spawn()` so its role as the
  consumer-facing closure is obvious from `help()`.

Also,
- `assert_before()` now returns the `.before` output
  on success (was `None`); add a one-line docstring
  describing the new return contract.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-30 14:17:41 -04:00
Gud Boi 8bc304f094 TOSQUASH 2d4995e0, fix _pformat -> devx.pformat.. 2026-04-29 18:47:29 -04:00
Gud Boi fc5e80fea5 Drop subint-family gate from `main_thread_forkserver`
`main_thread_forkserver` doesn't actually need py3.14
`concurrent.interpreters` (PEP 734) — it forks from a
non-trio worker thread and runs `_trio_main` in the child,
same shape as `trio_proc`. The previous `_has_subints`
gate + subint-family `case` arm were a copy-paste error.

In `tractor.spawn._main_thread_forkserver`,
- drop the `_has_subints` import + the `RuntimeError`
  raise in `main_thread_forkserver_proc()`.
- drop the now-unused `import sys` (only used by the
  prior error msg).

In `tractor.spawn._spawn.try_set_start_method()`,
- pull `'main_thread_forkserver'` out of the subint-
  family arm (which still gates on `_has_subints`).
- merge it into the `'trio'` arm — both set `_ctx = None`
  bc neither needs an `mp.context`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 18:13:46 -04:00
Gud Boi b7115fc875 Drop test-local timeouts, +`sync_pause` to dev
In `pyproject.toml`,
- include the `sync_pause` group from `dev`, so dev
  installs ship `greenback` for `pause_from_sync()`.

Comment out per-test `@pytest.mark.timeout(...)`
markers in,
- `tests/devx/test_debugger.py`
- `tests/discovery/test_registrar.py`
- `tests/spawn/test_main_thread_forkserver.py`
- `tests/spawn/test_subint_cancellation.py`
- `tests/test_advanced_streaming.py`
- `tests/test_cancellation.py`

The global cap was already dropped (3c366cac); these
were the leftover per-test caps which now block
interactive `pdb` flows under the new spawn backends.

In `uv.lock`,
- pull `greenback` into the resolved `dev` deps
  (per the `sync_pause` include above).
- catch up the prior `xonsh` editable→PyPI switch
  (from the `pyproject.toml` `tool.uv.sources` edit).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 18:10:40 -04:00
Gud Boi 208e7c0926 Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.

In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
  `loglevel` and stashing an `env_ll_report` to emit once the
  console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
  `log.get_console_log()` init *up* so the env-var report
  routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
  `start_method` and warn-logging when the env-var clobbers
  an explicit caller value.

Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
  and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
  `pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
  `main_thread_forkserver` and `subint_forkserver` bc
  example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
  re-route a later in-process tractor test's spawn-backend
  or loglevel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 17:29:38 -04:00
Gud Boi 22cdf15b73 Flip back to default `pytest` capture for CI 2026-04-29 15:03:26 -04:00
Gud Boi 532a9834f3 Add posix-multithreaded-`fork()` explainer doc 2026-04-29 12:50:23 -04:00
Gud Boi 2917b74ba4 Add todo for running `test_debugger` suite on forkserver spawner 2026-04-29 12:49:36 -04:00
Gud Boi 2d4995e08d Route `stackscope` SIGUSR1 onto trio loop
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.

Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.

Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).

`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.

Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
  hint in the rendered dump header so users know where
  to find the artifact even when stdio is captured.
- swap the inline `f'     |_{actor}'` line for a
  `_pformat.nest_from_op` rendering of `actor_repr`
  (matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
  branches now note `(trio_token captured: <bool>)`
  so it's obvious from the log whether the full-tree
  path is wired.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 12:01:03 -04:00
Gud Boi 8c730193f9 Refine fork-survival docs + `EBADF` handling
Two cleanup tweaks in `_main_thread_forkserver`:

Doc, "what survives the fork?" section — expand the
"non-calling threads are gone in the child" claim with
the precise execution-vs-memory split that reconciles
this module's prior framing with trio's (canonical
[python-trio/trio#1614][trio-1614]) "leaked stacks"
framing:

- execution-side: only the calling thread runs
  post-fork; all others never execute another
  instruction.
- memory-side: those non-running threads' stacks +
  per-thread heap structures are still COW-inherited
  as orphaned bytes — what trio means by "leaked".

Same POSIX reality, opposite sides; the table is
extended to a 4-col `parent | child (executing) |
child (memory)` layout to make both views explicit.
Also blank-line-padded the bulleted hazard classes
for cleaner markdown rendering.

[trio-1614]: https://github.com/python-trio/trio/issues/1614

Code, `_close_inherited_fds()` log noise — split the
catch-all `except OSError` into:

- `EBADF` — benign race where the dirfd that
  `os.listdir('/proc/self/fd')` itself opened ends up
  in `candidates`, then auto-closes before the loop
  reaches it. Demote to `log.debug()` + `continue`;
  prior `log.exception` drowned the post-fork log
  channel with stack traces every spawn.
- other errnos (EIO / EPERM / EINTR / ...) keep the
  loud `log.exception` surface — those ARE genuinely
  unexpected.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:34:33 -04:00
Gud Boi 5418f2dc3c Add `--enable-stackscope` pytest plugin flag
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.

Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
  * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
    inherit it via environ,
  * installs the handler in pytest itself via
    `enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
  existing `_debug_mode` gate to ALSO fire when
  `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
  install the same handler at runtime startup.

Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:

- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
  always written) — guaranteed-readable artifact even
  under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
  tty; ignored if device is missing.

Other,
- squelch the benign `RuntimeWarning` ("coroutine method
  'asend'/'athrow' was never awaited") from
  `stackscope._glue`'s import-time async-gen type
  introspection so `--enable-stackscope` setup stays
  quiet.
- log msg in the `_runtime` ImportError branch now
  mentions `--enable-stackscope` alongside debug-mode.

Usage,
  pytest --enable-stackscope -k <hang-test>
  # in another shell, find the pid + signal:
  kill -USR1 <pytest-or-subactor-pid>
  # tail the artifact:
  tail -f /tmp/tractor-stackscope-<pid>.log

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:32:23 -04:00
Gud Boi 383b0fdd75 Backend-aware `fail_after` in pub/sub test
Mirror `060f7d24`'s pattern (backend-aware timeout in
`maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard
`trio.fail_after` cap. Fork-based backends pay per-spawn
fork+IPC-handshake cost which stacks over `cpus - 1`
sequential `n.run_in_actor()` calls; empirically 12s
flakes on `main_thread_forkserver` under UDS
cross-pytest contention (#451 / #452).

Defaults:
- `main_thread_forkserver` → 30s
- everything else          → 12s (unchanged)

Hoist the timeout-pick out of the `main()` closure so the
dispatch happens once in the trio task rather than
re-evaluating per spawn.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:28:48 -04:00
Gud Boi 060f7d24c4 Backend-aware timeout in `maybe_expect_raises`
Default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).

Defaults:
- `main_thread_forkserver` → 30s
- everything else          → 3s (unchanged)

Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):

- 3s  → all-valid variant flaked w/ `TooSlowError`
- 8s  → `invalid-return` variant flaked w/ `Cancelled`
        (surfaced instead of `MsgTypeError` bc the
        outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention

30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-29 10:21:56 -04:00
Gud Boi 3c366cac13 Drop global `pytest-timeout` cap from `pyproject.toml`
`timeout = 200` was firing via SIGALRM (the default
`method='signal'`) which synchronously raises `Failed` in
trio's main thread mid-`epoll.poll()`, abandoning trio's
runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-
installed. EVERY subsequent `trio.run()` in the same pytest
session then bails with
`RuntimeError: Attempted to call run() from inside a run()`.

Empirical impact: a session that hits a single 200s hang
cascades into 30-40 false-positive failures across every
downstream test file that uses `trio.run`. Recent UDS run
saw 1 real timeout (`test_unregistered_err_still_relayed`)
poison 38 sibling tests with cascade-fails — a debugging
nightmare.

Same architectural bug we already documented in
`tests/test_advanced_streaming.py::test_dynamic_pub_sub`
(see its module-level NOTE) — both `pytest-timeout`
enforcement modes are incompatible with trio under fork-
based spawn backends. Now scoped session-wide.

For tests that legitimately need a wall-clock cap, the
canonical pattern is `with trio.fail_after(N):` INSIDE the
test — trio's own `Cancelled` machinery cleanly unwinds
the actor nursery without disturbing global state.

For CI: rely on job-level wall-clock timeouts (e.g. GitHub
Actions `timeout-minutes`) to abort genuinely-stuck suites.

`pyproject.toml` comment block spells this all out so a
future contributor doesn't reach back for `timeout =` and
re-introduce the bug.

ALSO, bump `xonsh` to at least `0.23.0` release.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-28 16:00:16 -04:00
Gud Boi f8178df0fd Return parent `pid: int` from new `reap_subactors_per_test` fixture 2026-04-27 23:27:19 -04:00
Gud Boi 530160fa69 Use `trio.fail_after` cap in `test_dynamic_pub_sub`
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.

Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:

- `method='signal'` (SIGALRM) synchronously raises `Failed`
  in trio's main thread mid-`epoll.poll()`, leaving
  `GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
  abandoned") so EVERY subsequent `trio.run()` in the same
  pytest process bails with
  `RuntimeError: Attempted to call run() from inside a run()`
  — full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
  can let the KBI escape trio's `KIManager` under fork-
  cascade teardown races and bubble out of pytest entirely
  — kills the whole session.

`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
  cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
  global state corruption either way.

Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.

Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 23:25:04 -04:00
Gud Boi b376eb0332 Add opt-in `reap_subactors_per_test` fixture
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.

Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...

Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.

Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 21:41:02 -04:00
Gud Boi 7c5dd4d033 Fix `_testing.addr.get_rando_addr` cross-process collisions
Previously the random port was a default-arg expression
(`_rando_port: str = random.randint(1000, 9999)`) — evaluated
ONCE at module import time, making it a per-process singleton.
Two parallel pytest sessions had a 1/9000 birthday-pair chance
of picking the same port; when it hit, every `reg_addr`-using
test in BOTH runs would cascade-fail with "Address already in
use".

Switch to per-call `random.randint()` salted with `os.getpid()`
so:

- within one session: two calls return distinct ports — e.g.
  `test_tpt_bind_addrs::bind-subset-reg` now actually gets two
  different reg addrs on the TCP backend (it was silently
  duplicating before),
- across parallel sessions: pid salt biases each process's
  port choices apart, making cross-run collisions
  vanishingly rare.

Drop the bogus `: str` annotation (was always `int`). UDS already gets
per-process isolation via `UDSAddress.get_random()`'s `@<pid>`
socket-path suffix, so no change needed there.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 20:15:20 -04:00
Gud Boi cbdf1eb6db Guard `subint_forkserver` stub against re-alias
Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier
regression guard that pins down the variant-2 reservation
contract: the `'subint_forkserver'` key in
`_spawn._methods` MUST raise `NotImplementedError` today,
not silently dispatch to `main_thread_forkserver_proc`.

The transient alias-state existed briefly during the rename
(commit `57dae0e4`'s "Split forkserver backend into variant
1/2 mods" landed the alias; `5e83881f` flipped it to the
stub). Without a guard, a future refactor could easily
re-collapse the two keys back to a single coro and silently
break the variant-1 / variant-2 contract.

Also asserts the stub's error msg surfaces the two pointers
an operator hitting it actually needs:

- `'main_thread_forkserver'` — the working backend they
  prolly meant,
- `'msgspec#1026'` — the upstream blocker that has to land
  before variant-2 can ship.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 20:06:44 -04:00
Gud Boi 205382a39b Sweep `subint_forkserver` → `main_thread_forkserver` in code
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:

- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
  `main_thread_forkserver`, drop the stub-only `subint_forkserver`
  entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
  capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
  xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
  reason-text since the `mp.SharedMemory(track=False)`
  + resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
  the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
  `_testing.pytest`, `_subint.py`, `pyproject.toml`,
  `test_cancellation.py`, `test_registrar.py` — refs to variant-1
  backend updated.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:55:37 -04:00
Gud Boi 9f0709eee2 Migrate test/smoketest imports + rename test file
Rename `tests/spawn/test_subint_forkserver.py` →
`test_main_thread_forkserver.py` and migrate its imports +
internal refs to the new canonical names:

- `fork_from_worker_thread`, `wait_child` → from
  `tractor.spawn._main_thread_forkserver`.
- `run_subint_in_worker_thread` → still from `_subint_forkserver`
  (variant-2 primitive).
- Module docstring + tier-3 fixture + the `*_spawn_basic` test fn
  renamed for variant-1-honesty.
- Orphan-harness subprocess argv flipped from `'subint_forkserver'`
  → `'main_thread_forkserver'`.

`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split
the same way.

`tractor/spawn/_subint_forkserver.py` drops the backward- compat
re-exports of the fork primitives — the only consumers (test file
+ smoketest) now import from `_main_thread_forkserver` directly.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:47:44 -04:00
Gud Boi 5e83881f10 Add `subint_forkserver_proc` stub, flip dispatch, prune
Reduce `_subint_forkserver.py` to its variant-2 placeholder shape:

- Add `subint_forkserver_proc` async stub raising `NotImplementedError`
  with a redirect msg pointing at the working variant-1 backend
  (`main_thread_forkserver`), jcrist/msgspec#1026 (upstream PEP 684
  blocker), and #379 (subint umbrella).

- `tractor.spawn._spawn._methods['subint_forkserver']` now dispatches to
  the stub instead of aliasing the variant-1 coroutine
  — `--spawn-backend=subint_forkserver` errors cleanly.

- Drop now-dead module-scope: `ChildSigintMode`
  / `_DEFAULT_CHILD_SIGINT` defs, `_has_subints` try/except (replaced
  with import from `._subint`), unused imports (`partial`, `Literal`,
  `sys`, msgtypes/pretty_struct, `current_actor`,
  `cancel_on_completion`/`soft_kill`, `_server` TYPE_CHECKING).

- Backward-compat re-exports of fork primitives kept until the follow-up
  commit migrates external test imports.

- `tests/spawn/test_subint_forkserver.py::forkserver_spawn_method`
  fixture: flip hardcoded `'subint_forkserver'`
  → `'main_thread_forkserver'` so the test still exercises the working
  backend (full file rename comes in the test-import migration commit).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:36:08 -04:00
Gud Boi 57dae0e4a6 Split forkserver backend into variant 1/2 mods
The `subint_forkserver` name was always aspirational —
today's impl forks from a regular main-interp worker
thread and the child runs trio on its own main interp;
NO subinterp anywhere in parent or child. Splitting the
backend into two clearly-named variants drops the lie:

- **variant 1** — `main_thread_forkserver` (the working
  impl). New `SpawnMethodKey` literal + `_methods`
  dispatch entry + `_runtime.Actor._from_parent()`
  match-arm. The spawn-coro `subint_forkserver_proc`
  moves to `_main_thread_forkserver` and is renamed
  `main_thread_forkserver_proc()`.

- **variant 2** — `subint_forkserver` (future, reserved).
  Module shrinks to a placeholder describing the
  variant-2 design (subint-isolated child runtime, gated
  on jcrist/msgspec#1026 + PEP 684). Today the legacy
  `'subint_forkserver'` key aliases to
  `main_thread_forkserver_proc` so existing
  `--spawn-backend=subint_forkserver` invocations keep
  working; flipped to a `NotImplementedError` stub in a
  follow-up.

Deats,
- `Actor._from_parent()` spawn-method gate now accepts
  both `'main_thread_forkserver'` and
  `'subint_forkserver'` (both go through the
  IPC-`SpawnSpec` path).
- the variant-1 spawn-coro stamps its own `SpawnSpec` /
  log lines with `spawn_method='main_thread_forkserver'`
  so subactor renders reflect the actual mechanism.
- docstring reorg: trio×fork hazard breakdown, POSIX
  fork-survival semantics, in-process-vs-stdlib
  forkserver design notes, and the TODO/cleanup section
  all move from `_subint_forkserver` to
  `_main_thread_forkserver` (lives with the working
  code). `_subint_forkserver` keeps a tight forward-
  looking doc that motivates the reserved key.
- `run_subint_in_worker_thread()` stays in
  `_subint_forkserver` as the companion primitive — it's
  the subint counterpart to `fork_from_worker_thread()`
  and will plug into the future variant-2 spawn-coro.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:28:11 -04:00
Gud Boi 99dade0fb3 Extract fork primitives into `_main_thread_forkserver`
Move the truly-generic main-interp-worker-thread fork primitives
(`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`,
`wait_child`, `_format_child_exit`) out of `_subint_forkserver.py` into
a sibling `_main_thread_forkserver.py` module so the primitive layer is
honestly named — none of these helpers touch a subint, they just fork
from a main-interp worker thread.

`_subint_forkserver.py` keeps its public surface intact via re-export so
any existing `from tractor.spawn._subint_forkserver import ...` callsite
still resolves.

Net: zero behavior change, preps the way for the upcoming spawn-method
key split where `main_thread_forkserver` ships as the working backend
and `subint_forkserver` becomes reserved for the future
subint-isolated-child variant (gated on jcrist/msgspec#1026).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 19:04:26 -04:00
Gud Boi 4b5176e2c3 Doc future-subint payoffs for `_subint_forkserver`
Adds a "Future arch — what subints would buy us" section to
the module docstring, complementing the prior commit's
current-state rationale. Code is unchanged.

Frames the `subint` prefix as family-naming today (no actual
subinterp is created yet), then lays out the three concrete
wins that land once jcrist/msgspec#1026 unblocks PEP 684
isolated-mode subints:

- Cheaper forks — moving the parent's `trio.run()` into a
  subint shrinks the main-interp COW image the child inherits.
  The main interp becomes the literal forkserver: an
  intentionally-empty execution ctx whose only job is to call
  `os.fork()` cleanly.

- True parallelism — per-interp GIL means the forkserver
  thread on main and the trio thread on subint actually run in
  parallel. Spawn latency stops stalling the trio loop.

- Multi-actor-per-process — the architectural payoff. With
  per-interp-GIL subints, one process can host main + N
  subint-resident actor `trio.run()`s, and `os.fork()` reverts
  to the last-resort spawn (only when OS-level isolation is
  actually needed). Joins the story with the in-thread
  `_subint.py` backend: `subint` → in-process spawn,
  `subint_forkserver` → cross-process when a real OS boundary
  is required.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 18:20:10 -04:00
Gud Boi 3ab99d557a Doc `_subint_forkserver` design + fork semantics
Major expansion of the module docstring. Code is
unchanged; this lands the architectural reasoning that
was previously implicit, plus the POSIX/trio fork
mechanics the design relies on.

New sections:
- "Design rationale" — answers two implicit questions:
  (1) why a forkserver pattern at all (vs. forking
  directly from a trio task), (2) why in-process (vs.
  stdlib `mp.forkserver`'s sidecar process). Documents
  the three costs the in-process design avoids
  (sidecar lifecycle, per-spawn IPC, cold-start child)
  and the tradeoffs we accept in exchange (3.14-only,
  heavier than `to_thread.run_sync`).
- "Implementation status" — clarifies what's actually
  landed today vs. the envisioned arch: parent's
  `trio.run()` still lives on main interp (subint-
  hosted root gated on jcrist/msgspec#1026). Names
  why the "subint" prefix is correct anyway — same PR
  series as `_subint.py` / `_subint_fork.py`.
- "What survives the fork? — POSIX semantics" — POSIX
  preserves only the calling thread, so the
  `trio.run()` thread is gone in the child. Includes
  a small parent/child thread-survival table and
  covers the four artifact classes that DO cross the
  fork boundary (inherited fds, COW memory, Python
  thread state, user-level locks) and how each is
  handled.
- "FYI: how this dodges the `trio.run()` × `fork()`
  hazards" — itemizes each class of trio process-
  global state (wakeup-fd, `epoll`/`kqueue`,
  threadpool, cancel scopes / nurseries, `atexit`,
  foreign-language I/O) and explains how the
  forkserver-thread design avoids each.

Also,
- bump the gated msgspec issue link from
  `jcrist/msgspec#563` to `jcrist/msgspec#1026` (the
  PEP 684 isolated-mode tracker).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 18:16:50 -04:00
Gud Boi 54561959e6 Log subint bootstrap excs + cancel-leak state
Two diagnostic gaps in `tractor.spawn._subint.subint_proc()` that hid
otherwise-silent failures, plus tracking-issue links on the two open
`subint_forkserver` follow-ups.

Deats,
- bootstrap-exc visibility: wrap the call to
  `_interpreters.exec(interp_id, bootstrap)` with
  `try/except BaseException` + `log.exception(...)`.
  * Without it, an `ImportError` / `SyntaxError` raised inside the
    dedicated driver thread goes only to Python's default thread
    excepthook — invisible to the parent, which then waits forever on
    `subint_exited.wait()`.
  * `?TODO` notes `anyio`'s `to_interpreter._interp_call` +
    `(retval, is_exception)` pattern as the next step for re-raising;
    skipped now bc it must coordinate with the `trio.Cancelled` paths
    around the existing `.wait()` calls.

- cancel-leak disambiguation: when the driver thread doesn't exit within
  `_HARD_KILL_TIMEOUT`, also log `_interpreters.is_running(interp_id)`
  as `subint_still_running=...` so the operator can tell "thread leaked,
  subint already done" apart from "thread alive bc subint is wedged".
  * pattern borrowed from `trio-parallel`'s `_sint.SintWorker.is_alive()`.

- `?TODO` near the `bootstrap` literal: future switch to
  `_interpreters.set___main___attrs()` — same API `anyio`
  uses in `to_interpreter._Worker.call()` — for passing
  non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables,
  etc).
  * add cross-refs tracking issue `#379`.

Also,
- `Tracked at: [#449]` link on
  `subint_forkserver_test_cancellation_leak_issue.md`.
- `Tracked at: [#450]` link on
  `subint_forkserver_thread_constraints_on_pep684_issue.md`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 15:57:55 -04:00
Gud Boi 66f1941f46 Wire `reg_addr` into `test_context_stream_semantics`
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.

Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 13:52:28 -04:00
Gud Boi 9b05f659b3 Wire `test_dynamic_pub_sub` to standard fixtures
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:

- pass `registry_addrs=[reg_addr]` + `debug_mode` into
  `tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
  `test_log.exception('Timed out AS EXPECTED')` so the
  expected timeout is logged explicitly instead of
  swallowed.

Also,
- drop whitespace-only blank lines around the
  `subs` param of `consumer()` and `ctx` param of
  `one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
  docstring to multi-line form.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 12:59:00 -04:00
Gud Boi 65fcfbf224 Bump `test_stale_entry_is_deleted`'s timeout to 30
Seems that when run in-suite it delays more then the so-measured "happy
path" timing; better to have no suite-global interruption then asserting
a fast single test's run.
2026-04-27 11:46:45 -04:00
Gud Boi 4f12d69b41 Add `--shm` orphan sweep to `tractor-reap`
Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.

Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
  + `reap_shm()` helpers. Match criteria: regular file
  under `/dev/shm`, owned by current uid, AND no live
  proc has it open (mmap'd or fd-held). In-use
  enumeration via `psutil.Process.memory_maps()` +
  `.open_files()` — xplatform, kernel-canonical (same
  answer `lsof` would give), no reliance on
  tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
  `NotImplementedError` outside Linux/FreeBSD bc macOS
  POSIX shm has no fs-visible path (`shm_open` only)
  and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
  process reap) and `--shm-only` (skip process phase)
  flags. `-n` dry-runs both phases. Exit code is `1`
  if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
  the `testing` dep group; lazy-imported in `_reap.py`
  so the process-reap path stays import-clean without
  it.

Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
  (new section 10c) — covers match criteria + the
  preservation guarantee for unrelated apps.
- flip mitigation status in
  `subint_forkserver_mp_shared_memory_issue.md` from
  "could extend `tractor-reap`" to "implemented", with
  a note that callers should still UUID-pin shm keys to
  avoid cross-session collisions.

Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 11:35:33 -04:00
Gud Boi aa3e230926 Fix `SharedMemory` under `subint_forkserver`
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.

Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
  `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
  now run unconditionally:
  * monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
    `ManTracker` subclass (empty `register` / `unregister`
    / `ensure_running`).
  * return `partial(SharedMemory, track=False)` for the per-allocation
    opt-out.
  * belt + suspenders: even if something dodges the wrapper, the
    singleton can't talk to the inherited (broken) parent fd.

- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
  skip of the unlink-callback; install a `try_unlink()` wrapper that
  swallows `FileNotFoundError` (sibling-already-cleaned race in
  shared-key setups). Without `mp.resource_tracker` doing it for us, we
  own the unlink — `actor. lifetime_stack` is the right place since
  tractor already controls actor lifecycle.

- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
  module-level skip- list (tests pass now). Inline comment cross-refs
  the two `_mp_bs` / `_shm` workarounds.

- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
  rewrite — flips status from "open / unresolvable in tractor" to
  "resolved, kept as decision record". Adds Resolution section, "Why
  this is the right call" rationale (mp tracker is widely criticized;
  tractor already owns lifecycle), trade-offs (crash-leaked segments,
  lost mp leak warning), verification (7 passed under both
  `subint_forkserver` and `trio` backends), and upstream issue links

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-27 10:51:28 -04:00
Gud Boi c99d475d03 Document `SharedMemory` × `subint_forkserver` incompat
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).

Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
  descriptions + link to the analysis doc

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-26 20:13:24 -04:00
Gud Boi 6d76b60404 Add `tractor-reap` CLI + document auto-reap
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:

- orphan-mode (default): finds PPid==1 procs
  with cwd matching repo root + `python` in
  cmdline.
- descendant-mode (`--parent <pid>`): scoped
  sweep under a still-live supervisor.

SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).

Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-26 18:04:40 -04:00
Gud Boi eae478f3d5 Add `_testing._reap` + auto-reap fixture
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.

Deats,
- new `tractor/_testing/_reap.py` (+230 LOC) — Linux- only reaper using
  `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
  - `find_descendants(parent_pid)` for the in-session case
    (PPid-direct-match while pytest is still alive).
  - `find_orphans(repo_root)` for the CLI / post- mortem case (`PPid==1`
    reparented to init + `cwd` filter to repo root + `python` cmdline
    filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
  all, poll up to `grace` for exit, SIGKILL any survivors. Returns
  `(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
  `tractor/_testing/pytest.py` — after `yield`, runs
  `find_descendants(os.getpid())` + `reap(...)` so each pytest session
  leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
  commit) for the pytest-died-mid-session case where the in-session
  fixture didn't get to run.

Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
  module-top in `pytest.py` (was inline-imported inside
  `pytest_generate_tests`), and reuse it in
  `pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
  mark arg is a valid spawn-method literal — catches typos at collection
  time.
- inline `# ?TODO` flags running these through the `try_set_backend`
  checker for stronger validation.

Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-25 00:05:58 -04:00
Gud Boi 44bdb1697c Tighten orphan-SIGINT xfail to `strict=True`
Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from
flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in
full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT
hang as the documented cause. The previous framing ("env pollution") let
the test silently pass when ordering happened to favor it; the new
framing forces an XPASS-as-FAIL the moment the underlying gap is
actually closed, so we can drop the mark intentionally instead of
accidentally.

Reason text + leading `# Known-gap test —` comment both point at
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
for the full diagnosis.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-24 22:48:35 -04:00
Gud Boi 2ca0f41e61 Skip `test_loglevel_propagated_to_subactor` on subint forkserver too 2026-04-24 21:47:46 -04:00
Gud Boi b350aa09ee Wire `reg_addr` through infected-asyncio tests
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.

Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
  calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
  - add `reg_addr`, `debug_mode`, `start_method`
    fixture params
  - `delay` now reads the `debug_mode` param directly
    instead of calling `tractor.debug_mode()` (fires
    slightly earlier in the test lifecycle)
  - sanity assert `if debug_mode: assert
    tractor.debug_mode()` after nursery open
  - new print showing SIGINT target
    (`send_sigint_to` + resolved pid)
  - catch `trio.TooSlowError` around
    `ctx.wait_for_result()` and conditionally
    `pytest.xfail` when `send_sigint_to == 'child'
    and start_method == 'subint_forkserver'` — the
    known orphan-SIGINT limitation tracked in
    `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-24 20:26:25 -04:00
Gud Boi d6e70e9de4 Import-or-skip `.devx.` tests requiring `greenback`
Which is for sure true on py3.14+ rn since `greenlet` didn't want to
build for us (yet).
2026-04-24 17:39:13 -04:00
Gud Boi 4c133ab541 Default `pytest` to use `--capture=sys`
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).

Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
  `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
  capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
  `sys.stderr`).
- KEPT: `pytest -s` debugging behavior.

This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.

Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
  rationale comment cross-ref'ing the post-mortem doc

- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
  from `test_nested_ multierrors` — no longer needed.
  * file-level `pytestmark` covers any residual.

- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
  mark loosened from `strict=True` to `strict=False` + reason rewritten.
  * it passes in isolation but is session-env-pollution sensitive
    (leftover subactor PIDs competing for ports / inheriting harness
    FDs).
  * tolerate both outcomes until suite isolation improves.

- `test_shm`: extend the existing
  `skipon_spawn_backend('subint', ...)` to also skip
  `'subint_forkserver'`.
  * Different root cause from the cancel-cascade class:
    `multiprocessing.SharedMemory`'s `resource_tracker` + internals
    assume fresh- process state, don't survive fork-without-exec cleanly

- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
  (unrelated to forkserver; just a flaky-under-load bump).

- `tractor.spawn._subint_forkserver`: inline comment-only future-work
  marker right before `_actor_child_main()` describing the planned
  conditional stdout/stderr-to-`/dev/null` redirect for cases where
  `--capture=sys` isn't enough (no code change — the redirect logic
  itself is deferred).

EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum- invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-24 14:17:23 -04:00
Gud Boi 4106ba73ea Codify capture-pipe hang lesson in skills
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.

Deats,
- `.claude/skills/conc-anal/SKILL.md`:
  - new "Unbounded waits in cleanup paths"
    section — rule: bound every `await X.wait()`
    in cleanup paths with `trio.move_on_after()`
    unless the setter is unconditionally
    reachable. Recent example:
    `ipc_server.wait_for_no_more_peers()` in
    `async_main`'s finally (was unbounded,
    deadlocked when any peer handler stuck)
  - new "The capture-pipe-fill hang pattern"
    section — mechanism, grep-pointers to the
    existing `conftest.py` guards (`tests/conftest
    .py:258`, `:316`), cross-ref to the full
    post-mortem doc, and the grep-note: "if a
    multi-subproc tractor test hangs, `pytest -s`
    first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
  "Section 9: The pytest-capture hang pattern
  (CHECK THIS FIRST)" with symptom / cause /
  pre-existing guards to grep / three-step debug
  recipe (try `-s`, lower loglevel, redirect
  stdout/stderr) / signature of this bug vs. a
  real code hang / historical reference

Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 23:22:40 -04:00
Gud Boi eceed29d4a Pin forkserver hang to pytest `--capture=fd`
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.

Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.

Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.

Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.

Deats,
- `subint_forkserver_test_cancellation_leak_issue
  .md`: new section "Update — VERY late: pytest
  capture pipe IS the final gate" w/ DIAG timeline
  showing `trio.run` fully returns, diagnosis of
  pipe-fill mechanism, retrospective on the
  earlier wrong ruling-out, and fix direction
  (redirect subactor stdout/stderr to `/dev/null`
  in fork-child prelude, conditional on
  pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
  rewritten to describe the capture-pipe gate
  specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
  orphan-SIGINT test regresses back to xfail.
  Previously passed after the FD-hygiene fix,
  but the new `wait_for_no_more_peers(
  move_on_after=3.0)` bound in `async_main`'s
  teardown added up to 3s latency, pushing
  orphan-subactor exit past the test's 10s poll
  window. Real fix: faster orphan-side teardown
  OR extend poll window to 15s

No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 23:18:14 -04:00