tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	c4885f9d99	Drop global mutation of `_PROC_SPAWN_WAIT` In top level `daemon`-fixture that is.. Use a local `bg_daemon_spawn_delay` instead of mutating the module-level `_PROC_SPAWN_WAIT` — previously each `daemon` fixture invocation would permanently add 1.6s (UDS) or 1s (CI) to the global, inflating delays across the session. Also, emit a `test_log.warning()` when verbose loglevel is silently reduced to `'info'`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 16:23:50 -04:00
Gud Boi	0ef549fadb	Add `tractor.trionics.patches` subpkg + first fix With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which can busy-loop due to lack of handling `EOF`. New `tractor.trionics.patches` subpkg housing defensive monkey-patches for upstream `trio` bugs we've encountered while running `tractor` — particularly as of recent, fork-survival edge cases that haven't been filed/fixed upstream yet. Each patch is idempotent, version-gated via `is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the upstream release whose adoption allows deletion. Subpkg layout + per-patch contract documented in `tractor/trionics/patches/README.md` — `apply()` / `is_needed()` / `repro()` API, registry pattern via `_PATCHES` in `__init__.py`, single-call entry point `apply_all()`. First patch, `_wakeup_socketpair`: - `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN). - under `fork()`-spawning backends the COW-inherited socketpair fds & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair` instance whose write-end is closed, and `drain()` then spins forever in C with no Python checkpoints, - this obviously burns 100% CPU and no signal delivery. Standalone repro: from trio._core._wakeup_socketpair import WakeupSocketpair ws = WakeupSocketpair() ws.write_sock.close() ws.drain() # spins forever Patch is one-line — break the drain loop on b'' EOF. Manifested as two distinct test failures: - `tests/test_multi_program.py::test_register_duplicate_name` hung at 100% CPU on the busy-loop directly (fork child's worker thread) - `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`, both threads parked in `epoll_wait`, no TCP connect-back to parent ever happened. Same patch fixes both. 
Restored 99.7% pass rate on full suite under `--spawn-backend=main_thread_forkserver` (was hanging indefinitely before). Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE any trio runtime init. Harmless on non-fork backends. Conc-anal write-ups, including strace + py-spy evidence: - `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md` - `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md` Regression tests in `tests/trionics/test_patches.py`: each test asserts (a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b) the patch fixes it with a SIGALRM wall-clock cap so a regression fails loud instead of hanging silently. TODO: - [ ] file the upstream `python-trio/trio` issue + PR. - [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the issue body's evidence section. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 12:18:03 -04:00
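The EOF handling described in `0ef549fadb` can be illustrated without `trio`. The following is a minimal sketch using only the stdlib `socket` module; `drain_fixed()` and `demo()` are hypothetical helpers, not the actual patch in `tractor.trionics.patches`. It only mirrors the described idea: leave the drain loop on `BlockingIOError` and also when `recv()` returns `b''`.

```python
import socket


def drain_fixed(sock: socket.socket, bufsize: int = 2**16) -> int:
    """Drain a non-blocking socket; stop on would-block OR on EOF."""
    total = 0
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            # Nothing left to read right now: the normal exit path.
            break
        if not data:
            # Peer closed its end: `recv()` keeps returning b'' from here on,
            # so looping without this check would never terminate.
            break
        total += len(data)
    return total


def demo() -> tuple[int, int]:
    """Drain once with the writer open, once after it was closed."""
    wakeup_sock, write_sock = socket.socketpair()
    wakeup_sock.setblocking(False)
    write_sock.setblocking(False)

    write_sock.send(b'xyz')
    drained_while_open = drain_fixed(wakeup_sock)

    write_sock.close()
    drained_after_close = drain_fixed(wakeup_sock)

    wakeup_sock.close()
    return drained_while_open, drained_after_close
```

Without the `if not data` branch the second `drain_fixed()` call would spin, which matches the failure mode the commit message reports for a closed write-end.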
Gud Boi	5a9926fc32	Adjust `test_shield_pause` for capsys backends Under `main_thread_forkserver` the bootstrapping hook switches to `--capture=sys`, so subactor fd-level output (tree dumps, zombie-reaper msgs) isn't captured per-test by pexpect. Gate those expects behind a `no_capfd` check so the test passes on both capture modes. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-01 19:08:55 -04:00
Gud Boi	72a0465c52	Default `--ll` to `None` in test harness Only override `tractor.log._default_loglevel` when the flag is explicitly passed — lets per-spawn and per-example `loglevel` kwargs take effect instead of being clobbered by the hard-coded `'ERROR'` default. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-01 00:18:18 -04:00
Gud Boi	9431a81d37	Update debug examples + harden `test_debugger` Pass explicit `loglevel` to `spawn()` calls in `test_debugger` tests — required for pexpect pattern matching now that examples no longer hard-code log levels. Also, - make `expect()` return the decoded `before` str. - add `start_method` param + fork-backend timeout slack (+4s) in nested-error test. - clean up debug examples: drop unused loglevels, rename `n` -> `an`, fix docstrings, add TODO comments for tpt parametrize via osenv. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-01 00:13:22 -04:00
Gud Boi	fc2e298a29	Update `sync_bp` + tighten `test_pause_from_sync` Add `disable_pdbp_color()` to the `sync_bp` example to suppress pygments prompt coloring when `PYTHON_COLORS=0` — makes pexpect pattern matching deterministic. Deats, - set `loglevel='pdb'` in both script + test spawn. - disable `enable_stack_on_sig` in example, assert no `stackscope` output in test. - update `attach_patts` keys/values with `\|_<Task` / `\|_<Thread` / `\|_('subactor'` prefixes to match actual tree-dump format. - add call-site patterns (`tractor.pause_from_sync()` `tractor.pause()`, `breakpoint(hide_tb=...)`). - trim trailing `\n` from `Lock.repr()` output. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-30 20:54:50 -04:00
Gud Boi	e2b790a70d	Fix `SIGUSR1` tree-dump ordering in `_stackscope` Factor the sub-actor relay loop out of `dump_tree_on_sig()` into `_relay_sig_to_subactors()` and chain both dump + relay in a single `run_sync_soon` callback (`_dump_then_relay`) so the parent's task-tree flushes BEFORE any sub receives the signal — fixes a hierarchical-ordering race where subs could dump ahead of the parent in the muxed pty stream. Also, - gate file/tty sink writes behind `write_file` + `write_tty` params on `dump_task_tree()`. - use `actor.aid.uid` instead of deprecated `.uid`. - update `test_shield_pause` expects to match the new sequential parent -> relay-log -> sub ordering. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-30 19:35:55 -04:00
Gud Boi	61d4525137	Add `pytest_load_initial_conftests()` for `--capture=` Move `--capture=sys` enforcement from a static ini flag to a `pytest_load_initial_conftests()` bootstrap hook that dynamically flips capture mode only when a fork-based spawner (like `main_thread_forkserver`) is detected; non-fork backends keep `--capture=fd`. Also, - load `tractor._testing.pytest` via `-p` in ini (bc bootstrapping hooks must register before conftest `pytest_plugins` runs). - register `_reap` as sub-plugin via `pytest_plugins` tuple in `._testing.pytest`. - drop now-duplicate reap fixtures (already in `_reap` per `1cdc7fb3`). - rename `tractor_enable_stackscope` dest -> `enable_stackscope` and pop env var on disable. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-30 19:29:51 -04:00
Gud Boi	486249d74f	Allow per-call `start_method`/`loglevel` overrides In `tests/devx/conftest.py::spawn`, refactor the fixture-internal closures so consumer tests can pass explicit `start_method`/`loglevel` to each `_spawn()` invocation rather than only inheriting the fixture-scoped parametrize values. Deats, - promote `set_spawn_method()` and `set_loglevel()` to take their respective values as fn params (vs closing over the fixture-scope vars). - give `_spawn()` `start_method=start_method` and `loglevel: str\|None = None` kwargs so callers override one-off without re-parametrizing the suite. NOTE: this drops the implicit fixture-scoped `loglevel` forward — `_spawn()` callers now must pass `loglevel=...` explicitly. - TODO: figure out how `--ll <level>` should map to the default (currently `None` → uses env-var or tractor default). - add a docstring to `_spawn()` so its role as the consumer-facing closure is obvious from `help()`. Also, - `assert_before()` now returns the `.before` output on success (was `None`); add a one-line docstring describing the new return contract. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-30 14:17:41 -04:00
Gud Boi	b7115fc875	Drop test-local timeouts, +`sync_pause` to dev In `pyproject.toml`, - include the `sync_pause` group from `dev`, so dev installs ship `greenback` for `pause_from_sync()`. Comment out per-test `@pytest.mark.timeout(...)` markers in, - `tests/devx/test_debugger.py` - `tests/discovery/test_registrar.py` - `tests/spawn/test_main_thread_forkserver.py` - `tests/spawn/test_subint_cancellation.py` - `tests/test_advanced_streaming.py` - `tests/test_cancellation.py` The global cap was already dropped (`3c366cac`); these were the leftover per-test caps which now block interactive `pdb` flows under the new spawn backends. In `uv.lock`, - pull `greenback` into the resolved `dev` deps (per the `sync_pause` include above). - catch up the prior `xonsh` editable→PyPI switch (from the `pyproject.toml` `tool.uv.sources` edit). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-29 18:10:40 -04:00
Gud Boi	208e7c0926	Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars Add env-var overrides inside `._root.open_root_actor()` so devs/test-runs can swap the actor-spawn backend or crank console verbosity without touching application code. In `._root.open_root_actor()`, - read `TRACTOR_LOGLEVEL` early, overriding any caller-passed `loglevel` and stashing an `env_ll_report` to emit once the console log is set up. - pull the `loglevel` fallback (`or _default_loglevel`) and `log.get_console_log()` init up so the env-var report routes through tractor's own logger. - read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed `start_method` and warn-logging when the env-var clobbers an explicit caller value. Wire the same vars through `tests/devx/conftest.py::spawn`, - request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL` and `TRACTOR_SPAWN_METHOD` in `os.environ` before each `pexpect.spawn()` (inherited by the example subproc). - expand `supported_spawners` to include `main_thread_forkserver` and `subint_forkserver` bc example scripts no longer need per-script CLI plumbing. - pop both vars in fixture teardown so a leaked value can't re-route a later in-process tractor test's spawn-backend or loglevel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-29 17:29:38 -04:00
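The override precedence described in `208e7c0926` (env-var beats the caller-passed value, with a report when an explicit value is clobbered) can be sketched as a small pure function. This is not the code in `._root.open_root_actor()`; `resolve_overrides()` and its `reports` list are hypothetical stand-ins, and only the two env-var names come from the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_overrides(
    loglevel: Optional[str] = None,
    start_method: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> tuple[Optional[str], Optional[str], list[str]]:
    """Apply env-var overrides on top of caller-passed values."""
    reports: list[str] = []

    env_ll = env.get('TRACTOR_LOGLEVEL')
    if env_ll:
        # Stash a report so it can be emitted once logging is set up.
        reports.append(f'TRACTOR_LOGLEVEL={env_ll!r} overrides loglevel={loglevel!r}')
        loglevel = env_ll

    env_sm = env.get('TRACTOR_SPAWN_METHOD')
    if env_sm:
        if start_method is not None and start_method != env_sm:
            # Only worth a warning when an explicit caller value is clobbered.
            reports.append(
                f'TRACTOR_SPAWN_METHOD={env_sm!r} overrides start_method={start_method!r}'
            )
        start_method = env_sm

    return loglevel, start_method, reports
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.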
Gud Boi	2917b74ba4	Add todo for running `test_debugger` suite on forkserver spawner	2026-04-29 12:49:36 -04:00
Gud Boi	383b0fdd75	Backend-aware `fail_after` in pub/sub test Mirror `060f7d24`'s pattern (backend-aware timeout in `maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard `trio.fail_after` cap. Fork-based backends pay per-spawn fork+IPC-handshake cost which stacks over `cpus - 1` sequential `n.run_in_actor()` calls; empirically 12s flakes on `main_thread_forkserver` under UDS cross-pytest contention (#451 / #452). Defaults: - `main_thread_forkserver` → 30s - everything else → 12s (unchanged) Hoist the timeout-pick out of the `main()` closure so the dispatch happens once in the trio task rather than re-evaluating per spawn. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-29 10:28:48 -04:00
Gud Boi	060f7d24c4	Backend-aware timeout in `maybe_expect_raises` Default `timeout` from `int = 3` → `int\|None = None`; when unset, pick a backend-aware value. Fork-based backends (`main_thread_forkserver`) need real headroom bc actor spawn + IPC ctx-exit + msg-validation error path is much heavier than under `trio` backend — especially under cross-pytest-stream contention (#451). Defaults: - `main_thread_forkserver` → 30s - everything else → 3s (unchanged) Empirical flake history that motivated 30s as the floor on fork backends (all from `test_basic_payload_spec`): - 3s → all-valid variant flaked w/ `TooSlowError` - 8s → `invalid-return` variant flaked w/ `Cancelled` (surfaced instead of `MsgTypeError` bc the outer `fail_after` fired mid-error-path) - 15s → flaked under cross-pytest-stream contention 30s gives plenty of headroom while still failing-loud on a genuine hang. Callers can opt out by passing an explicit `timeout=` kw. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-29 10:21:56 -04:00
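The default-picking rule from `060f7d24c4` is small enough to sketch directly. The function and constant names below are hypothetical (the real logic lives inside `maybe_expect_raises`); the 30s and 3s values and the "explicit `timeout=` wins" behaviour are taken from the commit message.

```python
from typing import Optional

# Per-backend defaults as listed in the commit message.
BACKEND_TIMEOUTS: dict[str, int] = {
    'main_thread_forkserver': 30,
}
DEFAULT_TIMEOUT: int = 3


def pick_timeout(start_method: str, timeout: Optional[int] = None) -> int:
    """Return the caller's timeout if given, else a backend-aware default."""
    if timeout is not None:
        # Explicit opt-out: the caller knows better.
        return timeout
    return BACKEND_TIMEOUTS.get(start_method, DEFAULT_TIMEOUT)
```

Keeping the mapping in one table means a new fork-based backend only needs one extra entry rather than another branch.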
Gud Boi	530160fa69	Use `trio.fail_after` cap in `test_dynamic_pub_sub` Drop `@pytest.mark.timeout(...)` for the per-test wall-clock cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)` inside `main()` instead. Both pytest-timeout enforcement modes are incompatible with trio under fork-based backends: - `method='signal'` (SIGALRM) synchronously raises `Failed` in trio's main thread mid-`epoll.poll()`, leaving `GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got abandoned") so EVERY subsequent `trio.run()` in the same pytest process bails with `RuntimeError: Attempted to call run() from inside a run()` — full-session poison. - `method='thread'` calls `_thread.interrupt_main()` which can let the KBI escape trio's `KIManager` under fork-cascade teardown races and bubble out of pytest entirely — kills the whole session. `trio.fail_after()` keeps cancellation inside the trio loop: - Raises `TooSlowError` cleanly through the open-nursery's cancel cascade. - Doesn't disturb any out-of-band signal/thread state. - Failure stays scoped to the single test — no cross-test global state corruption either way. Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub` go from 5/10 fail (with global-state poison) to 3/10 fail (no poison, all sibling tests still pass). The ~30% remaining flake rate is a genuine fork-cancel-cascade hang — separate from this fix but no longer contaminates. Module-level NOTE comment explains the rationale so future readers don't re-introduce the bug. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 23:25:04 -04:00
Gud Boi	b376eb0332	Add opt-in `reap_subactors_per_test` fixture Function-scoped, NON-autouse zombie-subactor reaper for modules whose teardown is known-leaky enough to cascade-fail every following test in a session. Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The session-scoped one fires at session end — too late to save tests that follow a hung/leaky test in the suite. The new fixture, opted into via `pytestmark = pytest.mark.usefixtures(...)`, runs between tests in a problem-module so a leftover subactor from test N can't squat on registrar ports / UDS paths / shm segments needed by tests N+1, N+2, ... Intentionally NOT autouse — the fixture's presence on a module signals "this module's teardown leaks; please root-cause instead of relying forever on cleanup". A visibility-vs-convenience trade picked in favor of the former. Apply to `tests/test_infected_asyncio.py` since both recent full-suite runs (parallel-tpt-proto + TCP-only) showed the cascade originating in this file's KBI- and SIGINT-flavored tests under `main_thread_forkserver`. Module-comment names the specific offenders so future de-flake work has a starting point. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 21:41:02 -04:00
Gud Boi	cbdf1eb6db	Guard `subint_forkserver` stub against re-alias Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier regression guard that pins down the variant-2 reservation contract: the `'subint_forkserver'` key in `_spawn._methods` MUST raise `NotImplementedError` today, not silently dispatch to `main_thread_forkserver_proc`. The transient alias-state existed briefly during the rename (commit `57dae0e4`'s "Split forkserver backend into variant 1/2 mods" landed the alias; `5e83881f` flipped it to the stub). Without a guard, a future refactor could easily re-collapse the two keys back to a single coro and silently break the variant-1 / variant-2 contract. Also asserts the stub's error msg surfaces the two pointers an operator hitting it actually needs: - `'main_thread_forkserver'` — the working backend they prolly meant, - `'msgspec#1026'` — the upstream blocker that has to land before variant-2 can ship. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 20:06:44 -04:00
Gud Boi	205382a39b	Sweep `subint_forkserver` → `main_thread_forkserver` in code After the variant-1 / variant-2 backend split, update remaining string-match refs to the variant-1 backend so user-visible gates + skip-marks + comments name the working backend correctly: - `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include `main_thread_forkserver`, drop the stub-only `subint_forkserver` entry. - `tests/test_spawning.py::test_loglevel_propagated_to_subactor`: capfd-skip flips to `main_thread_forkserver`. - `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`: xfail-condition flips to `main_thread_forkserver`. - `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`" reason-text since the `mp.SharedMemory(track=False)` + resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass; the skip-mark only fires on plain `subint` now. - Comment / docstring sweep: `runtime._state`, `runtime._runtime`, `_testing.pytest`, `_subint.py`, `pyproject.toml`, `test_cancellation.py`, `test_registrar.py` — refs to variant-1 backend updated. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 19:55:37 -04:00
Gud Boi	9f0709eee2	Migrate test/smoketest imports + rename test file Rename `tests/spawn/test_subint_forkserver.py` → `test_main_thread_forkserver.py` and migrate its imports + internal refs to the new canonical names: - `fork_from_worker_thread`, `wait_child` → from `tractor.spawn._main_thread_forkserver`. - `run_subint_in_worker_thread` → still from `_subint_forkserver` (variant-2 primitive). - Module docstring + tier-3 fixture + the `*_spawn_basic` test fn renamed for variant-1-honesty. - Orphan-harness subprocess argv flipped from `'subint_forkserver'` → `'main_thread_forkserver'`. `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split the same way. `tractor/spawn/_subint_forkserver.py` drops the backward-compat re-exports of the fork primitives — the only consumers (test file + smoketest) now import from `_main_thread_forkserver` directly. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 19:47:44 -04:00
Gud Boi	5e83881f10	Add `subint_forkserver_proc` stub, flip dispatch, prune Reduce `_subint_forkserver.py` to its variant-2 placeholder shape: - Add `subint_forkserver_proc` async stub raising `NotImplementedError` with a redirect msg pointing at the working variant-1 backend (`main_thread_forkserver`), jcrist/msgspec#1026 (upstream PEP 684 blocker), and #379 (subint umbrella). - `tractor.spawn._spawn._methods['subint_forkserver']` now dispatches to the stub instead of aliasing the variant-1 coroutine — `--spawn-backend=subint_forkserver` errors cleanly. - Drop now-dead module-scope: `ChildSigintMode` / `_DEFAULT_CHILD_SIGINT` defs, `_has_subints` try/except (replaced with import from `._subint`), unused imports (`partial`, `Literal`, `sys`, msgtypes/pretty_struct, `current_actor`, `cancel_on_completion`/`soft_kill`, `_server` TYPE_CHECKING). - Backward-compat re-exports of fork primitives kept until the follow-up commit migrates external test imports. - `tests/spawn/test_subint_forkserver.py::forkserver_spawn_method` fixture: flip hardcoded `'subint_forkserver'` → `'main_thread_forkserver'` so the test still exercises the working backend (full file rename comes in the test-import migration commit). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 19:36:08 -04:00
Gud Boi	66f1941f46	Wire `reg_addr` into `test_context_stream_semantics` Same wire-up pattern as the prior `test_dynamic_pub_sub` commit: each test that already pulled in `debug_mode` now also pulls in `reg_addr` and passes `registry_addrs=[reg_addr]` into `tractor.open_nursery()`, so the suite's standard registry-addr conventions apply. Tests touched: - `test_started_misuse` - `test_simple_context` - `test_parent_cancels` - `test_one_end_stream_not_opened` - `test_maybe_allow_overruns_stream` - `test_ctx_with_self_actor` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 13:52:28 -04:00
Gud Boi	9b05f659b3	Wire `test_dynamic_pub_sub` to standard fixtures Pull in the `reg_addr`, `debug_mode`, and `test_log` fixtures so this test follows the same conventions as the rest of the suite: - pass `registry_addrs=[reg_addr]` + `debug_mode` into `tractor.open_nursery()` (so `--tpdb` etc work). - after the `pytest.raises` block, add `assert err` + `test_log.exception('Timed out AS EXPECTED')` so the expected timeout is logged explicitly instead of swallowed. Also, - drop whitespace-only blank lines around the `subs` param of `consumer()` and `ctx` param of `one_task_streams_and_one_handles_reqresp()`. - promote `test_sigint_both_stream_types`'s one-line docstring to multi-line form. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 12:59:00 -04:00
Gud Boi	65fcfbf224	Bump `test_stale_entry_is_deleted`'s timeout to 30 Seems that when run in-suite it delays more than the so-measured "happy path" timing; better to have no suite-global interruption than asserting a fast single test's run.	2026-04-27 11:46:45 -04:00
Gud Boi	aa3e230926	Fix `SharedMemory` under `subint_forkserver` Implements the resolution described in c99d475d's `subint_forkserver_mp_shared_memory_issue.md` (now updated with the resolution post-mortem). Two-part fix that side-steps `mp.resource_tracker` entirely rather than try to make it fork-safe — turns out that's both simpler AND more correct given tractor already SC-manages allocation lifetimes. Deats, - `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches now run unconditionally: * monkey-patch `mp.resource_tracker._resource_tracker` to a no-op `ManTracker` subclass (empty `register` / `unregister` / `ensure_running`). * return `partial(SharedMemory, track=False)` for the per-allocation opt-out. * belt + suspenders: even if something dodges the wrapper, the singleton can't talk to the inherited (broken) parent fd. - `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional skip of the unlink-callback; install a `try_unlink()` wrapper that swallows `FileNotFoundError` (sibling-already-cleaned race in shared-key setups). Without `mp.resource_tracker` doing it for us, we own the unlink — `actor.lifetime_stack` is the right place since tractor already controls actor lifecycle. - `tests/test_shm.py`: uncomment-out `subint_forkserver` from the module-level skip-list (tests pass now). Inline comment cross-refs the two `_mp_bs` / `_shm` workarounds. - `ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`: heavy rewrite — flips status from "open / unresolvable in tractor" to "resolved, kept as decision record". 
Adds Resolution section, "Why this is the right call" rationale (mp tracker is widely criticized; tractor already owns lifecycle), trade-offs (crash-leaked segments, lost mp leak warning), verification (7 passed under both `subint_forkserver` and `trio` backends), and upstream issue links (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 10:51:28 -04:00
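The "swallow `FileNotFoundError` on unlink" idea from `aa3e230926` can be sketched generically. This is not the `try_unlink()` in `tractor/ipc/_shm.py`; the helper below takes any zero-argument unlink callable, and the demo uses a temp file instead of a shared-memory segment so it runs the same way regardless of Python version or resource-tracker state. In the commit's setting the callable would be the shm segment's unlink, registered on the actor's lifetime stack.

```python
import os
import tempfile
from typing import Callable


def try_unlink(unlink: Callable[[], None]) -> bool:
    """Run `unlink()`; return False if a sibling already removed the target."""
    try:
        unlink()
    except FileNotFoundError:
        # Someone else won the cleanup race; nothing left for us to do.
        return False
    return True


def demo() -> tuple[bool, bool]:
    """Unlink the same path twice, as two racing owners would."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    first = try_unlink(lambda: os.unlink(path))
    second = try_unlink(lambda: os.unlink(path))
    return first, second
```

Returning a bool rather than `None` is a choice made for this sketch so the race outcome is observable; the commit message does not say what the real wrapper returns.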
Gud Boi	c99d475d03	Document `SharedMemory` × `subint_forkserver` incompat New `ai/conc-anal/` doc: `mp.SharedMemory` is fork-without-exec unsafe — child inherits parent's `resource_tracker` fd → EBADF on first shm op; leaked `/shm_list` cascades `FileExistsError` across parametrize variants. Canonical CPython issue class, NOT a tractor bug. Includes two longer-term mitigation paths (reset inherited tracker fd vs migrate off `mp.shared_memory`). Also, update `tests/test_shm.py`: - comment out `subint_forkserver` from skip list - rewrite reason with precise failure-mode descriptions + link to the analysis doc (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-26 20:13:24 -04:00
Gud Boi	44bdb1697c	Tighten orphan-SIGINT xfail to `strict=True` Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT hang as the documented cause. The previous framing ("env pollution") let the test silently pass when ordering happened to favor it; the new framing forces an XPASS-as-FAIL the moment the underlying gap is actually closed, so we can drop the mark intentionally instead of accidentally. Reason text + leading `# Known-gap test —` comment both point at `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` for the full diagnosis. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-24 22:48:35 -04:00
Gud Boi	2ca0f41e61	Skip `test_loglevel_propagated_to_subactor` on subint forkserver too	2026-04-24 21:47:46 -04:00
Gud Boi	b350aa09ee	Wire `reg_addr` through infected-asyncio tests Continues the hygiene pattern from `de601676` (cancel tests) into `tests/test_infected_asyncio.py`: many tests here were calling `tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing on the default `:1616` registry across sessions. Thread the session-unique `reg_addr` through so leaked or slow-to-teardown subactors from a prior test can't cross-pollute. Deats, - add `registry_addrs=[reg_addr]` to `open_nursery()` calls in suite where missing. - `test_sigint_closes_lifetime_stack`: - add `reg_addr`, `debug_mode`, `start_method` fixture params - `delay` now reads the `debug_mode` param directly instead of calling `tractor.debug_mode()` (fires slightly earlier in the test lifecycle) - sanity assert `if debug_mode: assert tractor.debug_mode()` after nursery open - new print showing SIGINT target (`send_sigint_to` + resolved pid) - catch `trio.TooSlowError` around `ctx.wait_for_result()` and conditionally `pytest.xfail` when `send_sigint_to == 'child' and start_method == 'subint_forkserver'` — the known orphan-SIGINT limitation tracked in `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` - parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-24 20:26:25 -04:00
Gud Boi	d6e70e9de4	Import-or-skip `.devx.` tests requiring `greenback` Which is for sure true on py3.14+ rn since `greenlet` didn't want to build for us (yet).	2026-04-24 17:39:13 -04:00
Gud Boi	4c133ab541	Default `pytest` to use `--capture=sys` Lands the capture-pipe workaround from the prior cluster of diagnosis commits: switch pytest's `--capture` mode from the default `fd` (redirects fd 1,2 to temp files, which fork children inherit and can deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd 1,2 left alone). Trade-off documented inline in `pyproject.toml`: - LOST: per-test attribution of raw-fd output (C-ext writes, `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI capture, just not per-test-scoped in the failure report. - KEPT: `print()` + `logging` capture per-test (tractor's logger uses `sys.stderr`). - KEPT: `pytest -s` debugging behavior. This allows us to re-enable `test_nested_multierrors` without skip-marking + clears the class of pytest-capture-induced hangs for any future fork-based backend tests. Deats, - `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of rationale comment cross-ref'ing the post-mortem doc - `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')` from `test_nested_multierrors` — no longer needed. * file-level `pytestmark` covers any residual. - `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail mark loosened from `strict=True` to `strict=False` + reason rewritten. * it passes in isolation but is session-env-pollution sensitive (leftover subactor PIDs competing for ports / inheriting harness FDs). * tolerate both outcomes until suite isolation improves. - `test_shm`: extend the existing `skipon_spawn_backend('subint', ...)` to also skip `'subint_forkserver'`. * Different root cause from the cancel-cascade class: `multiprocessing.SharedMemory`'s `resource_tracker` + internals assume fresh-process state, don't survive fork-without-exec cleanly - `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test (unrelated to forkserver; just a flaky-under-load bump). 
- `tractor.spawn._subint_forkserver`: inline comment-only future-work marker right before `_actor_child_main()` describing the planned conditional stdout/stderr-to-`/dev/null` redirect for cases where `--capture=sys` isn't enough (no code change — the redirect logic itself is deferred). EXTRA NOTEs ----------- The `--capture=sys` approach is the minimum-invasive fix: just a pytest ini change, no runtime code change, works for all fork-based backends, trade-offs well-understood (terminal-level capture still happens, just not pytest's per-test attribution of raw-fd output). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-24 14:17:23 -04:00
Gud Boi	eceed29d4a	Pin forkserver hang to pytest `--capture=fd` Sixth and final diagnostic pass — after all 4 cascade fixes landed (FD hygiene, pidfd wait, `_parent_chan_cs` wiring, bounded peer-clear), the actual last gate on `test_nested_multierrors[subint_forkserver]` turned out to be pytest's default `--capture=fd` stdout/stderr capture, not anything in the runtime cascade. Empirical result: `pytest -s` → test PASSES in 6.20s. Default `--capture=fd` → hangs forever. Mechanism: pytest replaces the parent's fds 1,2 with pipe write-ends it reads from. Fork children inherit those pipes (since `_close_inherited_fds` correctly preserves stdio). The error-propagation cascade in a multi-level cancel test generates 7+ actors each logging multiple `RemoteActorError` / `ExceptionGroup` tracebacks — enough output to fill Linux's 64KB pipe buffer. Writes block, subactors can't progress, processes don't exit, `_ForkedProc.wait` hangs. Self-critical aside: I earlier tested w/ and w/o `-s` and both hung, concluding "capture-pipe ruled out". That was wrong — at that time fixes 1-4 weren't all in place, so the test was failing at deeper levels long before reaching the "produce lots of output" phase. Once the cascade could actually tear down cleanly, enough output flowed to hit the pipe limit. Order-of-operations mistake: ruling something out based on a test that was failing for a different reason. 
Deats, - `subint_forkserver_test_cancellation_leak_issue.md`: new section "Update — VERY late: pytest capture pipe IS the final gate" w/ DIAG timeline showing `trio.run` fully returns, diagnosis of pipe-fill mechanism, retrospective on the earlier wrong ruling-out, and fix direction (redirect subactor stdout/stderr to `/dev/null` in fork-child prelude, conditional on pytest-detection or opt-in flag) - `tests/test_cancellation.py`: skip-mark reason rewritten to describe the capture-pipe gate specifically; cross-refs the new doc section - `tests/spawn/test_subint_forkserver.py`: the orphan-SIGINT test regresses back to xfail. Previously passed after the FD-hygiene fix, but the new `wait_for_no_more_peers(move_on_after=3.0)` bound in `async_main`'s teardown added up to 3s latency, pushing orphan-subactor exit past the test's 10s poll window. Real fix: faster orphan-side teardown OR extend poll window to 15s. No runtime code changes in this commit — just test-mark adjustments + doc wrap-up. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 23:18:14 -04:00
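The pipe-fill mechanism described in this commit can be observed in isolation. The following is a minimal stdlib-only sketch, not code from the repo: `measure_pipe_capacity()` is a hypothetical helper that writes to a pipe nobody reads until the kernel refuses more bytes. The exact capacity is platform-dependent, so no specific byte count is asserted here.

```python
import os


def measure_pipe_capacity() -> int:
    """Count how many bytes an unread pipe accepts before it is full."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode makes a full pipe raise instead of hanging forever.
    os.set_blocking(write_fd, False)
    chunk = b'x' * 4096
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                # A blocking writer (e.g. a child process logging a
                # traceback to an inherited capture pipe) would park
                # here until someone drains the read end.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == '__main__':
    print(f'pipe filled after {measure_pipe_capacity()} bytes')
```

In the scenario above the reader exists (pytest) but is not draining while the test body runs, so a blocking write from a forked child stalls in the same place.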
Gud Boi	506617c695	Skip-mark + narrow `subint_forkserver` cancel hang Two-part stopgap for the still-hanging `test_nested_multierrors[subint_forkserver]`: 1. Skip-mark the test via `@pytest.mark.skipon_spawn_backend('subint_forkserver', reason=...)` so it stops blocking the test matrix while the remaining bug is being chased. The reason string cross-refs the conc-anal doc for full context. 2. Update the conc-anal doc (`subint_forkserver_test_cancellation_leak_issue.md`) with the empirical state after the three nested-cancel fix commits (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan shield break) landed, narrowing the remaining hang from "everything broken" to "peer-channel loops don't exit on `service_tn` cancel". Deats from the DIAGDEBUG instrumentation pass, - 80 `process_messages` ENTERs, 75 EXITs → 5 stuck - ALL 40 `shield=True` ENTERs matched EXIT — the `_parent_chan_cs.cancel()` wiring from `57935804` works as intended for shielded loops. - the 5 stuck loops are all `shield=False` peer-channel handlers in `handle_stream_from_peer` (inbound connections handled by `stream_handler_tn`, which IS `service_tn` in the current config). - after `_parent_chan_cs.cancel()` fires, NEW shielded loops appear on the session reg_addr port — probably discovery-layer reconnection; doesn't block teardown but indicates the cascade has more moving parts than expected. The remaining unknown: why don't the 5 peer-channel loops exit when `service_tn.cancel_scope.cancel()` fires? They're not shielded, they're inside the service_tn scope, a standard cancel should propagate through. Some fork-config-specific divergence keeps them alive. Doc lists three follow-up experiments (stackscope dump, side-by-side `trio_proc` comparison, audit of the `tractor/ipc/_server.py:448` `except trio.Cancelled:` path). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	c20b05e181	Use `pidfd` for cancellable `_ForkedProc.wait` Two coordinated improvements to the `subint_forkserver` backend: 1. Replace `trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)` in `_ForkedProc.wait()` with `trio.lowlevel.wait_readable(pidfd)`. The prior version blocked a trio cache thread on a sync syscall — outer cancel scopes couldn't unwedge it when something downstream got stuck. Same pattern `trio.Process.wait()` and `proc_waiter` (the mp backend) already use. 2. Drop the `@pytest.mark.xfail(strict=True)` from `test_orphaned_subactor_sigint_cleanup_DRAFT` — the test now PASSES after `0cd0b633` (fork-child FD scrub). Same root cause as the nested-cancel hang: inherited IPC/trio FDs were poisoning the child's event loop. Closing them lets SIGINT propagation work as designed. Deats, - `_ForkedProc.__init__` opens a pidfd via `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+) - `wait()` parks on `trio.lowlevel.wait_readable()`, then non-blocking `waitpid(WNOHANG)` to collect the exit status (correct since the pidfd signal IS the child-exit notification) - `ChildProcessError` swallow handles the rare race where someone else reaps first - pidfd closed after `wait()` completes (one-shot semantics) + `__del__` belt-and-braces for unexpected-teardown paths - test docstring's `@xfail` block replaced with a `# NOTE` comment explaining the historical context + cross-ref to the conc-anal doc; test remains in place as a regression guard The two changes are interdependent — the cancellable `wait()` matters for the same nested-cancel scenarios the FD scrub fixes, since the original deadlock had trio cache workers wedged in `os.waitpid` swallowing the outer cancel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
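The "park on the pidfd, then reap with `WNOHANG`" sequence listed above can be sketched without trio. This is an illustrative stdlib-only version, not the `_ForkedProc` implementation: it swaps `select.select()` in for `trio.lowlevel.wait_readable()`, and `wait_via_pidfd()` and `spawn_child_exiting_with()` are hypothetical names. It assumes Linux 5.3+ and Python 3.9+.

```python
import os
import select
from typing import Optional


def wait_via_pidfd(pid: int, timeout: float = 5.0) -> Optional[int]:
    """Return the child's exit code, or None on timeout / lost reap race."""
    pidfd = os.pidfd_open(pid)
    try:
        # The pidfd turns readable once the process has exited. An async
        # runtime would await readability here, which is what makes the
        # wait cancellable; select() is only a stand-in for that step.
        readable, _, _ = select.select([pidfd], [], [], timeout)
        if not readable:
            return None
        try:
            # Readability already signalled the exit, so a non-blocking
            # waitpid is enough to collect the status.
            _, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            # Rare race: someone else reaped the child first.
            return None
        return os.waitstatus_to_exitcode(status)
    finally:
        # One-shot semantics: the pidfd is closed once the wait is done.
        os.close(pidfd)


def spawn_child_exiting_with(code: int) -> int:
    """Fork a child that exits immediately with `code`; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)
    return pid
```

The point of the design is that no thread sits inside a blocking `os.waitpid()`; the only blocking step is a readiness wait that an event loop can abandon.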
Gud Boi	1af2121057	Wire `reg_addr` through leaky cancel tests Stopgap companion to `d0121960` (`subint_forkserver` test-cancellation leak doc): five tests in `tests/test_cancellation.py` were running against the default `:1616` registry, so any leaked `subint-forkserv` descendant from a prior test holds the port and blows up every subsequent run with `TooSlowError` / "address in use". Thread the session-unique `reg_addr` fixture through so each run picks its own port — zombies can no longer poison other tests (they'll only cross-contaminate whatever happens to share their port, which is now nothing). Deats, - add `reg_addr: tuple` fixture param to: - `test_cancel_infinite_streamer` - `test_some_cancels_all` - `test_nested_multierrors` - `test_cancel_via_SIGINT` - `test_cancel_via_SIGINT_other_task` - explicitly pass `registry_addrs=[reg_addr]` to the two `open_nursery()` calls that previously had no kwargs at all (in `test_cancel_via_SIGINT` and `test_cancel_via_SIGINT_other_task`) - add bounded `@pytest.mark.timeout(7, method='thread')` to `test_nested_multierrors` so a hung run doesn't wedge the whole session Still doesn't close the real leak — the `subint_forkserver` backend's `_ForkedProc.kill()` is PID-scoped not tree-scoped, so grandchildren survive teardown regardless of registry port. This commit is just blast-radius containment until that fix lands. See `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	1e357dcf08	Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg Also, some slight touchups in `.spawn._subint`.	2026-04-23 18:48:34 -04:00
Gud Boi	f5f37b69e6	Shorten some timeouts in `subint_forkserver` suites	2026-04-23 18:48:34 -04:00
Gud Boi	a72deef709	Refine `subint_forkserver` orphan-SIGINT diagnosis Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is not "trio can't install a handler on a non-main thread" (the original hypothesis from the `child_sigint` scaffold commit). On py3.14: - `threading.current_thread() is threading.main_thread()` IS True post-fork — CPython re-designates the fork-inheriting thread as "main" correctly - trio's `KIManager` SIGINT handler IS installed in the subactor (`signal.getsignal(SIGINT)` confirms) - the kernel DOES deliver SIGINT to the thread But `faulthandler` dumps show the subactor wedged in `trio/_core/_io_epoll.py::get_events` — trio's wakeup-fd mechanism (which turns SIGINT into an epoll-wake) isn't firing. So the `except KeyboardInterrupt` at `tractor/spawn/_entry.py::_trio_main:164` — the runtime's intentional "KBI-as-OS-cancel" path — never fires. Deats, - new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (+385 LOC): full writeup — TL;DR, symptom reproducer, the "intentional cancel path" the bug defeats, diagnostic evidence (`faulthandler` output + `getsignal` probe), ruled-out hypotheses (non-main-thread issue, wakeup-fd inheritance, KBI-as-trio-check-exception), and fix directions - `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` + test docstring rewritten to match the refined understanding — old wording blamed the non-main-thread path, new wording points at the `epoll_wait` wedge + cross-refs the new conc-anal doc - `_subint_forkserver` module docstring's `child_sigint='trio'` bullet updated: now notes trio's handler is already correctly installed, so the flag may end up a no-op / doc-only mode once the real root cause is fixed Closing the gap aligns with existing design intent (make the already-designed "KBI-as-OS-cancel" behavior actually fire), not a new feature. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	76605d5609	Add DRAFT `subint_forkserver` orphan-SIGINT test Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT` documents an empirical SIGINT-delivery gap in the `subint_forkserver` backend: when the parent dies via `SIGKILL` (no IPC `Portal.cancel_actor()` possible) and `SIGINT` is sent to the orphan child, the child DOES NOT unwind — CPython's default `KeyboardInterrupt` is delivered to `threading.main_thread()`, whose tstate is dead in the post-fork child bc fork inherited the worker thread, not main. Trio running on the fork-inherited worker thread therefore never observes the signal. Marked `xfail(strict=True)` so the mark flips to XPASS→fail once the backend grows explicit SIGINT plumbing. Deats, - harness runs the failure-mode sequence out-of-process: 1. harness subprocess runs a fresh Python script that calls `try_set_start_method('subint_forkserver')` then opens a root actor + one `sleep_forever` subactor 2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers off harness `stdout` to confirm IPC handshake completed 3. `SIGKILL` the parent, `proc.wait()` to reap the zombie (otherwise `os.kill(pid, 0)` keeps reporting it alive) 4. assert the child survived the parent-reap (i.e. was actually orphaned, not reaped too) before moving on 5. 
`SIGINT` the orphan child, poll `os.kill(child_pid, 0)` every 100ms for up to 10s - supporting helpers: `_read_marker()` with per-proc bytes-buffer to carry partial lines across calls, `_process_alive()` liveness probe via `kill(pid, 0)` - Linux-only via `platform.system() != 'Linux'` skip — orphan-reparenting semantics don't generalize to other platforms - port offset (`reg_addr[1] + 17`) so the harness listener doesn't race concurrently-running backend tests - best-effort `finally:` cleanup: `SIGKILL` any still-alive pids + `proc.kill()` + bounded `proc.wait()` to avoid leaking orphans across the session Also, tier-4 header comment documents the cross-backend generalization path: applicable to any multi-process backend (`trio`, `mp_spawn`, `mp_forkserver`, `subint_forkserver`), NOT to plain `subint` (in-process subints have no orphan OS-child). Move path: lift harness into `tests/_orphan_harness.py`, parametrize on session `_spawn_method`, add `skipif _spawn_method == 'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
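The liveness probe and bounded poll described in the harness steps can be sketched as follows. This is an assumed shape, not the test module's actual `_process_alive()` helper: `process_alive()` and `wait_for_exit()` are hypothetical names, with the 100ms interval and 10s ceiling taken from the description above.

```python
import os
import time


def process_alive(pid: int) -> bool:
    """Probe a pid with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    # Note: an unreaped zombie still counts as alive here, which is why
    # the harness reaps the killed parent before probing it.
    return True


def wait_for_exit(pid: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until `pid` is gone; return False if it outlives `timeout`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not process_alive(pid):
            return True
        time.sleep(interval)
    return not process_alive(pid)
```

A caller would send `SIGINT` to the orphan and then assert `wait_for_exit(child_pid)`; using a deadline rather than a fixed iteration count keeps the bound stable even if individual probes are slow.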
Gud Boi	26914fde75	Wire `subint_forkserver` as first-class backend Promote `_subint_forkserver` from primitives-only into a registered spawn backend: `'subint_forkserver'` is now a `SpawnMethodKey` literal, dispatched via `_methods` to the new `subint_forkserver_proc()` target, feature-gated under the existing `subint`-family py3.14+ case, and selectable via `--spawn-backend=subint_forkserver`. Deats, - new `subint_forkserver_proc()` spawn target in `_subint_forkserver`: - mirrors `trio_proc()`'s supervision model — real OS subprocess so `Portal.cancel_actor()` + `soft_kill()` on graceful teardown, `os.kill(SIGKILL)` on hard-reap (no `_interpreters.destroy()` race to fuss over bc the child lives in its own process) - only real diff from `trio_proc` is the spawn mechanism: fork from a main-interp worker thread via `fork_from_worker_thread()` (off-loaded to trio's thread pool) instead of `trio.lowlevel.open_process()` - child-side `_child_target` closure runs `tractor._child._actor_child_main()` with `spawn_method='trio'` — the child is a regular trio actor, "subint_forkserver" names how the parent spawned, not what the child runs - new `_ForkedProc` class — thin `trio.Process`-compatible shim around a raw OS pid: `.poll()` via `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio cache thread, `.kill()` via `SIGKILL`, `.returncode` cached for repeat calls. `.stdin`/`.stdout`/`.stderr` are `None` (fork-w/o-exec inherits parent FDs; we don't marshal them) which matches `soft_kill()`'s `is not None` guards Also, new backend-tier test `test_subint_forkserver_spawn_basic` drives the registered backend end-to-end via `open_root_actor` + `open_nursery` + `run_in_actor` w/ a trivial portal-RPC round-trip. Uses a `forkserver_spawn_method` fixture to flip `_spawn_method`/`_ctx` for the test's duration + restore on teardown (so other session-level tests don't observe the global flip). 
Test module docstring reworked to describe the three tiers now covered: (1) primitive-level, (2) parent-trio-driven primitives, (3) full registered backend. Status: still-open work (tracked on `tractor#379`) doc'd inline in the module docstring — no cancel/hard-kill stress coverage yet, child-side subint-hosted root runtime still future (gated on `msgspec#563`), thread-hygiene audit pending the same unblock. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	25e400d526	Add trio-parent tests for `_subint_forkserver` New pytest module `tests/spawn/test_subint_forkserver.py` drives the forkserver primitives from inside a real `trio.run()` in the parent — the runtime shape tractor will actually use when we wire up a `subint_forkserver` spawn backend proper. Complements the standalone no-trio-in-parent `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`. Deats, - new test pkg `tests/spawn/` (+ empty `__init__.py`) - two tests, both `@pytest.mark.timeout(30, method='thread')` for the GIL-hostage safety reason doc'd in `ai/conc-anal/subint_sigint_starvation_issue.md`: - `test_fork_from_worker_thread_via_trio` — parent-side plumbing baseline. `trio.run()` off-loads forkserver prims via `trio.to_thread.run_sync()` + asserts the child reaps cleanly - `test_fork_and_run_trio_in_child` — end-to-end: forked child calls `run_subint_in_worker_thread()` with a bootstrap str that does `trio.run()` in a fresh subint - both tests wrap the inner `trio.run()` in a `dump_on_hang()` for post-mortem if the outer `pytest-timeout` fires - intentionally NOT using `--spawn-backend` — the tests drive the primitives directly rather than going through tractor's spawn-method registry (which the forkserver isn't plugged into yet) Also, rename `run_trio_in_subint()` → `run_subint_in_worker_thread()` for naming consistency with the sibling `fork_from_worker_thread()`. The action is really "host a subint on a worker thread", not specifically "run trio" — trio just happens to be the typical payload. Propagate the rename to the smoketest. Further, add a "TODO — cleanup gated on msgspec PEP 684 support" section to the `_subint_forkserver` module docstring: flags the dedicated-`threading.Thread` design as potentially-revisable once isolated-mode subints are viable in tractor. Cross-refs `msgspec#563` + `tractor#379` and points at an audit-plan conc-anal doc we'll add next. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	4b2a0886c3	Mark `subint`-hanging tests with `skipon_spawn_backend` Adopt the `@pytest.mark.skipon_spawn_backend('subint', reason=...)` marker (`a617b521`) across the suites reproducing the `subint` GIL-contention / starvation hang classes doc'd in `ai/conc-anal/subint_*_issue.md`. Deats, - Module-level `pytestmark` on full-file-hanging suites: - `tests/test_cancellation.py` - `tests/test_inter_peer_cancellation.py` - `tests/test_pubsub.py` - `tests/test_shm.py` - Per-test decorator where only one test in the file hangs: - `tests/discovery/test_registrar.py::test_stale_entry_is_deleted` — replaces the inline `if start_method == 'subint': pytest.skip` branch with a declarative skip. - `tests/test_subint_cancellation.py::test_subint_non_checkpointing_child`. - A few per-test decorators are left commented-in-place as breadcrumbs for later finer-grained unskips. Also, some nearby tidying in the affected files: - Annotate loose fixture / test params (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in `tests/conftest.py`, `tests/devx/conftest.py`, and `tests/test_cancellation.py`. - Normalize `"""..."""` → `'''...'''` docstrings per repo convention on a few touched tests. - Add `timeout=6` / `timeout=10` to `@tractor_test(...)` on `test_cancel_infinite_streamer` and `test_some_cancels_all`. - Drop redundant `spawn_backend` param from `test_cancel_via_SIGINT`; use `start_method` in the `'mp' in ...` check instead. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	985ea76de5	Skip `test_stale_entry_is_deleted` hanger with `subint`s	2026-04-23 18:47:49 -04:00
Gud Boi	189f4e3ffc	Wall-cap `subint` audit tests via `pytest-timeout` Add a hard process-level wall-clock bound on the two known-hanging subint-backend tests so an unattended suite run can't wedge indefinitely in either of the hang classes doc'd in `ai/conc-anal/`. Deats, - New `testing` dep: `pytest-timeout>=2.3`. - `test_stale_entry_is_deleted`: `@pytest.mark.timeout(3, method='thread')`. The `method='thread'` choice is deliberate — `method='signal'` routes via `SIGALRM` which is starved by the same GIL-hostage path that drops `SIGINT` (see `subint_sigint_starvation_issue.md`), so it'd never actually fire in the starvation case. - `test_subint_non_checkpointing_child`: same decorator, same reasoning (defense-in-depth over the inner `trio.fail_after(15)`). At timeout, `pytest-timeout` hard-kills the pytest process itself — that's the intended behavior here; the alternative is the suite never returning. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	4a3254583b	Doc `subint` backend hang classes + arm `dump_on_hang` Classify and write up the two distinct hang modes hit during Phase B subint bringup (issue #379) so future triage doesn't re-derive them from scratch. Deats, two new `ai/conc-anal/` docs, - `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 (jcrist/msgspec#563) - `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on an orphaned IPC channel after subint teardown — no clean EOF delivered to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown Cross-link the docs from their test reproducers, - `test_stale_entry_is_deleted` (→ starvation class): wrap `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression captures a stack dump. Kept un-skipped so the dump file is inspectable - `test_subint_non_checkpointing_child` (→ delivery class): extend docstring with a "KNOWN ISSUE" block pointing at the analysis. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	2ed5e6a6e8	Add `subint` cancellation + hard-kill test audit Lock in the escape-hatch machinery added to `tractor.spawn._subint` during the Phase B.2/B.3 bringup (issue #379) so future stdlib regressions or our own refactors don't silently re-introduce the mid-suite hangs. Deats, - `test_subint_happy_teardown`: baseline — spawn a subactor, one portal RPC, clean teardown. If this breaks, something's wrong unrelated to the hard-kill shields. - `test_subint_non_checkpointing_child`: cancel a subactor stuck in a non-checkpointing Python loop (`threading.Event.wait()` releases the GIL but never inserts a trio checkpoint). Validates the bounded-shield + daemon-driver-thread combo abandons the thread after `_HARD_KILL_TIMEOUT`. Every test is wrapped in `trio.fail_after()` for a deterministic per-test wall-clock ceiling (an unbounded audit would defeat itself) and arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump — pytest's stderr capture swallows `faulthandler` output by default. Gated via `pytest.importorskip('concurrent.interpreters')` and a module-level skip when `--spawn-backend` isn't `'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	03bf2b931e	Skip `.ipc._ringbuf` import when `cffi` is missing	2026-04-23 18:47:49 -04:00
Gud Boi	d2ea8aa2de	Handle py3.14+ incompats as test skips Since we're devving subints we require the 3.14+ stdlib API and a couple compiled libs don't support it yet, namely: - `cffi`, which we're only using for the `.ipc._linux` eventfd stuff (now factored into `hotbaud` anyway). - `greenback`, which requires `greenlet` which doesn't seem to be wheeled yet * on nixos the sdist build was failing due to lack of `g++` which i don't care to figure out rn since we don't need `.devx` stuff immediately for this subints prototype. * [ ] we still need to adjust any dependent suites to skip. Adjust `test_ringbuf` to skip on import failure. Also project wide, - pin us to py 3.13+ in prep for last-2-minor-version policy. - drop `msgspec>=0.20.0`, the first release with py3.14 support.	2026-04-23 18:47:49 -04:00
Gud Boi	3867403fab	Scale `test_open_local_sub_to_stream` timeout by CPU factor Import and apply `cpu_scaling_factor()` from `conftest`; bump base from 3.6 -> 4 and multiply through so CI boxes with slow CPUs don't flake. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-16 20:03:32 -04:00
Gud Boi	ed65301d32	Fix misc bugs caught by Copilot review Deats, - use `proc.poll() is None` in `sig_prog()` to distinguish "still running" from exit code 0; drop stale `breakpoint()` from fallback kill path (would hang CI). - add missing `raise` on the `RuntimeError` in `async_main()` when no tpt bind addrs given. - clean up stale uid entries from the registrar `_registry` when addr eviction empties the addr list. - update `discovery.__init__` docstring to match the new eager `._multiaddr` import. - fix `registar` -> `registrar` typo in teardown report log msg. Review: PR #429 (Copilot) https://github.com/goodboy/tractor/pull/429 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-14 19:54:15 -04:00
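The `proc.poll() is None` fix in the first bullet guards against a common `subprocess` pitfall: `Popen.poll()` returns `None` while the process runs and the integer exit code afterwards, so a clean exit (`0`) is falsy just like `None`. The sketch below is illustrative only. The truthiness check in `still_running_buggy()` is an assumed form of the earlier bug, not the actual prior code in `sig_prog()`, and all three function names are hypothetical.

```python
import subprocess
import sys


def still_running_buggy(proc: subprocess.Popen) -> bool:
    # Truthiness conflates None (running) with 0 (exited cleanly).
    return not proc.poll()


def still_running(proc: subprocess.Popen) -> bool:
    # Only None means the process has not exited yet.
    return proc.poll() is None


def finished_process() -> subprocess.Popen:
    """Run a trivial Python child to completion and return its handle."""
    proc = subprocess.Popen([sys.executable, '-c', 'pass'])
    proc.wait()
    return proc


if __name__ == '__main__':
    proc = finished_process()
    print('buggy check says running:', still_running_buggy(proc))
    print('fixed check says running:', still_running(proc))
```

With the buggy form, a fallback kill path would be taken for a process that had already exited successfully.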
Gud Boi	cd287c7e93	Fix `test_registrar_merge_binds_union` for UDS collision `get_random()` can produce the same UDS filename for a given pid+actor-state, so the "disjoint addrs" premise doesn't always hold. Gate the `len(bound) >= 2` assertion on whether the registry and bind addrs actually differ via `expect_disjoint`. Also, - drop unused `partial` import (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-14 19:54:15 -04:00


687 Commits (c4885f9d9986c27d0853e7e3ccabe205daeed41c)