tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	9fb1c4ccc0	Mk `--capture` guard CI-aware w/ local warn Refactor `pytest_load_initial_conftests()` to split the fork-spawn × capture-mode check into two policies: - CI (`CI` env-var set): `pytest.exit(rc=2)` on mismatch — forces every matrix-row to declare `--capture=sys` explicitly. - local: `warnings.warn()` + continue — lets devs experiment with `--capture=fd` to validate fixes. Deats, - drop `_cap_fd_set` global; add `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the spawner-name lookup - move inline comment wall → proper docstring w/ Background, Trade-off, Validation-policy sections - `maybe_xfail_for_spawner()` now takes `request: pytest.FixtureRequest` and reads `request.config.option.capture` instead of the `_cap_sys_passed_as_flag` global - recognize `tee-sys` as fork-safe (only `fd`-level capture deadlocks) - `set_fork_aware_capture()` returns the actual capture mode str from config, not a hardcoded `'sys'` - lift `import warnings` to module level (was duped inside `pytest_configure`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `255c9c3a7c`)	2026-06-09 23:24:18 -04:00
Gud Boi	c0f5bd2915	Mk per-test reap fixtures opt-in Rename `_track_orphaned_uds_per_test` and `_detect_runaway_subactors_per_test` to public names (drop `_` prefix), drop `autouse=True`. Tests that need per-test reap blame now opt in via `pytestmark = pytest.mark.usefixtures(...)`. Also, - reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper per pid. - add empty-`only_pids` fast-path in `find_runaway_subactors` to skip psutil import when no descendants were spawned. - extract `new_pids` intermediate var for clarity. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `e4953851de`)	2026-06-09 23:24:18 -04:00
Gud Boi	32a7ead862	Use single f-string per pid in runaway warning (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `086e9f2c07`)	2026-06-09 23:24:18 -04:00
Gud Boi	35e8880075	Add per-test runaway-subactor CPU detector to `_reap` New `find_runaway_subactors()` helper + autouse `_detect_runaway_subactors_per_test` fixture that samples `psutil.cpu_percent()` on descendants to catch tight-loop bugs (e.g. #452-class `recvfrom` on a closed socket). Checks both at setup (leftovers from a prior hung test) and teardown (spawned by this test). Intentionally does NOT kill the runaway — emits a loud warning with diag commands (`strace`, `lsof`, `ss`, `kill`) so the pid stays alive for hands-on investigation. Session-end reaper still SIGINT/SIGKILL survivors on normal exit. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `5cf0312c78`)	2026-06-09 23:24:18 -04:00
Gud Boi	eb89db81a5	Fix `maybe_override_capture` to not get invalid capX fixture names.. (cherry picked from commit `32e89c67ee`)	2026-06-09 23:24:18 -04:00
Gud Boi	dd1d6cd51e	Add fork-aware capture fixtures to `_testing.pytest` Extend the pytest plugin with helpers that detect and adapt to `--capture=sys` under fork-based spawners (`main_thread_forkserver`, `mp_forkserver`) where fd-capture causes hangs. Deats, - track `_cap_sys_passed_as_flag` + `_cap_fd_set` globals in `pytest_load_initial_conftests()`. - add `@pytest.hookimpl(tryfirst=True)` + re-parse args after appending `--capture=sys`. - `_is_forking_spawner()` predicate + fixture. - `maybe_xfail_for_spawner()`: enables skipping tests that need capsys but weren't passed `--capture=sys`. - `set_fork_aware_capture` fixture: returns the appropriate capture fixture per spawner backend based on `start_method: str` set via CLI. - wire `set_fork_aware_capture` into `tractor_test` wrapper's fixture injection. Also, - add `alert_on_finish` session fixture (terminal bell on completion; tho not sure it works fully..) - add `ids=` to `start_method` parametrize. - restore `default=False` on `--enable-stackscope`. - drop commented-out `--ll` option block; we will likely factor it to our plugin eventually however.. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `d549c72052`)	2026-06-09 23:24:18 -04:00
Gud Boi	82e25c442a	Add `pytest_load_initial_conftests()` for `--capture=` Move `--capture=sys` enforcement from a static ini flag to a `pytest_load_initial_conftests()` bootstrap hook that dynamically flips capture mode only when a fork-based spawner (like `main_thread_forkserver`) is detected; non-fork backends keep `--capture=fd`. Also, - load `tractor._testing.pytest` via `-p` in ini (bc bootstrapping hooks must register before conftest `pytest_plugins` runs). - register `_reap` as sub-plugin via `pytest_plugins` tuple in `._testing.pytest`. - drop now-duplicate reap fixtures (already in `_reap` per `1cdc7fb3`). - rename `tractor_enable_stackscope` dest -> `enable_stackscope` and pop env var on disable. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `61d4525137`)	2026-06-09 23:24:18 -04:00
Gud Boi	90c46288ad	Add `--uds`/`--uds-only` flags to `tractor-reap` Wire up `find_orphaned_uds()` + `reap_uds()` from `_reap` as a new phase-3 UDS sweep in the CLI script. Opt-in via `--uds` (run after proc reap + shm) or `--uds-only` (skip other phases). Also, - consolidate skip-proc-reap logic into a single `skip_proc_reap` bool covering both `--shm-only` and `--uds-only` - extend header docstring + usage examples (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `0996a83655`)	2026-06-09 23:24:18 -04:00
Gud Boi	053051535f	Add UDS orphan-sweep helpers + reap fixtures to `_reap` Extend the `_testing._reap` mod with UDS sock-file leak detection + cleanup, complementing the existing shm and subactor-process reaping: - `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`, `reap_uds()` — detect `<name>@<pid>.sock` files under `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including the `1616` registry sentinel). - `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT lingering subactors, wait, SIGKILL survivors, then sweep orphaned UDS files. - `_track_orphaned_uds_per_test` fn-scoped autouse fixture: snapshot sock-file dir before/after each test, warn + reap new orphans to prevent cascade flakiness under `--tpt-proto=uds`. - `reap_subactors_per_test` opt-in fn-scoped fixture for modules with known-leaky teardown. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `1cdc7fb302`)	2026-06-09 23:24:18 -04:00
Gud Boi	4c50d610e6	Flip back to default `pytest` capture for CI (cherry picked from commit `22cdf15b73`)	2026-06-09 23:24:18 -04:00
Gud Boi	9a844b91f3	Drop global `pytest-timeout` cap from `pyproject.toml` `timeout = 200` was firing via SIGALRM (the default `method='signal'`) which synchronously raises `Failed` in trio's main thread mid-`epoll.poll()`, abandoning trio's runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-installed. EVERY subsequent `trio.run()` in the same pytest session then bails with `RuntimeError: Attempted to call run() from inside a run()`. Empirical impact: a session that hits a single 200s hang cascades into 30-40 false-positive failures across every downstream test file that uses `trio.run`. Recent UDS run saw 1 real timeout (`test_unregistered_err_still_relayed`) poison 38 sibling tests with cascade-fails, a debugging nightmare. Same architectural bug we already documented in `tests/test_advanced_streaming.py::test_dynamic_pub_sub` (see its module-level NOTE): both `pytest-timeout` enforcement modes are incompatible with trio under fork-based spawn backends. Now scoped session-wide. For tests that legitimately need a wall-clock cap, the canonical pattern is `with trio.fail_after(N):` INSIDE the test; trio's own `Cancelled` machinery cleanly unwinds the actor nursery without disturbing global state. For CI: rely on job-level wall-clock timeouts (e.g. GitHub Actions `timeout-minutes`) to abort genuinely-stuck suites. `pyproject.toml` comment block spells this all out so a future contributor doesn't reach back for `timeout =` and re-introduce the bug. ALSO, bump `xonsh` to at least `0.23.0` release. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `3c366cac13`) (factored: the xonsh pin/editable-source hunks already landed with the devenv segment)	2026-06-09 23:24:18 -04:00
Gud Boi	dcb00e5a8f	Return parent `pid: int` from new `reap_subactors_per_test` fixture (cherry picked from commit `f8178df0fd`)	2026-06-09 23:24:18 -04:00
Gud Boi	94d233a2f7	Add opt-in `reap_subactors_per_test` fixture Function-scoped, NON-autouse zombie-subactor reaper for modules whose teardown is known-leaky enough to cascade-fail every following test in a session. Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The session-scoped one fires at session end, too late to save tests that follow a hung/leaky test in the suite. The new fixture, opted into via `pytestmark = pytest.mark.usefixtures(...)`, runs between tests in a problem-module so a leftover subactor from test N can't squat on registrar ports / UDS paths / shm segments needed by tests N+1, N+2, ... Intentionally NOT autouse: the fixture's presence on a module signals "this module's teardown leaks; please root-cause instead of relying forever on cleanup". A visibility-vs-convenience trade picked in favor of the former. Apply to `tests/test_infected_asyncio.py` since both recent full-suite runs (parallel-tpt-proto + TCP-only) showed the cascade originating in this file's KBI- and SIGINT-flavored tests under `main_thread_forkserver`. Module-comment names the specific offenders so future de-flake work has a starting point. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `b376eb0332`)	2026-06-09 23:24:18 -04:00
Gud Boi	bac291dff4	Fix `_testing.addr.get_rando_addr` cross-process collisions Previously the random port was a default-arg expression (`_rando_port: str = random.randint(1000, 9999)`) — evaluated ONCE at module import time, making it a per-process singleton. Two parallel pytest sessions had a 1/9000 birthday-pair chance of picking the same port; when it hit, every `reg_addr`-using test in BOTH runs would cascade-fail with "Address already in use". Switch to per-call `random.randint()` salted with `os.getpid()` so: - within one session: two calls return distinct ports — e.g. `test_tpt_bind_addrs::bind-subset-reg` now actually gets two different reg addrs on the TCP backend (it was silently duplicating before), - across parallel sessions: pid salt biases each process's port choices apart, making cross-run collisions vanishingly rare. Drop the bogus `: str` annotation (was always `int`). UDS already gets per-process isolation via `UDSAddress.get_random()`'s `@<pid>` socket-path suffix, so no change needed there. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `7c5dd4d033`)	2026-06-09 23:24:18 -04:00
Gud Boi	7bcb30f6a6	Add `--shm` orphan sweep to `tractor-reap` Since `tractor.ipc._mp_bs.disable_mantracker()` turns off `mp.resource_tracker` entirely (see the conc-anal doc `subint_forkserver_mp_shared_memory_issue.md`), a hard-crashing actor can leave `/dev/shm/<key>` segments that nothing else GCs. New `tractor-reap` phase 2 sweeps them. Deats, - `tractor/_testing/_reap.py`: add `find_orphaned_shm()` + `reap_shm()` helpers. Match criteria: regular file under `/dev/shm`, owned by current uid, AND no live proc has it open (mmap'd or fd-held). In-use enumeration via `psutil.Process.memory_maps()` + `.open_files()` — xplatform, kernel-canonical (same answer `lsof` would give), no reliance on tractor-specific shm-key naming. - `_ensure_shm_supported()` guard: helpers raise `NotImplementedError` outside Linux/FreeBSD bc macOS POSIX shm has no fs-visible path (`shm_open` only) and Windows is a different story. - `scripts/tractor-reap`: new `--shm` (run after process reap) and `--shm-only` (skip process phase) flags. `-n` dry-runs both phases. Exit code is `1` if either phase had survivors/errors. - `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to the `testing` dep group; lazy-imported in `_reap.py` so the process-reap path stays import-clean without it. Also, - doc `--shm` in `.claude/skills/run-tests/SKILL.md` (new section 10c) — covers match criteria + the preservation guarantee for unrelated apps. - flip mitigation status in `subint_forkserver_mp_shared_memory_issue.md` from "could extend `tractor-reap`" to "implemented", with a note that callers should still UUID-pin shm keys to avoid cross-session collisions. Verified locally vs 81 in-use segments held by `piker`, `lttng-ust-`, `aja-shm-` — all preserved; only the genuinely-orphaned tractor segments got unlinked. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `4f12d69b41`) (factored: dropped subint_forkserver conc-anal doc update)	2026-06-09 23:24:18 -04:00
Gud Boi	6de96b508f	Add `tractor-reap` CLI + document auto-reap New `scripts/tractor-reap` CLI wraps the `_testing._reap` mod for manual zombie-subactor cleanup after crashed pytest sessions. Two modes: - orphan-mode (default): finds PPid==1 procs with cwd matching repo root + `python` in cmdline. - descendant-mode (`--parent <pid>`): scoped sweep under a still-live supervisor. SC-polite: SIGINT with bounded grace window (default 3s) before escalating to SIGKILL. Exit code signals whether escalation was needed (useful for CI health-checks). Also, document both the auto-reap fixture and the CLI in `/run-tests` SKILL.md (section 10). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `6d76b60404`)	2026-06-09 23:24:18 -04:00
Gud Boi	34e28cd2e7	Add `_testing._reap` + auto-reap fixture Zombie-subactor cleanup for the test suite, SC-polite discipline (`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts: a shared reaper module + an autouse session-end fixture that runs it. Deats, - new `tractor/_testing/_reap.py` (+230 LOC): Linux-only reaper using `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes: - `find_descendants(parent_pid)` for the in-session case (PPid-direct-match while pytest is still alive). - `find_orphans(repo_root)` for the CLI / post-mortem case (`PPid==1` reparented to init + `cwd` filter to repo root + `python` cmdline filter). - `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT all, poll up to `grace` for exit, SIGKILL any survivors. Returns `(signalled, killed)` for caller-side reporting. - new `_reap_orphaned_subactors` session-scoped autouse fixture in `tractor/_testing/pytest.py`: after `yield`, runs `find_descendants(os.getpid())` + `reap(...)` so each pytest session leaves no surviving forks. - companion CLI scaffolding lives at `scripts/tractor-reap` (separate commit) for the pytest-died-mid-session case where the in-session fixture didn't get to run. Also, - promote `from tractor.spawn._spawn import SpawnMethodKey` to module-top in `pytest.py` (was inline-imported inside `pytest_generate_tests`), and reuse it in `pytest_collection_modifyitems` to assert each `skipon_spawn_backend` mark arg is a valid spawn-method literal; catches typos at collection time. - inline `# ?TODO` flags running these through the `try_set_backend` checker for stronger validation. Cross-refs `feedback_sc_graceful_cancel_first.md` for the SIGINT-before-SIGKILL discipline rationale. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `eae478f3d5`)	2026-06-09 23:24:18 -04:00
Gud Boi	66029c3732	Default `pytest` to use `--capture=sys` Lands the capture-pipe workaround from the prior cluster of diagnosis commits: switch pytest's `--capture` mode from the default `fd` (redirects fd 1,2 to temp files, which fork children inherit and can deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr`; fd 1,2 left alone). Trade-off documented inline in `pyproject.toml`: - LOST: per-test attribution of raw-fd output (C-ext writes, `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI capture, just not per-test-scoped in the failure report. - KEPT: `print()` + `logging` capture per-test (tractor's logger uses `sys.stderr`). - KEPT: `pytest -s` debugging behavior. This allows us to re-enable `test_nested_multierrors` without skip-marking + clears the class of pytest-capture-induced hangs for any future fork-based backend tests. Deats, - `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of rationale comment cross-ref'ing the post-mortem doc - `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')` from `test_nested_multierrors`; no longer needed. * file-level `pytestmark` covers any residual. - `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail mark loosened from `strict=True` to `strict=False` + reason rewritten. * it passes in isolation but is session-env-pollution sensitive (leftover subactor PIDs competing for ports / inheriting harness FDs). * tolerate both outcomes until suite isolation improves. - `test_shm`: extend the existing `skipon_spawn_backend('subint', ...)` to also skip `'subint_forkserver'`. * Different root cause from the cancel-cascade class: `multiprocessing.SharedMemory`'s `resource_tracker` + internals assume fresh-process state, don't survive fork-without-exec cleanly. - `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test (unrelated to forkserver; just a flaky-under-load bump).
- `tractor.spawn._subint_forkserver`: inline comment-only future-work marker right before `_actor_child_main()` describing the planned conditional stdout/stderr-to-`/dev/null` redirect for cases where `--capture=sys` isn't enough (no code change; the redirect logic itself is deferred). EXTRA NOTEs ----------- The `--capture=sys` approach is the minimum-invasive fix: just a pytest ini change, no runtime code change, works for all fork-based backends, trade-offs well-understood (terminal-level capture still happens, just not pytest's per-test attribution of raw-fd output). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `4c133ab541`) (factored: dropped spawn-backend-only paths: tests/spawn/test_subint_forkserver.py + tractor/spawn/_subint_forkserver.py; the xfail-loosening bullet above no longer applies) (factored: the test-file mark adjustments ride with the test-hardening segment)	2026-06-09 23:24:18 -04:00
Gud Boi	f0716962c6	Add `skipon_spawn_backend` pytest marker A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...], reason='...')` marker for backend-specific known-hang / -borked cases — avoids scattering `@pytest.mark.skipif(lambda ...)` branches across tests that misbehave under a particular `--spawn-backend`. Deats, - `pytest_configure()` registers the marker via `addinivalue_line('markers', ...)`. - New `pytest_collection_modifyitems()` hook walks each collected item with `item.iter_markers( name='skipon_spawn_backend')`, checks whether the active `--spawn-backend` appears in `mark.args`, and if so injects a concrete `pytest.mark.skip( reason=...)`. `iter_markers()` makes the decorator work at function, class, or module (`pytestmark = [...]`) scope transparently. - First matching mark wins; default reason is `f'Borked on --spawn-backend={backend!r}'` if the caller doesn't supply one. Also, tighten type annotations on nearby `pytest` integration points — `pytest_configure`, `debug_mode`, `spawn_backend`, `tpt_protos`, `tpt_proto` — now taking typed `pytest.Config` / `pytest.FixtureRequest` params. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `3b26b59dad`)	2026-06-09 23:24:18 -04:00
Gud Boi	fe25e2c448	Add global 200s `pytest-timeout` (cherry picked from commit `5998774535`)	2026-06-09 23:24:18 -04:00
Gud Boi	a0e2c08119	Wall-cap `test_stale_entry_is_deleted` via `pytest-timeout` Add a hard process-level wall-clock bound on a test known to wedge un-Ctrl-C-ably under an in-dev spawn backend, so an unattended suite run can't hang indefinitely. Deats, - New `testing` dep: `pytest-timeout>=2.3`. - `test_stale_entry_is_deleted`: `@pytest.mark.timeout(3, method='thread')`. The `method='thread'` choice is deliberate — `method='signal'` routes via `SIGALRM` which can be starved by the same GIL-hostage path that drops `SIGINT`, so it'd never actually fire in the starvation case. At timeout, `pytest-timeout` hard-kills the pytest process itself — that's the intended behavior here; the alternative is the suite never returning. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 189f4e3f72e9f1eda5d24bcbab5743f7e35bd913) (factored: kept pyproject + tests/discovery/test_registrar.py parts of "Wall-cap `subint` audit tests via `pytest-timeout`"; dropped tests/test_subint_cancellation.py)	2026-06-09 23:24:18 -04:00
Gud Boi	0e03c6815b	Add `supervise_run_process` to `trionics._subproc` A `trio.Nursery.start()`-style wrapper around `trio.run_process()` that surfaces rc!=0 errors deterministically, ALWAYS isolates the parent controlling-tty, and optionally live-relays the child's std-streams to `log.<level>` per-line. Suits both short-lived test-runners + long-lived daemons. `supervise_run_process()`, - Deterministic rc!=0: pass `check=False` to `trio` and do our OWN post-drain rc-check from the supervisor coro body AFTER `own_tn.__aexit__` — NOT inside the internal nursery, since that would race-cancel the still-draining relay reader and lose stderr lines. (Re)build + raise a BARE `subprocess.CalledProcessError`: `.stderr=` for programmatic callers + an `add_note()`'d `\|_.stderr:` block for human teardown logs. No nursery-eg-wrapped CPE to `collapse_eg` around. - Parent controlling-tty isolation: `stdin=DEVNULL` always, `stdout=DEVNULL` unless relayed/overridden (via `stdout=` kwarg w/ `_UNSET` sentinel so explicit `None` = inherit still works). Prevents a spawned program from clobbering the launching tty's scrollback w/ control-seqs. - Live per-line relay: `relay_stdout=True`/ `relay_stderr=True` → relayed to `log.<relay_level>` (default `'io'`, our custom level 21). Picked to sort just above stdlib `INFO`=20 so it shows at usual `info`/`devx` levels yet stays separately filterable; `runtime`=15 was REJECTED as a default since it'd be silently filtered at usual verbosity — footgun for daemon supervisors whose whole point is visibility. STREAMED, not buffered-until-exit. - Non-blocking `tn.start()` semantics: live `trio.Process` handed up via `task_status.started()` immediately (else `tn.start()` would block till child exit, losing the long-lived-daemon use case). Supervise/relay bg tasks run to completion in this coro. 
- `*run_process_kwargs` forwarded verbatim (env, shell, cwd, start_new_session, executable, ...); MANAGED keys (`stdin`/`stdout`/`stderr`/`check`) win on conflict. - Crash-handling layer intentionally NOT baked in — compose `maybe_open_crash_handler()` ON TOP at the call-site. `_relay_stream_lines()` helper, - Concurrent pipe-drain reader. MANDATORY whenever piping w/o `capture_` since nothing else drains the OS pipe — child blocks on `write()` once kernel buf (~64KiB) fills → deadlock. - Modes (combine freely): `emit`-only live relay, `accum`-only silent drain+capture (for the CPE note), or both. Per-line splitting handles cross-chunk residuals + flushes any trailing un-newline-term'd line at EOF. `_add_stderr_note()` helper, - Attaches an indented `\|_.stderr:` note to a CPE via `add_note()` for legible rc!=0 reporting at teardown. Tests (`tests/trionics/test_subproc.py`), - Hermetic `trio`-only (no actor-runtime). - `test_stdout_relayed_per_line`: per-line stdout relay. - `test_parent_tty_isolated`: child fd1 is OUR pipe (no `/dev/pts/*`), fd0 pinned to `/dev/null`. - `test_no_deadlock_on_big_unnewlined_output`: 200KiB no-newline output completes under `fail_after(2)` — exercises the concurrent drain (without it, the child blocks at ~64KiB). - `test_stderr_relay_and_cpe_rebuild`: rc!=0 w/ `relay_stderr=True` → bare `CalledProcessError` w/ the `.stderr` note + per-line live relay. - `test_nonrelay_cpe_note`: rc!=0 w/o relay → same deterministic post-drain CPE w/ `.stderr` note (silent drain+capture path). Re-export `supervise_run_process` from `tractor.trionics`. Prompt-IO: ai/prompt-io/claude/20260601T231429Z_0e3e008b_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `f595acc76c`)	2026-06-09 23:24:18 -04:00
Gud Boi	9c905b390b	Add `add_log_level()` factory + register `IO`=21 Follow-up to `f595acc7` (`supervise_run_process`) which called `log.io(...)` for std-stream relay assuming an `IO=21` level existed. Add the registration via a new factory + tests covering both the factory and the new level. `add_log_level()` factory, - One call wires the four (otherwise hand-synced) pieces: - `CUSTOM_LEVELS[NAME]` — drives the `stacklevel` bump in `StackLevelAdapter.log()` + `get_logger()`'s per-level audit. - `logging.addLevelName()` — stdlib name registration. - `STD_PALETTE[NAME]` + `BOLD_PALETTE['bold'][NAME]` — color entries consumed by `get_console_log()`'s `ColoredFormatter` build. - Same-named (lowercase) emit method bound on `StackLevelAdapter` so `log.<name>('msg')` works + `get_logger()`'s per-level method audit passes. - Idempotent: re-registering an existing name is a no-op-ish refresh that won't clobber an already-bound method. - Method binding uses a default-arg `_level=value` so the level int is captured (not late-bound across multiple registrations). `IO=21` level (first user), - Purple. Used by `tractor.trionics._subproc`'s std-stream relay (see `f595acc7`). - Value 21 picked to sit just ABOVE stdlib `INFO`=20 so it's SHOWN BY DEFAULT at usual `info`/`devx` console levels — a `runtime`=15 relay would be silently filtered (footgun for daemon supervisors whose whole point is visibility). Still distinctly labeled + filterable. Tests (`tests/test_log_sys.py`), - `test_io_custom_level_registered`: validates the IO level is fully wired (`CUSTOM_LEVELS`, `addLevelName`, both palettes, `StackLevelAdapter.io()` callable); emits a record + sanity-asserts `21 >= INFO(20)`. - `test_add_log_level_pluggable`: registers a fresh `XLVL=19` (cyan) via `add_log_level()`, asserts all four wires + the bound `xlog.xlvl()` emit, then try/finally cleans up the module-global mutations so later `get_logger()` audits don't trip on a half-removed level. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `7bd7dd50c7`)	2026-06-09 23:24:18 -04:00
Gud Boi	d55348de20	Add `logspec` leaf-mod Route B follow-up doc Follow-up note documenting why the deeper "Route B" fix for `LogSpec`/`apply_logspec()` true per-leaf-MODULE level control was NOT taken, in favor of the smaller sub-PACKAGE fix that shipped in `9c36363b`. Doc covers, - Status: what `9c36363b` already gives (per-sub-pkg control at any nesting depth, `devx.debug` ≠ `devx`) vs. what remains unaddressed (per-leaf-mod levels, top-level lib mods like `tractor.to_asyncio` on the root logger). - "Route B" sketch: make logger identity the full dotted module path; mv the cosmetic leaf-trim out of logger-naming into the formatter's `{name}` rendering. - 6 breaking-change costs: every logger name changes, formatter rewrite, propagation/double-emit surface grows, level-inheritance semantics shift, `modden`/`piker` contract churn, `get_logger()` refactor risk. - Migration plan if pursued: extract a pure `_mk_logger_name()` helper w/ an exhaustive name-shape test matrix, swap `get_logger()` to use it for identity, swap formatter to use the display string, golden-diff rendered headers, coordinate w/ downstreams. - "Route A" alternative: a `logging.Filter` keyed on `record.module`/`pathname` for per-leaf control w/o name churn: lower risk, narrower power. - Recommendation: defer Route B; prefer Route A if per-leaf is needed soon; the shipped sub-PKG fix covers the common ask. Lives under `ai/tooling-todos/` since it's a deferred-work decision record, not a triage/conc-anal doc. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `5b3c2e3762`)	2026-06-09 23:24:18 -04:00
Gud Boi	11b9a87077	Fix `get_logger()` collapse of nested sub-pkgs Strip the trailing `pkg_path` token ONLY when it duplicates the caller's leaf-module name (which the console header already shows via `{filename}`), instead of blindly dropping the last token. This keeps genuine, possibly-nested sub-PACKAGE parts addressable as their own sub-loggers. - detect a true leaf-mod by comparing the caller's `__name__` vs `__package__` (a pkg `__init__` has them equal -> its trailing token is a real sub-pkg, NOT a leaf to strip). - `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT from a bare `devx` -> `tractor.devx`; the old unconditional `pkg_path = subpkg_path` collapsed both to `tractor.devx` and silently broke per-sub-pkg level control via the logging-spec. - `get_logger(__name__)` leaf-strip still works (cosmetic, bc the leaf-mod is in the `{filename}` header field). Also, - update the `LogSpec` caveat: sub-PACKAGE granularity now addressable at ANY depth; leaf modules intentionally aren't (they're the `{filename}`); top-level mods (eg. `to_asyncio`) still emit on the root logger. - adjust `test_root_pkg_not_duplicated_in_logger_name` to the new literal explicit-`name` contract (no leaf-collapse). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `9c36363b01`)	2026-06-09 23:24:18 -04:00
Gud Boi	7478478038	Lift `--ll`/`--tl` to plugin + `LogSpec` API Two coupled changes that let downstream projects (eg. `modden`) inherit the test-harness loglevel plumbing for free via `tractor._testing.pytest`: Plugin lift (`tests/conftest.py` → `_testing/pytest.py`), - mv `pytest_addoption(--ll)`, the `loglevel` autouse fixture, and `test_log` fixture out of the test-suite-local conftest into the reusable plugin. - add `--tl`/`--tractor-loglevel` as a DISTINCT flag from `--ll`: `--ll` is the consuming-project's OWN app loglevel (scoped to its pkg-hierarchy), `--tl` is the `tractor.` runtime loglevel. `--tl` falls back to `--ll` when unset (preserves current `tractor`-suite behavior). - add `testing_pkg_name` session fixture (default `'tractor'`); downstream projects override to e.g. `'modden'` so `--ll` scopes to their own hierarchy instead of `tractor.`. - `loglevel` fixture now yields the resolved tractor-runtime level (passed to `open_root_actor(loglevel=<.>)` by `@tractor_test`) AND separately applies `--ll` to the `testing_pkg_name` hierarchy when that isn't `tractor`. `test_log` scopes the per-test logger to `testing_pkg_name`. `tractor.log` "logging-spec" mini-DSL, - `LogSpec = str|bool`. Accepted forms: - `True` → enable `pkg_name` root at `default_level` (fallback `'cancel'`). - `False` → no-op. - bare level eg. `'info'` → root-logger at that level. - `'sub:info,x:cancel'` → per-sub-logger filter-spec; each `<name>` is RELATIVE to `pkg_name` (must NOT include the pkg-token). - `parse_logspec()` → `{sublog|None: level}` mapping. `None` key = root-logger. Mixed bare-level + filters in one spec is rejected w/ a helpful err msg; so is embedding the `pkg_name` token in a sub-name. - `apply_logspec()` → `(primary_level, {name: log})`: parses then enables a `colorlog` stderr handler per named (sub)logger. Authoritative sub-logger filters get `propagate=False` so they don't double-emit through a parallel root-level handler. - !GRANULARITY CAVEAT!
sub-logger names match at sub-pkg granularity, not leaf-module — so `devx.debug` collapses to the same `tractor.devx` logger as a bare `devx`, and top-level lib modules (eg. `tractor.to_asyncio`) emit under the root logger rather than a phantom `to_asyncio` child. Documented inline on `LogSpec`. Other, - `tests/conftest.py` keeps a NOTE pointing to the plugin for future-debugging clarity (don't remove silently — the lift is the relevant signal). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `19a77708ba`)	2026-06-09 23:24:18 -04:00
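The accepted `LogSpec` forms listed in this commit can be sketched as a small parser. This is a minimal illustration assuming only the forms described above; `parse_logspec_sketch` and its error messages are hypothetical and are not the `tractor.log.parse_logspec()` implementation.

```python
from typing import Dict, Optional, Union

LogSpec = Union[str, bool]


def parse_logspec_sketch(
    spec: LogSpec,
    pkg_name: str = 'tractor',
    default_level: str = 'cancel',
) -> Dict[Optional[str], str]:
    '''
    Hypothetical parser for the documented spec forms.

    Returns a `{sublog|None: level}` mapping where the `None`
    key stands for the root logger.

    '''
    if spec is True:
        return {None: default_level}
    if spec is False:
        return {}  # documented as a no-op

    parts = [p.strip() for p in spec.split(',') if p.strip()]
    bare = [p for p in parts if ':' not in p]
    filters = [p for p in parts if ':' in p]

    if bare and filters:
        raise ValueError(
            f'Cannot mix a bare level {bare!r} with filters {filters!r}'
        )
    if len(bare) > 1:
        raise ValueError(f'Only one bare level allowed, got {bare!r}')
    if bare:
        return {None: bare[0]}

    out: Dict[Optional[str], str] = {}
    for part in filters:
        name, _, level = part.partition(':')
        name, level = name.strip(), level.strip()
        # sub-names are RELATIVE to `pkg_name`
        if pkg_name in name.split('.'):
            raise ValueError(
                f'Sub-logger name {name!r} must not include {pkg_name!r}'
            )
        out[name] = level
    return out
```

Under these assumptions a spec such as `'sub:info,x:cancel'` maps each relative sub-logger name to its level, while a bare `'info'` keys the level under `None`.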
Gud Boi	9e09dc5eee	Default `--ll` to `None` in test harness Only override `tractor.log._default_loglevel` when the flag is explicitly passed — lets per-spawn and per-example `loglevel` kwargs take effect instead of being clobbered by the hard-coded `'ERROR'` default. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `72a0465c52`)	2026-06-09 23:24:18 -04:00
Gud Boi	3932daaf4f	Drop `debug_mode` gate on stackscope SIGUSR1 SIGUSR1 task-tree dumps via `stackscope` should work in plain (non-pdb) runs too — esp. in infected-`asyncio` processes where the kernel-default SIGUSR1 disposition is `Term` (proc dies on `kill -USR1` w/o an installed handler). Ungate the install path from `_debug_mode` in both root and sub-actor init; the `use_stackscope` rt-var + `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as the actual opt-in (e.g. via `--enable-stackscope`). Deats, - `_root.open_root_actor`: drop the `debug_mode and ...` conjunction around the `enable_stack_on_sig()` call; now gated only on the `enable_stack_on_sig` arg itself. - `_runtime.Actor` sub-actor init: lift the `use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out of the `if rvs['_debug_mode']:` block to peer-level. The `use_greenback` branch stays inside `_debug_mode` (pdb-specific). - Refresh inline comments on both sites to call out the infected-`asyncio` "default SIGUSR1 = terminate proc" rationale. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `3d9c75b6ed`)	2026-06-09 23:24:18 -04:00
Gud Boi	4da9c3daa8	Add `use_stackscope` runtime var for subactor init Track `stackscope` enablement in `RuntimeVars` so the flag propagates to subactors via the standard rtvar IPC path instead of relying solely on the `TRACTOR_ENABLE_STACKSCOPE` env var. Deats, - add `use_stackscope: bool` to `RuntimeVars` struct + defaults dict - `enable_stack_on_sig()` sets the rtvar on successful `stackscope` import, asserts unset on `ImportError` - nest stackscope init under `_debug_mode` gate in `Actor.async_main`, check rtvar alongside env var - defer `maybe_init_greenback` import to its own `use_greenback` branch (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `48523358cf`)	2026-06-09 23:24:18 -04:00
Gud Boi	c4ec664bfa	Fix `SIGUSR1` tree-dump ordering in `_stackscope` Factor the sub-actor relay loop out of `dump_tree_on_sig()` into `_relay_sig_to_subactors()` and chain both dump + relay in a single `run_sync_soon` callback (`_dump_then_relay`) so the parent's task-tree flushes BEFORE any sub receives the signal — fixes a hierarchical-ordering race where subs could dump ahead of the parent in the muxed pty stream. Also, - gate file/tty sink writes behind `write_file` + `write_tty` params on `dump_task_tree()`. - use `actor.aid.uid` instead of deprecated `.uid`. - update `test_shield_pause` expects to match the new sequential parent -> relay-log -> sub ordering. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `e2b790a70d`)	2026-06-09 23:24:18 -04:00
Gud Boi	14ddd49660	Route `stackscope` SIGUSR1 onto trio loop Signal handlers fire in a non-trio stack frame; calling `stackscope.extract(recurse_child_tasks=True)` from there only walks the `<init>` task and misses everything inside `async_main`'s nurseries — exactly the part you want to see during a hang. Fix: capture `trio.lowlevel.current_trio_token()` at `enable_stack_on_sig()` time and stash it as a module-level `_trio_token`. The SIGUSR1 handler then dispatches the dump onto the trio loop via `_trio_token.run_sync_soon(_safe_dump_task_tree)`, so `stackscope.extract` runs from a real trio-task context and walks the full nursery tree. Late-binding: pytest's `pytest_configure` calls `enable_stack_on_sig()` outside any `trio.run`, so token capture there is a `RuntimeError` — left at `None`. The runtime re-calls `enable_stack_on_sig()` from inside `async_main` (subactor side) where the token IS available, so subactors get the full-tree path. `dump_tree_on_sig` falls back to a direct call when `_trio_token is None` (parent process pre-trio.run, or signal delivered after `trio.run` returns). `_safe_dump_task_tree()` is a `run_sync_soon`-friendly wrapper that swallows any exception from `dump_task_tree()` — trio prints + crashes on uncaught exceptions in scheduled callbacks; better to log + keep the run alive so the user can re-trigger. Other, - emit `capture-bypass tee: <fpath>` line + `tail -f` hint in the rendered dump header so users know where to find the artifact even when stdio is captured. - swap the inline `f' |_{actor}'` line for a `_pformat.nest_from_op` rendering of `actor_repr` (matches the rest of the runtime's nested-op style). - log lines on handler install + already-installed branches now note `(trio_token captured: <bool>)` so it's obvious from the log whether the full-tree path is wired. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `2d4995e08d`)	2026-06-09 23:24:18 -04:00
Gud Boi	109313d9de	Add `--enable-stackscope` pytest plugin flag New `--enable-stackscope` CLI flag installs a SIGUSR1 → trio-task-tree-dump handler in pytest itself + every spawned subactor for live stack visibility during hang investigations. Lighter than `--tpdb` (no pdb machinery / tty-lock contention) — pure stack-only triage. Plumbing: - `_testing.pytest.pytest_addoption()` adds the flag. - `_testing.pytest.pytest_configure()` (when flag set): * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children inherit it via environ, * installs the handler in pytest itself via `enable_stack_on_sig()`. - `runtime._runtime.Actor.async_main()` extends the existing `_debug_mode` gate to ALSO fire when `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors install the same handler at runtime startup. Capture-bypass tee in `dump_task_tree()`: Pytest's default `--capture=fd` swallows `log.devx()` output, making SIGUSR1 dumps invisible right when you need them. Render the dump once to a `full_dump` str, then unconditionally tee to: - `/tmp/tractor-stackscope-<pid>.log` (append-mode, always written) — guaranteed-readable artifact even under CI / `nohup` / no-tty. `tail -f` to follow. - `/dev/tty` (best-effort) — pytest never captures the tty; ignored if device is missing. Other, - squelch the benign `RuntimeWarning` ("coroutine method 'asend'/'athrow' was never awaited") from `stackscope._glue`'s import-time async-gen type introspection so `--enable-stackscope` setup stays quiet. - log msg in the `_runtime` ImportError branch now mentions `--enable-stackscope` alongside debug-mode. 
Usage, pytest --enable-stackscope -k <hang-test> # in another shell, find the pid + signal: kill -USR1 <pytest-or-subactor-pid> # tail the artifact: tail -f /tmp/tractor-stackscope-<pid>.log (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `5418f2dc3c`) (factored: only the flag + activation hunks; the surrounding skipon-marker/reap-fixture context rides with the testing-harness segment)	2026-06-09 23:24:18 -04:00
Gud Boi	9500d02ef6	Add `._debug_hangs` to `.devx` for hang triage Bottle up the diagnostic primitives that actually cracked the silent mid-suite hangs in the `subint` spawn-backend bringup (issue there" session has them on the shelf instead of reinventing from scratch. Deats, - `dump_on_hang(seconds, *, path)` — context manager wrapping `faulthandler.dump_traceback_later()`. Critical gotcha baked in: dumps go to a file, not `sys.stderr`, bc pytest's stderr capture silently eats the output and you can spend an hour convinced you're looking at the wrong thing - `track_resource_deltas(label, *, writer)` — context manager logging per-block `(threading.active_count(), len(_interpreters.list_all()))` deltas; quickly rules out leak-accumulation theories when a suite progressively worsens (if counts don't grow, it's not a leak, look for a race on shared cleanup instead) - `resource_delta_fixture(*, autouse, writer)` — factory returning a `pytest` fixture wrapping `track_resource_deltas` per-test; opt in by importing into a `conftest.py`. Kept as a factory (not a bare fixture) so callers own `autouse` / `writer` wiring Also, - export the three names from `tractor.devx` - dep-free on py<3.13 (swallows `ImportError` for `_interpreters`) - link back to the provenance in the module docstring (issue #379 / commit `26fb820`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `09466a1e9d`)	2026-06-09 23:24:18 -04:00
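The file-backed hang dump described for `dump_on_hang()` can be approximated with the stdlib alone. The sketch below is an assumption about the general shape, not the `tractor.devx` code; `dump_on_hang_sketch` is a hypothetical name, and only `faulthandler.dump_traceback_later()` and `faulthandler.cancel_dump_traceback_later()` are real APIs.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def dump_on_hang_sketch(seconds: float, *, path: str):
    '''
    Arm a watchdog that writes all thread tracebacks to `path`
    if the wrapped block is still running after `seconds`.

    The dump goes to a real file (not `sys.stderr`) so an output
    capturing test runner cannot swallow it.

    '''
    with open(path, 'w') as dump_file:
        faulthandler.dump_traceback_later(
            seconds,
            repeat=False,
            file=dump_file,
            exit=False,
        )
        try:
            yield path
        finally:
            # disarm BEFORE the file is closed by the `with` exit
            faulthandler.cancel_dump_traceback_later()
```

A block that finishes before the timeout leaves an empty file behind; a block that hangs leaves the tracebacks in `path` for inspection from another shell.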
Gud Boi	7f0183d466	Use `is not None` check for peer-connect `event` Matches the explicit `dict.pop(uid, None)` contract one line above; same semantics as the prior truthy check. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `0e3e008b0c`)	2026-06-09 23:08:40 -04:00
Gud Boi	f64b282620	Fix dropped `for/else` re-raise in masking CM `30e15925` ("Add `start_or_cancel()` to `trionics._taskc`") inserted `async def start_or_cancel()` — whose body opens its own col-4 `try:` — immediately before the trailing `else: raise`. Because the edit was a pure insertion (0 deletions), the same `else: raise` lines were silently REPARENTED: they used to be the `for exc_match in matching: ... else: raise` of `maybe_raise_from_masking_exc`, but now bind to `start_or_cancel`'s `try/except` where they're unreachable dead code. Net effect: `maybe_raise_from_masking_exc` lost the `for/else` re-raise of the un-masked exception, so a masked child cancellation gets swallowed instead of surfaced. - restore the `for/else: raise` to `maybe_raise_from_masking_exc` - drop the now-dead `else: raise` from `start_or_cancel` Surfaced as 2 deterministic failures in `test_sigint_closes_lifetime_stack[wait_for_ctx-bg_aio_task-send_SIGINT_to=child-*]` (the SIGINT-to-child "silent-abandon" regime). Bisected with `trio` held at `0.29.0`: clean at `9c36363b` (0/8), broken at `30e15925` (8/8), fixed (0/8). NOT a `trio` (0.29↔0.33 identical) nor logging-plugin regression. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `325574cc07`)	2026-06-09 23:08:40 -04:00
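The `for/else` semantics behind this regression can be shown in isolation. The two functions below are hypothetical stand-ins, not the `maybe_raise_from_masking_exc` body: the first keeps the loop's `else:` re-raise, the second models what remains once that `else:` no longer belongs to the loop.

```python
def surface_unmatched(exc: BaseException, matching: tuple) -> str:
    '''
    With the `else:` attached to the `for`, an exception that
    matches nothing is re-raised once the loop is exhausted.

    '''
    for exc_match in matching:
        if isinstance(exc, exc_match):
            break  # matched: skip the `else:` suite
    else:
        raise exc  # only runs when the loop never hit `break`
    return 'matched'


def swallow_unmatched(exc: BaseException, matching: tuple) -> str:
    '''
    Same loop with the `else: raise` lost: an unmatched
    exception now falls through silently.

    '''
    for exc_match in matching:
        if isinstance(exc, exc_match):
            return 'matched'
    return 'swallowed'
```

Because the loop's `else:` suite runs only when no `break` fired, moving it under a different compound statement changes behaviour without producing a syntax error, which is why a pure insertion could cause the regression unnoticed.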
Gud Boi	549fc26516	Add `start_or_cancel()` to `trionics._taskc` Wrapper around `trio.Nursery.start()` that DOESN'T mask out-of-band cancellation as a lossy startup failure. Picks the right re-raise: ambient `Cancelled` when present, the genuine startup-protocol `RuntimeError` otherwise. The problem, - `trio.Nursery.start()` raises a generic `RuntimeError("child exited without calling task_status.started()")` whenever the started task exits BEFORE calling `task_status.started()` — INCLUDING the common case where the child was cancelled out-of-band by an ancestor cancel-scope erroring/cancelling. - In that case the original `trio.Cancelled` is swallowed and the caller is left w/ an opaque, root-cause-detached `RuntimeError`. The fix, - Catch the "...started" RTE. - `await trio.lowlevel.checkpoint_if_cancelled()` — re-raises the in-flight `Cancelled` IFF we're under effective cancellation (ancestor-inclusive), carrying trio's auto-generated reason which points at the true root exc. - If we're NOT cancelled the `checkpoint_if_cancelled()` is a cheap no-op and we fall through to re-raise the genuine startup-protocol RTE. Re-export from `tractor.trionics` so callers don't have to reach into `_taskc`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `30e15925ba`)	2026-06-09 23:08:40 -04:00
Gud Boi	23cc1413dd	Add `maybe_signal_aio_task()` + cause-chain guard Factor the "deliver an exc to a running aio task" pattern out of `translate_aio_errors()` + `open_channel_from()` into a shared `maybe_signal_aio_task()` helper. Add a cause-chain matrix comment + relay-echo guard so the final-raise block can't cycle `trio_err.__cause__` back onto its own derivative relay. `maybe_signal_aio_task()`, - Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT `aio_task.set_exception()` which on py3.13+ ALWAYS raises `RuntimeError("Task does not support set_exception")` (dead code as a relay mechanism). - Returns `(delivered: bool, report: str)`. Caller uses `delivered` to flip `wait_on_aio_task` when delivery failed (avoids hanging on `_aio_task_complete.wait()`). - `pre_captured_fut=`: required when the caller crosses a trio checkpoint between capturing `_fut_waiter` and invoking the helper. `Task._wakeup` clears `_fut_waiter = None` so re-reading post-checkpoint loses the ref even though the exc is still in-flight on the (now-`done()`) original fut. - `cause=`: sets `exc.__cause__ = cause` so the relay carries a "trio_err -> caused -> relay" chain through `set_exception()` → `Task._wakeup` → coro raise → `wait_on_coro_final_result` → `signal_trio_when_done` → `task.result()`-raise. - `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the narrow case where `_fut_waiter is None` AND task is runnable (sitting in asyncio's ready queue, not parked on a poke-able future). NEVER cancels when `_fut_waiter` carries an in-flight exc — that would race + mask the real terminating exc. `translate_aio_errors()`, - Replace the two ad-hoc `_fut_waiter.set_exception()` / `aio_task.set_exception()` call sites w/ the helper. - Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown `trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref). 
- New "cross-loop cause-chain matrix" comment block on the final-raise — tabulates every `(trio_err, aio_err, trio_to_raise)` combo into exactly one terminal `raise X [from Y]` or early `return`. Covers the sibling `signal_trio_when_done()` resolution + the relay-echo INVARIANT. - New relay-echo guard: if `aio_err` is one of OUR OWN signals (`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is trio_err`, raise the bare `trio_err` instead of `trio_err from aio_err` (which would CYCLE the cause chain since the relay was itself caused-by `trio_err`). - Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT EXCEPT or this WILL HANG" warning — the helper handles the failure path explicitly via `delivered=False` → `wait_on_aio_task = False`. - Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback shows the real root. `open_channel_from()`, - Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?" + `tractor.pause(shield=True)` panic branch. - Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as `cause=` — non-`None` only when the `try` body raised (graceful exit → None). Other, - Top-level import: `sys` (for `sys.exc_info()`). - `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await trio_done_fute` next to the shielded version — exploratory note for the longstanding "why is `.shield()` needed?" TODO. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `acd1cbeec4`)	2026-06-09 23:08:40 -04:00
Gud Boi	d78962c319	Escalate cancel-ack timeouts to `proc.terminate()` Wires SC-discipline cancel-then-escalate into `ActorNursery.cancel()`: graceful cancel-req -> bounded wait -> hard-kill. Deats, - add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`. When `True`, bounded-wait expiry raises `ActorTooSlowError` instead of the legacy DEBUG-log + return-`False` path. Default stays `False` for callers that handle their own escalation (e.g. `_spawn.soft_kill()` polling `proc.poll()`). - add `_try_cancel_then_kill()` helper in `_supervise` used by per-child cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()` (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever waiting on `proc.poll()`. - replace `tn.start_soon(portal.cancel_actor)` in `ActorNursery.cancel()` with the helper. Debug-mode bypass: skip escalation (fall back to legacy fire-and-forget cancel) when ANY of: - `Lock.ctx_in_debug is not None` (some actor is currently REPL-locked) - `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`). - `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=` opt-in). ORing covers root-debug, child-debug, and active-REPL-lock cases without false-positively SIGTERM-ing a sub-tree proxying stdio for a REPL session. Motivated by the `subint_forkserver` dup-name hang where a same-named sibling subactor's cancel-RPC failed to ack within `Portal.cancel_timeout` (TCP+forkserver register-RPC contention) and the nursery `__aexit__` deadlocked. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `34f333a026`)	2026-06-09 23:08:40 -04:00
Gud Boi	703973b7c4	Add `ActorTooSlowError` for cancel-cascade timeouts Distinct from `trio.TooSlowError` so that existing `except trio.TooSlowError:` blocks don't silently mask actor-cancel timeouts — these must propagate to let a supervisor escalate to `proc.terminate()` per SC-discipline: graceful cancel-req -> bounded wait -> hard-kill Motivated by #subint_forkserver dup-name hang where `Portal.cancel_actor()` silently swallowed the timeout and the supervisor never escalated, leaving a same-named sibling subactor parked forever. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `38ffb875bd`)	2026-06-09 23:08:40 -04:00
Gud Boi	f6c9665bf1	Tidy proto-guard `ValueError` fmt in `open_root_actor()` Pre-compute `mismatch_lines` str instead of `+`-concat inside the f-string raise site; slightly easier to read and avoids the `+ '\n\n'` continuation. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `5cd06810db`)	2026-06-09 23:08:40 -04:00
Gud Boi	d3b1a68ff9	Add `enable_transports`/`registry_addrs` proto guard Raise `ValueError` from `open_root_actor()` when any `registry_addrs` entry uses a transport proto not in `enable_transports` — historically this caused a silent indefinite hang during the registrar handshake (the actor could never connect to register/discover). Also, - update `test_root_passes_tpt_to_sub` to detect a proto mismatch between parametrized `tpt_proto_key` and CLI `tpt_proto`, asserting the new guard raises `ValueError` with expected msg content. - replace old commented-out notes with a clearer explanation of the mismatch foot-gun. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `d036ef7d7f`)	2026-06-09 23:08:40 -04:00
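The transport-proto guard can be pictured as a fail-fast check before any registrar handshake. This is a sketch under stated assumptions: addresses are modelled as hypothetical `(proto, addr)` tuples and `check_registry_protos` is an illustrative name, not the check inside `open_root_actor()`.

```python
from typing import Iterable, Sequence, Tuple


def check_registry_protos(
    registry_addrs: Iterable[Tuple[str, object]],
    enable_transports: Sequence[str],
) -> None:
    '''
    Raise `ValueError` when any registry address uses a transport
    proto that is not enabled, instead of hanging later on.

    '''
    mismatches = [
        (proto, addr)
        for proto, addr in registry_addrs
        if proto not in enable_transports
    ]
    if not mismatches:
        return

    # pre-compute the report lines before building the message
    mismatch_lines = '\n'.join(
        f'  {proto!r} -> {addr!r}'
        for proto, addr in mismatches
    )
    raise ValueError(
        f'Registry addrs use protos not in '
        f'enable_transports={list(enable_transports)!r}:\n'
        f'{mismatch_lines}'
    )
```

Raising at startup turns what was a silent indefinite hang into an immediate, descriptive error.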
Gud Boi	bdf0fb0a2e	Fix shutdown deadlock on UDS unlink race Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError` guard — under concurrent pytest sessions the sock-file can already be reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before `_shutdown.set()`, deadlocking `wait_for_shutdown()` on `actor.cancel()`. Also, - close each endpoint independently in the finally so one raise doesn't strand the rest. - always signal `_shutdown.set()` regardless of remaining ep count. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `2ee44a6fdd`)	2026-06-09 23:07:44 -04:00
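The teardown pattern from this fix (tolerate an already-removed socket file, close each endpoint independently, always signal shutdown) can be sketched with the stdlib. The names below are hypothetical and `threading.Event` stands in for the runtime's own shutdown event; this is not the `_serve_ipc_eps` code.

```python
import os
import threading
from typing import Iterable, List


def close_listener_sketch(sockpath: str) -> None:
    # another session may have reaped the sock-file already
    try:
        os.unlink(sockpath)
    except FileNotFoundError:
        pass


def shutdown_endpoints_sketch(
    sockpaths: Iterable[str],
    shutdown: threading.Event,
) -> List[OSError]:
    '''
    Close every endpoint on its own so one failure cannot strand
    the rest, and ALWAYS set the shutdown event afterwards.

    '''
    errors: List[OSError] = []
    try:
        for path in sockpaths:
            try:
                close_listener_sketch(path)
            except OSError as err:
                errors.append(err)  # record, keep closing the others
    finally:
        shutdown.set()  # waiters must never block on a failed close
    return errors
```

Setting the event in a `finally:` is the key design choice: anything waiting on shutdown is released even when a close step raises.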
Gud Boi	aaf696ba4c	Drop global mutation of `_PROC_SPAWN_WAIT` In top level `daemon`-fixture that is.. Use a local `bg_daemon_spawn_delay` instead of mutating the module-level `_PROC_SPAWN_WAIT` — previously each `daemon` fixture invocation would permanently add 1.6s (UDS) or 1s (CI) to the global, inflating delays across the session. Also, emit a `test_log.warning()` when verbose loglevel is silently reduced to `'info'`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `c4885f9d99`)	2026-06-09 23:07:44 -04:00
Gud Boi	2d8bcbb1c4	Add `tractor.trionics.patches` subpkg + first fix With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which can busy-loop due to lack of handling `EOF`. New `tractor.trionics.patches` subpkg housing defensive monkey-patches for upstream `trio` bugs we've encountered while running `tractor` — particularly as of recent, fork-survival edge cases that haven't been filed/fixed upstream yet. Each patch is idempotent, version-gated via `is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the upstream release whose adoption allows deletion. Subpkg layout + per-patch contract documented in `tractor/trionics/patches/README.md` — `apply()` / `is_needed()` / `repro()` API, registry pattern via `_PATCHES` in `__init__.py`, single-call entry point `apply_all()`. First patch, `_wakeup_socketpair`: - `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN). - under `fork()`-spawning backends the COW-inherited socketpair fds & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair` instance whose write-end is closed, and `drain()` then spins forever in C with no Python checkpoints, - this obviously burns 100% CPU and no signal delivery. Standalone repro: from trio._core._wakeup_socketpair import WakeupSocketpair ws = WakeupSocketpair() ws.write_sock.close() ws.drain() # spins forever Patch is one-line — break the drain loop on b'' EOF. Manifested as two distinct test failures: - `tests/test_multi_program.py::test_register_duplicate_name` hung at 100% CPU on the busy-loop directly (fork child's worker thread) - `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`, both threads parked in `epoll_wait`, no TCP connect-back to parent ever happened. Same patch fixes both. 
Restored 99.7% pass rate on full suite under `--spawn-backend=main_thread_forkserver` (was hanging indefinitely before). Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE any trio runtime init. Harmless on non-fork backends. Conc-anal write-ups, including strace + py-spy evidence: - `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md` - `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md` Regression tests in `tests/trionics/test_patches.py`: each test asserts (a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b) the patch fixes it with a SIGALRM wall-clock cap so a regression fails loudly instead of hanging silently. TODO: - [ ] file the upstream `python-trio/trio` issue + PR. - [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the issue body's evidence section. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `0ef549fadb`) (factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)	2026-06-09 23:07:44 -04:00
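The EOF handling that the drain patch adds can be demonstrated on a plain stdlib socketpair, without touching `trio` internals. The function below is an illustrative sketch (its name and return value are assumptions), not the patched `WakeupSocketpair.drain()`.

```python
import socket


def drain_until_empty_or_eof(wakeup_sock: socket.socket) -> int:
    '''
    Drain a NON-BLOCKING socket and return the byte count read.

    Exits on `BlockingIOError` (nothing left to read) AND on
    `b''` (peer closed). Without the EOF check a closed write-end
    makes `recv()` return `b''` forever and the loop never ends.

    '''
    drained = 0
    try:
        while True:
            data = wakeup_sock.recv(2**16)
            if not data:
                break  # EOF: the write-end was closed
            drained += len(data)
    except BlockingIOError:
        pass  # buffer is empty, peer still open
    return drained
```

On a closed peer `recv()` keeps returning `b''` immediately rather than raising, so a loop that only exits on `BlockingIOError` spins at full CPU; the single `break` is what bounds it.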
Gud Boi	2c0fefef61	Add `tractor.spawn._reap.unlink_uds_bind_addrs()` Inside a new `tractor.spawn._reap` submod which kicks off providing post-mortem subactor cleanup primitives, parent-side; consider it the "sibling" of `tractor._testing._reap` which is the test-harness-oriented brother mod. Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454 where `hard_kill()`'s `SIGKILL` bypasses the subactor's `_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking `${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files. This adds 2 cleanup paths: - explicit `bind_addrs` (when set at spawn time), OR - convention-based reconstruction from `subactor.aid.name + proc.pid` for the random-self-assign case. `.spawn.hard_kill()` now invokes the cleanup unconditionally post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError` skip. Future work — authoritative tracking via a per-process UDS bind-addr registry — documented in module docstring, deferred to a follow-up PR. Co-fix: `tractor/spawn/_trio.py::new_proc` already passes `bind_addrs` + `subactor` to `hard_kill` via prior work on this branch. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `e9712dcaeb`) (factored: the tractor/_testing/_reap.py harness hunk rides with the testing-harness segment instead)	2026-06-09 23:07:44 -04:00
Gud Boi	400ed77ab1	Allow per-call `start_method`/`loglevel` overrides In `tests/devx/conftest.py::spawn`, refactor the fixture-internal closures so consumer tests can pass explicit `start_method`/`loglevel` to each `_spawn()` invocation rather than only inheriting the fixture-scoped parametrize values. Deats, - promote `set_spawn_method()` and `set_loglevel()` to take their respective values as fn params (vs closing over the fixture-scope vars). - give `_spawn()` `start_method=start_method` and `loglevel: str|None = None` kwargs so callers override one-off without re-parametrizing the suite. NOTE: this drops the implicit fixture-scoped `loglevel` forward — `_spawn()` callers now must pass `loglevel=...` explicitly. - TODO: figure out how `--ll <level>` should map to the default (currently `None` → uses env-var or tractor default). - add a docstring to `_spawn()` so its role as the consumer-facing closure is obvious from `help()`. Also, - `assert_before()` now returns the `.before` output on success (was `None`); add a one-line docstring describing the new return contract. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `486249d74f`)	2026-06-09 23:07:06 -04:00
Gud Boi	c794d5ef44	Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars Add env-var overrides inside `._root.open_root_actor()` so devs/test-runs can swap the actor-spawn backend or crank console verbosity without touching application code. In `._root.open_root_actor()`, - read `TRACTOR_LOGLEVEL` early, overriding any caller-passed `loglevel` and stashing an `env_ll_report` to emit once the console log is set up. - pull the `loglevel` fallback (`or _default_loglevel`) and `log.get_console_log()` init up so the env-var report routes through tractor's own logger. - read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed `start_method` and warn-logging when the env-var clobbers an explicit caller value. Wire the same vars through `tests/devx/conftest.py::spawn`, - request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL` and `TRACTOR_SPAWN_METHOD` in `os.environ` before each `pexpect.spawn()` (inherited by the example subproc). - expand `supported_spawners` to include `main_thread_forkserver` and `subint_forkserver` bc example scripts no longer need per-script CLI plumbing. - pop both vars in fixture teardown so a leaked value can't re-route a later in-process tractor test's spawn-backend or loglevel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `208e7c0926`)	2026-06-09 23:07:06 -04:00
Gud Boi	75bb371d5d	Fix `SharedMemory` under `subint_forkserver` Implements the resolution described in c99d475d's `subint_forkserver_mp_shared_memory_issue.md` (now updated with the resolution post-mortem). Two-part fix that side-steps `mp.resource_tracker` entirely rather than try to make it fork-safe — turns out that's both simpler AND more correct given tractor already SC-manages allocation lifetimes. Deats, - `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches now run unconditionally: * monkey-patch `mp.resource_tracker._resource_tracker` to a no-op `ManTracker` subclass (empty `register` / `unregister` / `ensure_running`). * return `partial(SharedMemory, track=False)` for the per-allocation opt-out. * belt + suspenders: even if something dodges the wrapper, the singleton can't talk to the inherited (broken) parent fd. - `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional skip of the unlink-callback; install a `try_unlink()` wrapper that swallows `FileNotFoundError` (sibling-already-cleaned race in shared-key setups). Without `mp.resource_tracker` doing it for us, we own the unlink — `actor.lifetime_stack` is the right place since tractor already controls actor lifecycle. - `tests/test_shm.py`: comment out `subint_forkserver` in the module-level skip-list (tests pass now). Inline comment cross-refs the two `_mp_bs` / `_shm` workarounds. - `ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`: heavy rewrite — flips status from "open / unresolvable in tractor" to "resolved, kept as decision record". 
Adds Resolution section, "Why this is the right call" rationale (mp tracker is widely criticized; tractor already owns lifecycle), trade-offs (crash-leaked segments, lost mp leak warning), verification (7 passed under both `subint_forkserver` and `trio` backends), and upstream issue links (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `aa3e230926`) (factored: dropped subint_forkserver conc-anal doc update) (factored: the tests/test_shm.py skip-mark comment hunk rides with the test-hardening segment instead)	2026-06-09 23:06:42 -04:00
Gud Boi	9604569584	Bound peer-clear wait in `async_main` finally Fifth diagnostic pass pinpointed the hang to `async_main`'s finally block — every stuck actor reaches `FINALLY ENTER` but never `RETURNING`. Specifically `await ipc_server.wait_for_no_more_peers()` never returns when a peer-channel handler is stuck: the `_no_more_peers` Event is set only when `server._peers` empties, and stuck handlers keep their channels registered. Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout that records the still-connected peer count. 3s is enough for any graceful cancel-ack round-trip; beyond that we're in bug territory and need to proceed with local teardown so the parent's `_ForkedProc.wait()` can unblock. Defense-in-depth regardless of the underlying bug — a local finally shouldn't block on remote cooperation forever. Verified: with this fix, ALL 15 actors reach `async_main: RETURNING` (up from 10/15 before). Test still hangs past 45s though — there's at least one MORE unbounded wait downstream of `async_main`. Candidates enumerated in the doc update (`open_root_actor` finally / `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps` finally). Skip-mark stays on `test_nested_multierrors[subint_forkserver]`. Also updates `subint_forkserver_test_cancellation_leak_issue.md` with the new pinpoint + summary of the 6-item investigation win list: 1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed 2. pidfd-based `_ForkedProc.wait` — cancellable 3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable 4. `wait_for_no_more_peers` bound — THIS commit 5. Ruled-out hypotheses: tree-kill missing, stuck socket recv, capture-pipe fill (all wrong) 6. 
Remaining unknown: at least one more unbounded wait in the teardown cascade above `async_main` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `e312a68d8a`) (factored: dropped subint_forkserver conc-anal doc update)	2026-06-09 23:06:09 -04:00
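A minimal sketch of the bounded wait described in `9604569584`, not the actual tractor code. It assumes stdlib `asyncio` as a stand-in for trio, so `asyncio.wait_for()` plays the role of `trio.move_on_after(3.0)`. The names `bounded_peer_wait` and `peers` are hypothetical and exist only for illustration.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


async def bounded_peer_wait(
    no_more_peers: asyncio.Event,
    peers: dict,
    timeout: float = 3.0,
) -> bool:
    """Wait for all peers to disconnect, but never longer than `timeout`.

    Returns True if the event fired in time, False if the wait was
    abandoned so that local teardown can proceed.
    """
    try:
        # asyncio analogue of wrapping the wait in `trio.move_on_after()`.
        await asyncio.wait_for(no_more_peers.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Record how many peers are still registered, then carry on:
        # a local finally block should not depend on remote cooperation.
        log.warning(
            "Timed out after %.1fs waiting for peers to clear; "
            "%d still connected",
            timeout,
            len(peers),
        )
        return False
```

The boolean return value is a choice made for this sketch: it lets the caller log or branch on the timeout without raising inside a finally block.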
Gud Boi	332ac30636	Break parent-chan shield during teardown Completes the nested-cancel deadlock fix started in `0cd0b633` (fork-child FD scrub) and `fe540d02` (pidfd-cancellable wait). The remaining piece: the parent-channel `process_messages` loop runs under `shield=True` (so normal cancel cascades don't kill it prematurely), and relies on EOF arriving when the parent closes the socket to exit naturally. Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is reliable: the parent's teardown closes the handler-task socket deterministically. But fork-based backends like `subint_forkserver` share enough process-image state that EOF delivery becomes racy: the loop parks waiting for an EOF that only arrives after the parent finishes its own teardown, but the parent is itself blocked on `os.waitpid()` for THIS actor's exit. Mutual wait → deadlock. Deats, - `async_main` stashes the cancel-scope returned by `root_tn.start(...)` for the parent-chan `process_messages` task onto the actor as `_parent_chan_cs` - `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` + `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to explicitly break the shield: no more waiting for EOF delivery, unwinding proceeds deterministically regardless of backend - inline comments on both sites explain the mutual-wait deadlock + why the explicit cancel is backend-agnostic rather than a forkserver-specific workaround With this + the prior two fixes, the `subint_forkserver` nested-cancel cascade unwinds cleanly end-to-end. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `8ac3dfeb85`)	2026-06-09 23:06:09 -04:00
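A rough sketch of the "stash a cancel handle, then cancel it explicitly during teardown" pattern from `332ac30636`, not the actual tractor code. It assumes stdlib `asyncio` in place of trio: an `asyncio.Task` handle stands in for the stashed trio cancel-scope (`_parent_chan_cs`), and a queue with a `None` sentinel stands in for the IPC channel and its EOF. `SketchActor` and `_parent_chan_task` are hypothetical names.

```python
import asyncio
from typing import Optional


class SketchActor:
    """Toy actor whose parent-channel loop normally exits only on EOF."""

    def __init__(self) -> None:
        self._parent_chan_task: Optional[asyncio.Task] = None
        self.loop_exited = False

    async def _process_messages(self, chan: asyncio.Queue) -> None:
        try:
            while True:
                msg = await chan.get()
                if msg is None:  # hypothetical EOF sentinel
                    break
        finally:
            self.loop_exited = True

    def start(self, chan: asyncio.Queue) -> None:
        # Stash the handle so teardown can break the loop explicitly
        # instead of waiting for an EOF that may never arrive.
        # Must be called from inside a running event loop.
        self._parent_chan_task = asyncio.create_task(
            self._process_messages(chan)
        )

    async def cancel(self) -> None:
        task = self._parent_chan_task
        if task is not None and not task.done():
            task.cancel()
            # Wait for the loop to unwind; the cancellation is collected
            # as a result rather than re-raised here.
            await asyncio.gather(task, return_exceptions=True)
```

The point of the pattern is that teardown no longer depends on the parent closing its end of the channel, which is what made the unwind order backend-dependent.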


2669 Commits (3d36a06f5e979722fe73422df105f693e5f38bc8)
