Commit Graph

25 Commits (9e09dc5eee1592ec1d512abc7cc9db22879e5d86)

Author SHA1 Message Date
Gud Boi 3932daaf4f Drop `debug_mode` gate on stackscope SIGUSR1
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).

Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
  conjunction around the `enable_stack_on_sig()` call;
  now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
  `use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
  of the `if rvs['_debug_mode']:` block to peer-level.
  The `use_greenback` branch stays inside `_debug_mode`
  (pdb-specific).
- Refresh inline comments on both sites to call out the
  infected-`asyncio` "default SIGUSR1 = terminate proc"
  rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3d9c75b6ed)
2026-06-09 23:24:18 -04:00
Gud Boi 4da9c3daa8 Add `use_stackscope` runtime var for subactor init
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.

Deats,
- add `use_stackscope: bool` to `RuntimeVars`
  struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
  successful `stackscope` import, asserts unset
  on `ImportError`
- nest stackscope init under `_debug_mode` gate
  in `Actor.async_main`, check rtvar alongside
  env var
- defer `maybe_init_greenback` import to its own
  `use_greenback` branch

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 48523358cf)
2026-06-09 23:24:18 -04:00
Gud Boi 109313d9de Add `--enable-stackscope` pytest plugin flag
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.

Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
  * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
    inherit it via environ,
  * installs the handler in pytest itself via
    `enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
  existing `_debug_mode` gate to ALSO fire when
  `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
  install the same handler at runtime startup.

Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:

- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
  always written) — guaranteed-readable artifact even
  under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
  tty; ignored if device is missing.

Other,
- squelch the benign `RuntimeWarning` ("coroutine method
  'asend'/'athrow' was never awaited") from
  `stackscope._glue`'s import-time async-gen type
  introspection so `--enable-stackscope` setup stays
  quiet.
- log msg in the `_runtime` ImportError branch now
  mentions `--enable-stackscope` alongside debug-mode.

Usage,
  pytest --enable-stackscope -k <hang-test>
  # in another shell, find the pid + signal:
  kill -USR1 <pytest-or-subactor-pid>
  # tail the artifact:
  tail -f /tmp/tractor-stackscope-<pid>.log

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5418f2dc3c)

(factored: only the flag + activation hunks; the surrounding skipon-marker/reap-fixture context rides with the testing-harness segment)
2026-06-09 23:24:18 -04:00
Gud Boi d78962c319 Escalate cancel-ack timeouts to `proc.terminate()`
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:

  graceful cancel-req -> bounded wait -> hard-kill

Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
  When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
  of the legacy DEBUG-log + return-`False` path. Default stays `False`
  for callers that handle their own escalation (e.g.
  `_spawn.soft_kill()` polling `proc.poll()`).

- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
  cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
  (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
  waiting on `proc.poll()`.

- replace `tn.start_soon(portal.cancel_actor)` in
  `ActorNursery.cancel()` with the helper.

Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
  REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
  opt-in).

ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.

Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 34f333a026)
2026-06-09 23:08:40 -04:00
Gud Boi 9604569584 Bound peer-clear wait in `async_main` finally
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e312a68d8a)
(factored: dropped subint_forkserver conc-anal doc update)
2026-06-09 23:06:09 -04:00
Gud Boi 332ac30636 Break parent-chan shield during teardown
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8ac3dfeb85)
2026-06-09 23:06:09 -04:00
Gud Boi 0133bf7074 Refactor `_runtime_vars` into pure get/set API
Resetting `_runtime_vars` post-(forking-)spawn was
previously only possible via direct mutation of
`_state._runtime_vars` from an external module + an
inline default dict duplicating the
`_state.py`-internal defaults. Split the access
surface into a pure getter + explicit setter so such
a reset call site becomes a one-liner composition:
`set_runtime_vars(get_runtime_vars(clear_values=True))`.

Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
  `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
  live `_runtime_vars` is now initialised from
  `dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
  kwarg. When True, returns a fresh copy of
  `_RUNTIME_VARS_DEFAULTS` instead of the live dict —
  still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
  atomic replacement of the live dict's contents via
  `.clear()` + `.update()`, so existing references to the
  same dict object remain valid. Accepts either the
  historical dict form or the `RuntimeVars` struct

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7804a9fe57693dd5e15bee6a08e7d2fa14b6a98a)
(factored: kept only the tractor/runtime/_state.py part; dropped
 tractor/spawn/_subint_forkserver.py call-site rewire)
2026-06-09 23:06:09 -04:00
Gud Boi cd48d92c19 Extract `_actor_child_main()` as shared child entry
Pull the `_child.py` `__main__` block body out into
a callable `_actor_child_main()` so alternate spawn
backends can bootstrap a subactor without going
through the CLI entrypoint.

Deats,
- new `_actor_child_main(uid, loglevel, parent_addr,
  infect_asyncio, spawn_method='trio')` holds the
  full child-side runtime startup previously inlined
  under `if __name__ == '__main__':`
- `__main__` block reduces to arg-parsing + a call
  into the new func
- add `"subint"` to the `_runtime.py` spawn-method
  check so a child accepts `SpawnSpec` from that
  (future) backend; inert str-compare w/o it

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b8f243e98d)
(factored: kept only the `_child.py`/`_runtime.py` entry-extraction parts of
 "Impl min-viable `subint` spawn backend (B.2)"; dropped
 tractor/spawn/_subint.py + subint prompt-io logs)
2026-06-09 23:06:09 -04:00
Gud Boi ed65301d32 Fix misc bugs caught by Copilot review
Deats,
- use `proc.poll() is None` in `sig_prog()` to
  distinguish "still running" from exit code 0;
  drop stale `breakpoint()` from fallback kill
  path (would hang CI).
- add missing `raise` on the `RuntimeError` in
  `async_main()` when no tpt bind addrs given.
- clean up stale uid entries from the registrar
  `_registry` when addr eviction empties the
  addr list.
- update `discovery.__init__` docstring to match
  the new eager `._multiaddr` import.
- fix `registar` -> `registrar` typo in teardown
  report log msg.

Review: PR #429 (Copilot)
https://github.com/goodboy/tractor/pull/429

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:15 -04:00
Gud Boi 8817032c90 Prefer fresh conn for unreg, fallback to `_parent_chan`
The prior approach eagerly reused `_parent_chan` when
parent IS the registrar, but that channel may still
carry ctx/stream teardown protocol traffic —
concurrent `unregister_actor` RPC causes protocol
conflicts. Now try a fresh `get_registry()` conn
first; only fall back to the parent channel on
`OSError` (listener already closed/unlinked).

Deats,
- fresh `get_registry()` is the primary path for
  all addrs regardless of `parent_is_reg`
- `OSError` handler checks `parent_is_reg` +
  `rent_chan.connected()` before fallback
- fallback catches `OSError` and
  `trio.ClosedResourceError` separately
- drop unused `reg_addr: Address` annotation

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:15 -04:00
Gud Boi 70dc60a199 Bump UDS `listen()` backlog 1 -> 128 for multi-actor unreg
A backlog of 1 caused `ECONNREFUSED` when multiple
sub-actors simultaneously connect to deregister from
a remote-daemon registrar. Now matches the TCP
transport's default backlog (~128).

Also,
- add cross-ref comments between
  `_uds.close_listener()` and `async_main()`'s
  `parent_is_reg` deregistration path explaining
  the UDS socket-file lifecycle

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:15 -04:00
Gud Boi 7b04b2cdfc Reuse `_parent_chan` to unregister from parent-registrar
When the parent actor IS the registrar, reuse the existing parent
channel for `unregister_actor` RPC instead of opening a new connection
via `get_registry()`. This avoids failures when the registrar's listener
socket is already closed during teardown (e.g. UDS transport unlinks the
socket file rapidly).

Deats,
- detect `parent_is_reg` by comparing `_parent_chan.raddr` against
  `reg_addrs` and if matched, create a `Portal(rent_chan)` directly
  instead of `async with get_registry()`.
- rename `failed` -> `failed_unreg` for clarity.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:14 -04:00
Gud Boi c3d6cc9007 Rename `discovery._discovery` to `._api`
Adjust all imports to match.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:14 -04:00
Gud Boi bc60aa1ec5 Add `tpt_bind_addrs` param to `open_root_actor()`
Allow callers to explicitly declare transport
bind addrs instead of always auto-generating
random ones from ponged registrar addresses.

Deats,
- new `tpt_bind_addrs` kwarg wraps each input
  addr via `wrap_address()` at init time.
- non-registrar path only auto-generates random
  bind addrs when `tpt_bind_addrs` is empty.
- registrar path merges user-provided bind addrs
  with `uw_reg_addrs` via `set()` union.
- drop the deprecated `arbiter_addr` param and
  its `DeprecationWarning` shim entirely.

Also,
- expand `registry_addrs` type annotation to
  `Address|UnwrappedAddress`.
- replace bare `assert accept_addrs` in
  `async_main()` with a descriptive
  `RuntimeError` msg.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:14 -04:00
Gud Boi 926e861f52 Use upstream `py-multiaddr` for `._multiaddr`
Drop the NIH (notinventedhere) custom parser (`parse_maddr()`,
`iter_prot_layers()`, `prots`/`prot_params` tables) which was never
called anywhere in the codebase.

Replace with a thin `mk_maddr()` factory that wraps the upstream
`multiaddr.Multiaddr` type, dispatching on `Address.proto_key` to build
spec-compliant paths.

Deats,
- `'tcp'` addrs detect ipv4 vs ipv6 via stdlib
  `ipaddress` (resolves existing TODO)
- `'uds'` addrs map to `/unix/{path}` per the
  multiformats protocol registry (code 400)
- fix UDS `.maddr` to include full sockpath
  (previously only used `filedir`, dropped filename)
- standardize protocol names: `ipv4`->`ip4`,
  `uds`->`unix`
- `.maddr` properties now return `Multiaddr` objs
  (`__str__()` gives the canonical path form so all
  existing f-string/log consumers work unchanged)
- update `MsgTransport` protocol hint accordingly

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-14 19:54:14 -04:00
Gud Boi 27bf566d75 Guard `_mp_fixup_main` on non-empty parent data
Use walrus `:=` to combine the assignment and
truthiness check for `_parent_main_data` into the
`if` condition, cleanly skipping the fixup block
when `inherit_parent_main=False` yields `{}`.

Review: PR #438 (Copilot)
https://github.com/goodboy/tractor/pull/438

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-10 20:58:54 -04:00
Gud Boi acf6568275 Clarify `inherit_parent_main` docstring scope
Note the opt-out only applies to the trio spawn
backend; `multiprocessing` `spawn`/`forkserver`
reconstruct `__main__` via stdlib bootstrap.

Review: PR #438 (Copilot)
https://github.com/goodboy/tractor/pull/438

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-10 20:58:54 -04:00
mahmoud 00637764d9 Address review follow-ups for parent-main inheritance opt-out
Clean up mutable defaults, give parent-main bootstrap data a named type, and add direct start_actor coverage so the opt-out change is clearer to review.
2026-04-10 20:58:54 -04:00
mahmoud ea971d25aa Rename parent-main inheritance flag.
Use `inherit_parent_main` across the actor APIs and helper to better describe the behavior, and restore the reviewer note at child bootstrap where the inherited `__main__` data is copied from `SpawnSpec`.
2026-04-10 20:58:54 -04:00
mahmoud 83b6c4270a Simplify parent-main replay opt-out.
Keep actor-owned parent-main capture and let `_mp_figure_out_main()` decide whether to return `__main__` bootstrap data, avoiding the extra SpawnSpec plumbing while preserving the per-actor flag.
2026-04-10 20:58:54 -04:00
mahmoud 6309c2e6fc Route parent-main replay through SpawnSpec
Keep trio child bootstrap data in the spawn handshake instead of stashing it on Actor state so the replay opt-out stays explicit and avoids stale-looking runtime fields.
2026-04-10 20:58:54 -04:00
mahmoud f5301d3fb0 Add per-actor parent-main replay opt-out
Let actor callers skip replaying the parent __main__ during child startup so downstream integrations can avoid inheriting incompatible bootstrap state without changing the default spawn behavior.
2026-04-10 20:58:54 -04:00
Gud Boi be5d8da8c0 Just alias `Arbiter` via assignment 2026-04-02 17:59:13 -04:00
Gud Boi 9ec2749ab7 Rename `Arbiter` -> `Registrar`, mv to `discovery._registry`
Move the `Arbiter` class out of `runtime._runtime` into its
logical home at `discovery._registry` as `Registrar(Actor)`.
This completes the long-standing terminology migration from
"arbiter" to "registrar/registry" throughout the codebase.

Deats,
- add new `discovery/_registry.py` mod with `Registrar`
  class + backward-compat `Arbiter = Registrar` alias.
- rename `Actor.is_arbiter` attr -> `.is_registrar`;
  old attr now a `@property` with `DeprecationWarning`.
- `_root.py` imports `Registrar` directly for
  root-actor instantiation.
- export `Registrar` + `Arbiter` from `tractor.__init__`.
- `_runtime.py` re-imports from `discovery._registry`
  for backward compat.

Also,
- update all test files to use `.is_registrar`
  (`test_local`, `test_rpc`, `test_spawning`,
  `test_discovery`, `test_multi_program`).
- update "arbiter" -> "registrar" in comments/docstrings
  across `_discovery.py`, `_server.py`, `_transport.py`,
  `_testing/pytest.py`, and examples.
- drop resolved TODOs from `_runtime.py` and `_root.py`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-02 17:59:13 -04:00
Gud Boi cc42d38284 Mv core mods to `runtime/`, `spawn/`, `discovery/` subpkgs
Restructure the flat `tractor/` top-level private mods
into (more nested) subpackages:

- `runtime/`: `_runtime`, `_portal`, `_rpc`, `_state`,
  `_supervise`
- `spawn/`: `_spawn`, `_entry`, `_forkserver_override`,
  `_mp_fixup_main`
- `discovery/`: `_addr`, `_discovery`, `_multiaddr`

Each subpkg `__init__.py` is kept lazy (no eager
imports) to avoid circular import issues.

Also,
- update all intra-pkg imports across ~35 mods to use
  the new subpkg paths (e.g. `from .runtime._state`
  instead of `from ._state`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-02 17:59:13 -04:00