SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).
Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
conjunction around the `enable_stack_on_sig()` call;
now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
`use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
of the `if rvs['_debug_mode']:` block to peer-level.
The `use_greenback` branch stays inside `_debug_mode`
(pdb-specific).
- Refresh inline comments on both sites to call out the
infected-`asyncio` "default SIGUSR1 = terminate proc"
rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3d9c75b6ed)
Track `stackscope` enablement in `RuntimeVars` so
the flag propagates to subactors via the standard
rtvar IPC path instead of relying solely on the
`TRACTOR_ENABLE_STACKSCOPE` env var.
Deats,
- add `use_stackscope: bool` to `RuntimeVars`
struct + defaults dict
- `enable_stack_on_sig()` sets the rtvar on
successful `stackscope` import, asserts unset
on `ImportError`
- nest stackscope init under `_debug_mode` gate
in `Actor.async_main`, check rtvar alongside
env var
- defer `maybe_init_greenback` import to its own
`use_greenback` branch
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 48523358cf)
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.
Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
* exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
inherit it via environ,
* installs the handler in pytest itself via
`enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
existing `_debug_mode` gate to ALSO fire when
`TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
install the same handler at runtime startup.
Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:
- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
always written) — guaranteed-readable artifact even
under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
tty; ignored if device is missing.
Other,
- squelch the benign `RuntimeWarning` ("coroutine method
'asend'/'athrow' was never awaited") from
`stackscope._glue`'s import-time async-gen type
introspection so `--enable-stackscope` setup stays
quiet.
- log msg in the `_runtime` ImportError branch now
mentions `--enable-stackscope` alongside debug-mode.
Usage,
pytest --enable-stackscope -k <hang-test>
# in another shell, find the pid + signal:
kill -USR1 <pytest-or-subactor-pid>
# tail the artifact:
tail -f /tmp/tractor-stackscope-<pid>.log
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5418f2dc3c)
(factored: only the flag + activation hunks; the surrounding skipon-marker/reap-fixture context rides with the testing-harness segment)
Wires SC-discipline cancel-then-escalate into
`ActorNursery.cancel()`:
graceful cancel-req -> bounded wait -> hard-kill
Deats,
- add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`.
When `True`, bounded- wait expiry raises `ActorTooSlowError` instead
of the legacy DEBUG-log + return-`False` path. Default stays `False`
for callers that handle their own escalation (e.g.
`_spawn.soft_kill()` polling `proc.poll()`).
- add `_try_cancel_then_kill()` helper in `_supervise` used by per-child
cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()`
(SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever
waiting on `proc.poll()`.
- replace `tn.start_soon(portal.cancel_actor)` in
`ActorNursery.cancel()` with the helper.
Debug-mode bypass:
-----------------
skip escalation (fall back to legacy fire-and-forget cancel) when ANY
of:
- `Lock.ctx_in_debug is not None` (some actor is currently
REPL-locked)
- `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`).
- `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=`
opt-in).
ORing covers root-debug, child-debug, and active- REPL-lock cases
without false-positively SIGTERM- ing a sub-tree proxying stdio for
a REPL session.
Motivated by the `subint_forkserver` dup-name hang where a same-named
sibling subactor's cancel-RPC failed to ack within
`Portal.cancel_timeout` (TCP+ forkserver register-RPC contention) and
the nursery `__aexit__` deadlocked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 34f333a026)
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.
Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.
Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).
Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.
Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
wait in the teardown cascade above `async_main`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e312a68d8a)
(factored: dropped subint_forkserver conc-anal doc update)
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.
Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.
Deats,
- `async_main` stashes the cancel-scope returned by
`root_tn.start(...)` for the parent-chan
`process_messages` task onto the actor as
`_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
`ipc_server.cancel()` + `wait_for_shutdown()`)
calls `self._parent_chan_cs.cancel()` to
explicitly break the shield — no more waiting for
EOF delivery, unwinding proceeds deterministically
regardless of backend
- inline comments on both sites explain the mutual-
wait deadlock + why the explicit cancel is
backend-agnostic rather than a forkserver-specific
workaround
With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8ac3dfeb85)
Resetting `_runtime_vars` post-(forking-)spawn was
previously only possible via direct mutation of
`_state._runtime_vars` from an external module + an
inline default dict duplicating the
`_state.py`-internal defaults. Split the access
surface into a pure getter + explicit setter so such
a reset call site becomes a one-liner composition:
`set_runtime_vars(get_runtime_vars(clear_values=True))`.
Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
`_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
live `_runtime_vars` is now initialised from
`dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
kwarg. When True, returns a fresh copy of
`_RUNTIME_VARS_DEFAULTS` instead of the live dict —
still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
atomic replacement of the live dict's contents via
`.clear()` + `.update()`, so existing references to the
same dict object remain valid. Accepts either the
historical dict form or the `RuntimeVars` struct
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7804a9fe57693dd5e15bee6a08e7d2fa14b6a98a)
(factored: kept only the tractor/runtime/_state.py part; dropped
tractor/spawn/_subint_forkserver.py call-site rewire)
Pull the `_child.py` `__main__` block body out into
a callable `_actor_child_main()` so alternate spawn
backends can bootstrap a subactor without going
through the CLI entrypoint.
Deats,
- new `_actor_child_main(uid, loglevel, parent_addr,
infect_asyncio, spawn_method='trio')` holds the
full child-side runtime startup previously inlined
under `if __name__ == '__main__':`
- `__main__` block reduces to arg-parsing + a call
into the new func
- add `"subint"` to the `_runtime.py` spawn-method
check so a child accepts `SpawnSpec` from that
(future) backend; inert str-compare w/o it
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b8f243e98d)
(factored: kept only the `_child.py`/`_runtime.py` entry-extraction parts of
"Impl min-viable `subint` spawn backend (B.2)"; dropped
tractor/spawn/_subint.py + subint prompt-io logs)
Deats,
- use `proc.poll() is None` in `sig_prog()` to
distinguish "still running" from exit code 0;
drop stale `breakpoint()` from fallback kill
path (would hang CI).
- add missing `raise` on the `RuntimeError` in
`async_main()` when no tpt bind addrs given.
- clean up stale uid entries from the registrar
`_registry` when addr eviction empties the
addr list.
- update `discovery.__init__` docstring to match
the new eager `._multiaddr` import.
- fix `registar` -> `registrar` typo in teardown
report log msg.
Review: PR #429 (Copilot)
https://github.com/goodboy/tractor/pull/429
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The prior approach eagerly reused `_parent_chan` when
parent IS the registrar, but that channel may still
carry ctx/stream teardown protocol traffic —
concurrent `unregister_actor` RPC causes protocol
conflicts. Now try a fresh `get_registry()` conn
first; only fall back to the parent channel on
`OSError` (listener already closed/unlinked).
Deats,
- fresh `get_registry()` is the primary path for
all addrs regardless of `parent_is_reg`
- `OSError` handler checks `parent_is_reg` +
`rent_chan.connected()` before fallback
- fallback catches `OSError` and
`trio.ClosedResourceError` separately
- drop unused `reg_addr: Address` annotation
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
A backlog of 1 caused `ECONNREFUSED` when multiple
sub-actors simultaneously connect to deregister from
a remote-daemon registrar. Now matches the TCP
transport's default backlog (~128).
Also,
- add cross-ref comments between
`_uds.close_listener()` and `async_main()`'s
`parent_is_reg` deregistration path explaining
the UDS socket-file lifecycle
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
When the parent actor IS the registrar, reuse the existing parent
channel for `unregister_actor` RPC instead of opening a new connection
via `get_registry()`. This avoids failures when the registrar's listener
socket is already closed during teardown (e.g. UDS transport unlinks the
socket file rapidly).
Deats,
- detect `parent_is_reg` by comparing `_parent_chan.raddr` against
`reg_addrs` and if matched, create a `Portal(rent_chan)` directly
instead of `async with get_registry()`.
- rename `failed` -> `failed_unreg` for clarity.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adjust all imports to match.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Allow callers to explicitly declare transport
bind addrs instead of always auto-generating
random ones from ponged registrar addresses.
Deats,
- new `tpt_bind_addrs` kwarg wraps each input
addr via `wrap_address()` at init time.
- non-registrar path only auto-generates random
bind addrs when `tpt_bind_addrs` is empty.
- registrar path merges user-provided bind addrs
with `uw_reg_addrs` via `set()` union.
- drop the deprecated `arbiter_addr` param and
its `DeprecationWarning` shim entirely.
Also,
- expand `registry_addrs` type annotation to
`Address|UnwrappedAddress`.
- replace bare `assert accept_addrs` in
`async_main()` with a descriptive
`RuntimeError` msg.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop the NIH (notinventedhere) custom parser (`parse_maddr()`,
`iter_prot_layers()`, `prots`/`prot_params` tables) which was never
called anywhere in the codebase.
Replace with a thin `mk_maddr()` factory that wraps the upstream
`multiaddr.Multiaddr` type, dispatching on `Address.proto_key` to build
spec-compliant paths.
Deats,
- `'tcp'` addrs detect ipv4 vs ipv6 via stdlib
`ipaddress` (resolves existing TODO)
- `'uds'` addrs map to `/unix/{path}` per the
multiformats protocol registry (code 400)
- fix UDS `.maddr` to include full sockpath
(previously only used `filedir`, dropped filename)
- standardize protocol names: `ipv4`->`ip4`,
`uds`->`unix`
- `.maddr` properties now return `Multiaddr` objs
(`__str__()` gives the canonical path form so all
existing f-string/log consumers work unchanged)
- update `MsgTransport` protocol hint accordingly
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use walrus `:=` to combine the assignment and
truthiness check for `_parent_main_data` into the
`if` condition, cleanly skipping the fixup block
when `inherit_parent_main=False` yields `{}`.
Review: PR #438 (Copilot)
https://github.com/goodboy/tractor/pull/438
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Clean up mutable defaults, give parent-main bootstrap data a named type, and add direct start_actor coverage so the opt-out change is clearer to review.
Use `inherit_parent_main` across the actor APIs and helper to better describe the behavior, and restore the reviewer note at child bootstrap where the inherited `__main__` data is copied from `SpawnSpec`.
Keep actor-owned parent-main capture and let `_mp_figure_out_main()` decide whether to return `__main__` bootstrap data, avoiding the extra SpawnSpec plumbing while preserving the per-actor flag.
Keep trio child bootstrap data in the spawn handshake instead of stashing it on Actor state so the replay opt-out stays explicit and avoids stale-looking runtime fields.
Let actor callers skip replaying the parent __main__ during child startup so downstream integrations can avoid inheriting incompatible bootstrap state without changing the default spawn behavior.
Move the `Arbiter` class out of `runtime._runtime` into its
logical home at `discovery._registry` as `Registrar(Actor)`.
This completes the long-standing terminology migration from
"arbiter" to "registrar/registry" throughout the codebase.
Deats,
- add new `discovery/_registry.py` mod with `Registrar`
class + backward-compat `Arbiter = Registrar` alias.
- rename `Actor.is_arbiter` attr -> `.is_registrar`;
old attr now a `@property` with `DeprecationWarning`.
- `_root.py` imports `Registrar` directly for
root-actor instantiation.
- export `Registrar` + `Arbiter` from `tractor.__init__`.
- `_runtime.py` re-imports from `discovery._registry`
for backward compat.
Also,
- update all test files to use `.is_registrar`
(`test_local`, `test_rpc`, `test_spawning`,
`test_discovery`, `test_multi_program`).
- update "arbiter" -> "registrar" in comments/docstrings
across `_discovery.py`, `_server.py`, `_transport.py`,
`_testing/pytest.py`, and examples.
- drop resolved TODOs from `_runtime.py` and `_root.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Restructure the flat `tractor/` top-level private mods
into (more nested) subpackages:
- `runtime/`: `_runtime`, `_portal`, `_rpc`, `_state`,
`_supervise`
- `spawn/`: `_spawn`, `_entry`, `_forkserver_override`,
`_mp_fixup_main`
- `discovery/`: `_addr`, `_discovery`, `_multiaddr`
Each subpkg `__init__.py` is kept lazy (no eager
imports) to avoid circular import issues.
Also,
- update all intra-pkg imports across ~35 mods to use
the new subpkg paths (e.g. `from .runtime._state`
instead of `from ._state`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code