Reverts the `_Cache.run_ctx` change from 93aa39db which
moved `resources.pop(ctx_key)` to an outer `finally`
*after* the acm's `__aexit__()`. That introduced an
atomicity gap: `values` was already popped in the inner
finally but `resources` survived through the acm teardown
checkpoints. A re-entering task that creates a fresh lock
(the old one having been popped by the exiting caller)
could then acquire immediately and find stale `resources`
(for which now we raise a `RuntimeError('Caching resources ALREADY
exist?!')`).
Deats,
- the orig 93aa39db rationale was a preemptive guard
against acm `__aexit__()` code accessing `_Cache`
mid-teardown, but no `@acm` in `tractor` (or `piker`) ever
does that; the scenario never materialized.
- by popping both `values` AND `resources` atomically
(no checkpoint between them) in the inner finally,
the re-entry race window is closed: either the new
task sees both entries (cache hit) or neither
(clean cache miss).
- `test_moc_reentry_during_teardown` now passes
without `xfail`! (:party:)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
That is a repo-specific "concurrency-analysis" skill
Bo
Structured analysis framework for `trio`-based async races: inventories
shared mutable state, maps checkpoint boundaries, traces interleaved
task schedules, and proposes fixes. Includes tractor-specific patterns
for `_Cache` lock vs `run_ctx` lifetime, the `values`/`resources`
atomicity gap, and `Event.set()` scheduling semantics.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Documents the diagnostic session tracing why
per-`ctx_key` locking alone doesn't close the
`_Cache.run_ctx` teardown race — the lock pops
in the exiting caller's task but resource cleanup
runs in the `run_ctx` task inside `service_tn`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The per-`ctx_key` locking fix in f086222d intended to resolve the
teardown race reproduced by the new test suite, so the test SHOULD now
pass. TLDR, it doesn't Bp
Also add `collapse_eg()` to the test's ctx-manager stack so that when
run with `pytest <...> --tpdb` we'll actually `pdb`-REPL the RTE when it
hits (previously an assert-error).
(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(Hopefully!) solving a long-run bug with the `brokerd.kraken` backend in
`piker`..
- Track `_Cache.users` per `ctx_key` via a `defaultdict[..., int]`
instead of a single global counter; fix premature teardown when
multiple ctx keys are active simultaneously.
- Key `_Cache.locks` on `ctx_key` (not bare `fid`) so different kwarg
sets for the same `acm_func` get independent `StrictFIFOLock`s.
- Add `_UnresolvedCtx` sentinel class to replace bare `None` check;
avoid false-positive teardown when a wrapped acm legitimately yields
`None`.
- Swap resource-exists `assert` for detailed `RuntimeError`.
Also,
- fix "whih" typo.
- add debug logging for lock acquire/release lifecycle.
(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` bc `values`
was popped but `resources` wasn't yet.
Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
`9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
patch branch).
Also, add prompt-io log for the session.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.
Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md
Namely with multiple pre-sleep `delay`-parametrizations before either,
- parent-scope cancel-calling (as originally) or,
- depending on the new `cancel_by_cs: bool` suite parameter, optionally
just immediately exiting from (the newly named)
`maybe_cancel_outer_cs()` a checkpoint.
In the latter case we ensure we **don't** inf sleep to avoid leaking
those tasks into the `Actor._service_tn` (though we should really have
a better soln for this)..
Deats,
- make `cs` args optional and adjust internal logic to match.
- add some notes around various edge cases and issues with using the
actor-service-tn as the scope by default.
- Use `Type[BaseException]` (not bare `BaseException`)
for all err-type references: `get_err_type()` return,
`._src_type`, `boxed_type` in `unpack_error()`.
- Add `|None` where types can be unresolvable
(`get_err_type()`, `.boxed_type` property).
- Add `._src_type_resolved` flag to prevent repeated
lookups and guard against `._ipc_msg is None`.
- Fix `recevier` and `exeptions` typos.
Review: PR #426 (Copilot)
https://github.com/goodboy/tractor/pull/426
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- Remove leftover `await an.cancel()` in
`test_registered_custom_err_relayed`; the
nursery already cancels on scope exit.
- Fix `This document` -> `This documents` typo in
`test_unregistered_err_still_relayed` docstring.
Review: PR #426 (Copilot)
https://github.com/goodboy/tractor/pull/426
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add a teensie unit test to match.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop the `xfail` test and instead add a new one that ensures the
`tractor._exceptions` fixes enable graceful relay of
remote-but-unregistered error types via the unboxing of just the
`rae.src_type_str/boxed_type_str` content. The test also ensures
a warning is included with remote error content indicating the user
should register their error type for effective cross-actor re-raising.
Deats,
- add `test_unregistered_err_still_relayed`: verify the
`RemoteActorError` IS raised with `.boxed_type`
as `None` but `.src_type_str`, `.boxed_type_str`,
and `.tb_str` all preserved from the IPC msg.
- drop `test_unregistered_boxed_type_resolution_xfail`
since the new above case covers it and we don't need to have
an effectively entirely repeated test just with an inverse assert
as it's last line..
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Make `RemoteActorError` resilient to unresolved
custom error types so that errors from remote actors
always relay back to the caller - even when the user
hasn't called `reg_err_types()` to register the exc type.
Deats,
- `.src_type`: log warning + return `None` instead
of raising `TypeError` which was crashing the
entire `_deliver_msg()` -> `pformat()` chain
before the error could be relayed.
- `.boxed_type_str`: fallback to `_ipc_msg.boxed_type_str`
when the type obj can't be resolved so the type *name* is always
available.
- `unwrap_src_err()`: fallback to `RuntimeError` preserving
original type name + traceback.
- `unpack_error()`: log warning when `get_err_type()` returns
`None` telling the user to call `reg_err_types()`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Verify registered custom error types round-trip correctly over IPC via
`reg_err_types()` + `get_err_type()`.
Deats,
- `TestRegErrTypesPlumbing`: 5 unit tests for the type-registry plumbing
(register, lookup, builtins, tractor-native types, unregistered
returns `None`)
- `test_registered_custom_err_relayed`: IPC end-to-end for a registered
`CustomAppError` checking `.boxed_type`, `.src_type`, and `.tb_str`
- `test_registered_another_err_relayed`: same for `AnotherAppError`
(multi-type coverage)
- `test_unregistered_custom_err_fails_lookup`: `xfail` documenting that
`.boxed_type` can't resolve without `reg_err_types()` registration
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `open_actor_cluster()` teardown hangs
intermittently on UDS when `gather_contexts(mngrs=())`
raises `ValueError` mid-setup; likely a race in the
actor-nursery cleanup vs UDS socket shutdown. TCP
passes reliably (5/5 runs).
- Add `tpt_proto` fixture param to the test
- `pytest.skip()` on UDS with a TODO for deeper
investigation of `._clustering`/`._supervise`
teardown paths
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Factor the CPU-freq-scaling helper out of
`test_legacy_one_way_streaming` into `conftest.py`
alongside a new `cpu_scaling_factor()` convenience fn
that returns a latency-headroom multiplier (>= 1.0).
Apply it to the two other flaky-timeout tests,
- `test_cancel_via_SIGINT_other_task`: 2s -> scaled
- `test_example[we_are_processes.py]`: 16s -> scaled
Deats,
- add `get_cpu_state()` + `cpu_scaling_factor()` to
`conftest.py` so all test mods can share the logic.
- catch `IndexError` (empty glob) in addition to
`FileNotFoundError`.
- rename `factor` var -> `headroom` at call sites for
clarity on intent.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `get_cpu_state()` helper to read CPU freq settings
from `/sys/devices/system/cpu/` and use it to compensate
the perf time-limit when `auto-cpufreq` (or similar)
scales down the max frequency.
Deats,
- read `*_pstate_max_freq` and `scaling_max_freq`
to compute a `cpu_scaled` ratio.
- when `cpu_scaled != 1.`, increase `this_fast` limit
proportionally (factoring dual-threaded cores).
- log a warning via `test_log` when compensating.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Move the `Arbiter` class out of `runtime._runtime` into its
logical home at `discovery._registry` as `Registrar(Actor)`.
This completes the long-standing terminology migration from
"arbiter" to "registrar/registry" throughout the codebase.
Deats,
- add new `discovery/_registry.py` mod with `Registrar`
class + backward-compat `Arbiter = Registrar` alias.
- rename `Actor.is_arbiter` attr -> `.is_registrar`;
old attr now a `@property` with `DeprecationWarning`.
- `_root.py` imports `Registrar` directly for
root-actor instantiation.
- export `Registrar` + `Arbiter` from `tractor.__init__`.
- `_runtime.py` re-imports from `discovery._registry`
for backward compat.
Also,
- update all test files to use `.is_registrar`
(`test_local`, `test_rpc`, `test_spawning`,
`test_discovery`, `test_multi_program`).
- update "arbiter" -> "registrar" in comments/docstrings
across `_discovery.py`, `_server.py`, `_transport.py`,
`_testing/pytest.py`, and examples.
- drop resolved TODOs from `_runtime.py` and `_root.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adjust all `tractor._state`, `tractor._addr`,
`tractor._supervise`, etc. refs in tests and examples
to use the new `runtime/`, `discovery/`, `spawn/` paths.
Also,
- use `tractor.debug_mode()` pub API instead of
`tractor._state.debug_mode()` in a few test mods
- add explicit `timeout=20` to `test_respawn_consumer_task`
`@tractor_test` deco call
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Restructure the flat `tractor/` top-level private mods
into (more nested) subpackages:
- `runtime/`: `_runtime`, `_portal`, `_rpc`, `_state`,
`_supervise`
- `spawn/`: `_spawn`, `_entry`, `_forkserver_override`,
`_mp_fixup_main`
- `discovery/`: `_addr`, `_discovery`, `_multiaddr`
Each subpkg `__init__.py` is kept lazy (no eager
imports) to avoid circular import issues.
Also,
- update all intra-pkg imports across ~35 mods to use
the new subpkg paths (e.g. `from .runtime._state`
instead of `from ._state`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Refactor the test-fn deco to use `wrapt.decorator`
instead of `functools.wraps` for better fn-sig
preservation and optional-args support via
`PartialCallableObjectProxy`.
Deats,
- add `timeout` and `hide_tb` deco params
- wrap test-fn body with `trio.fail_after(timeout)`
- consolidate per-fixture `if` checks into a loop
- add `iscoroutinefunction()` type-check on wrapped fn
- set `__tracebackhide__` at each wrapper level
Also,
- update imports for new subpkg paths:
`tractor.spawn._spawn`, `tractor.discovery._addr`,
`tractor.runtime._state`
(see upcoming, likely large patch commit ;)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Export the new `RuntimeVars` struct and `get_runtime_vars()`
from `tractor.__init__` and improve the accessor to
optionally return the struct form.
Deats,
- add `RuntimeVars` and `get_runtime_vars` to
`__init__.py` exports; alphabetize `_state` imports.
- move `get_runtime_vars()` up in `_state.py` to sit
right below `_runtime_vars` dict definition.
- add `as_dict: bool = True` param so callers can get
either the legacy `dict` or the new `RuntimeVars`
struct.
- drop the old stub fn at bottom of `_state.py`.
- rm stale `from .msg.pretty_struct import Struct` comment.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
So we can start transition from runtime-vars `dict` to a typed struct
for better clarity and wire-ready monitoring potential, as well as
better traceability when .
Deats,
- add a new `RuntimeVars(Struct)` with all fields from `_runtime_vars`
dict typed out
- include `__setattr__()` with `breakpoint()` for debugging
any unexpected mutations.
- add `.update()` method for batch-updating compat with `dict`.
- keep old `_runtime_vars: dict` in place (we need to port a ton of
stuff to adjust..).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Allow external app code to register custom exception types
on `._exceptions` so they can be re-raised on the receiver
side of an IPC dialog via `get_err_type()`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Expose a copy of the current actor's `_runtime_vars` dict
via a public fn; TODO to convert to `RuntimeVars` struct.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- Add `LocalPortal` union to `query_actor()` return
type and `reg_portal` var annotation since the
registrar yields a `LocalPortal` instance.
- Update docstring to note the `LocalPortal` case.
- Widen `.delete_addr()` `addr` param to accept
`list[str|int]` bc msgpack deserializes tuples as
lists over IPC.
- Tighten `uid` annotation to `tuple[str, str]|None`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`msgpack` deserializes tuples as lists over IPC so
the `bidict.inverse.pop()` needs a `tuple`-cast to
match registry keys.
Regressed-by: 85457cb (`registry_addrs` change)
Found-via: `/run-tests` test_stale_entry_is_deleted
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- Use `bidict.forceput()` in `register_actor()` to handle
duplicate addr values from stale entries or actor restarts.
- Fix `uid` annotation to `tuple[str, str]|None` in
`maybe_open_portal()` and handle the `None` return from
`delete_addr()` in log output.
- Pass explicit `registry_addrs=[reg_addr]` to `open_nursery()`
and `find_actor()` in `test_stale_entry_is_deleted` to ensure
the test uses the remote registrar.
- Update `query_actor()` docstring to document the new
`(addr, reg_portal)` yield shape.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix potential `AttributeError` when `query_actor()` yields
a `None` portal (peer-found-locally path) and an `OSError`
is raised during transport connect.
Also,
- fix `Arbiter.delete_addr()` return type to
`tuple[str, str]|None` bc it can return `None`.
- fix "registar" typo -> "registrar" in comment.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
By spawning an actor task that immediately shuts down the transport
server and then sleeps, verify that attempting to connect via the
`._discovery.find_actor()` helper delivers `None` for the `Portal`
value.
Relates to #184 and #216
In cases where an actor's transport server task (by default handling new
TCP connections) terminates early but does not de-register from the
pertaining registry (aka the registrar) actor's address table, the
trying-to-connect client actor will get a connection error on that
address. In the case where client handles a (local) `OSError` (meaning
the target actor address is likely being contacted over `localhost`)
exception, make a further call to the registrar to delete the stale
entry and `yield None` gracefully indicating to calling code that no
`Portal` can be delivered to the target address.
This issue was originally discovered in `piker` where the `emsd`
(clearing engine) actor would sometimes crash on rapid client
re-connects and then leave a `pikerd` stale entry. With this fix new
clients will attempt connect via an endpoint which will re-spawn the
`emsd` when a `None` portal is delivered (via `maybe_spawn_em()`).
Since stale addrs can be leaked where the actor transport server task
crashes but doesn't (successfully) unregister from the registrar, we
need a remote way to remove such entries; hence this new (registrar)
method.
To implement this make use of the `bidict` lib for the `._registry`
table thus making it super simple to do reverse uuid lookups from an
input socket-address.
- `test_inter_peer_cancellation`: swap all `.uid` refs
on `Actor`, `Channel`, and `Portal` to `.aid.uid`
- `test_legacy_one_way_streaming`: same + fix `print()`
to multiline style
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use 6s timeout on non-linux (vs 4s) in
`test_cancel_while_childs_child_in_sync_sleep()` to avoid
flaky `TooSlowError` on slower CI runners.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- add `tpt_proto: ['tcp', 'uds']` matrix dimension
to the `testing` job.
- exclude `uds` on `macos-latest` for now.
- pass `--tpt-proto=${{ matrix.tpt_proto }}` to the
`pytest` invocation.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Removes the `pytest` deprecation warns and attempts to avoid
some de-registration raciness, though i'm starting to think the
real issue is due to not having the fixes from #366 (which handle
the new dereg on `OSError` case from UDS)?
- use `.channel.aid.uid` over deprecated `.channel.uid`
throughout `test_discovery.py`.
- add polling loop (up to 5s) for subactor de-reg check
in `spawn_and_check_registry()` to handle slower
transports like UDS where teardown takes longer.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code