In cases where an actor's transport server task (by default handling new
TCP connections) terminates early but does not de-register from the
pertaining registry (aka the registrar) actor's address table, the
trying-to-connect client actor will get a connection error on that
address. In the case where client handles a (local) `OSError` (meaning
the target actor address is likely being contacted over `localhost`)
exception, make a further call to the registrar to delete the stale
entry and `yield None` gracefully indicating to calling code that no
`Portal` can be delivered to the target address.
This issue was originally discovered in `piker` where the `emsd`
(clearing engine) actor would sometimes crash on rapid client
re-connects and then leave a `pikerd` stale entry. With this fix new
clients will attempt connect via an endpoint which will re-spawn the
`emsd` when a `None` portal is delivered (via `maybe_spawn_em()`).
Since stale addrs can be leaked where the actor transport server task
crashes but doesn't (successfully) unregister from the registrar, we
need a remote way to remove such entries; hence this new (registrar)
method.
To implement this make use of the `bidict` lib for the `._registry`
table thus making it super simple to do reverse uuid lookups from an
input socket-address.
- add `is_valid` and `sockpath.resolve()` asserts in
`get_rando_addr()` for the `'uds'` case plus an
explicit `UDSAddress` type annotation.
- rename no-runtime sockname prefixes from
`'<unknown-actor>'`/`'root'` to
`'no_runtime_root'`/`'no_runtime_actor'` with a proper
if/else branch in `UDSAddress.get_random()`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add UDS skip-guard to `test_streaming_to_actor_cluster()`
and plumb `tpt_proto` through the `@tractor_test` wrapper
so transport-parametrized tests can receive it.
Deats,
- skip cluster test when `tpt_proto == 'uds'` with
descriptive msg, add TODO about `@pytest.mark.no_tpt`.
- add `tpt_proto: str|None` param to inner wrapper in
`tractor_test()`, forward to decorated fn when its sig
accepts it.
- register custom `no_tpt` marker via `pytest_configure()`
to avoid unknown-marker warnings.
- add masked todo for `no_tpt` marker-check code in `tpt_proto` fixture
(needs fn-scope to work, left as TODO).
- add `request` param to `tpt_proto` fixture for future
marker inspection.
Also,
- add doc-string to `test_streaming_to_actor_cluster()`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Namely the workaround expected exc branches added in ef7ed7a for the UDS
parametrization. With the new boxing of the underlying CREs as
tpt-closed, we can expect the same exc outcomes as in the TCP cases.
Also this tweaks some error report logging content used while debugging
this,
- properly `repr()` the `TransportClosed.src_exc`-type from
the maybe emit in `.report_n_maybe_raise()`.
- remove the redudant `chan.raddr` from the "closed abruptly"
header in the tpt-closed handler of `._rpc.process_messages()`,
the `Channel.__repr__()` now contains it by default.
Forward the `tpt_proto` fixture val into spawned daemon
subprocesses via `run_daemon(enable_transports=..)` and
sync `_runtime_vars['_enable_tpts']` in the `tpt_proto`
fixture so sub-actors inherit the transport setting.
Deats,
- add `enable_transports={enable_tpts}` to the daemon
spawn-cmd template in `tests/conftest.py`.
- set `_state._runtime_vars['_enable_tpts']` in the
`tpt_proto` fixture in `_testing/pytest.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
I started getting annoyed by all the warnings from `pytest` during work
on macos suport in CI, so this replaces all `Actor.uid`/`Channel.uid`
accesses with `.aid.uid` (or `.aid.reprol()` for log msgs) across the
core runtime and IPC subsystems to avoid the noise.
This also provides incentive to start the adjustment to all
`.uid`-holding/tracking internal `dict`-tables/data-structures to
instead use `.msg.types.Aid`. Hopefully that will come a (vibed?) follow
up shortly B)
Deats,
- `._context`: swap all `self._actor.uid`, `self.chan.uid`,
and `portal.actor.uid` refs to `.aid.uid`; use
`.aid.reprol()` for log/error formatting.
- `._rpc`: same treatment for `actor.uid`, `chan.uid` in
log msgs and cancel-scope handling; fix `str(err)` typo
in `ContextCancelled` log.
- `._runtime`: update `chan.uid` -> `chan.aid.uid` in ctx
cache lookups, RPC `Start` msg, registration and
cancel-request handling; improve ctxc log formatting.
- `._spawn`: replace all `subactor.uid` with
`.aid.uid` for child-proc tracking, IPC peer waiting,
debug-lock acquisition, and nursery child dict ops.
- `._supervise`: same for `subactor.uid` in cancel and
portal-wait paths; use `actor.aid.uid` for error dict.
- `._state`: fix `last.uid` -> `last.aid.uid` in
`current_actor()` error msg.
Also,
- `._chan`: make `Channel.aid` a proper `@property` backed
by `._aid` so we can add validation/typing later.
- `.log`: use `current_actor().aid.uuid` instead of
`.uid[1]` for actor-uid log field.
- `.msg.types`: add TODO comment for `Start.aid` field
conversion.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deliver `(LinkedTaskChannel, Any)` instead of the prior `(first, chan)`
order from `open_channel_from()` to match the type annotation and be
consistent with `trio.open_*_channel()` style where the channel obj
comes first.
- flip `yield first, chan` -> `yield chan, first`
- update type annotation + docstring to match
- swap all unpack sites in tests and examples
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Address valid findings from copilot's PR #413 review
(https://github.com/goodboy/tractor/pull/413
#pullrequestreview-3925876037):
- `.get()` docstring referenced non-existent
`._from_trio` attr, correct to `._to_aio`.
- `.send()` docstring falsely claimed error-raising
on missing `from_trio` arg; reword to describe the
actual `.put_nowait()` enqueue behaviour.
- `.open_channel_from()` return type annotation had
`tuple[LinkedTaskChannel, Any]` but `yield` order
is `(first, chan)`; fix annotation + docstring to
match actual `tuple[Any, LinkedTaskChannel]`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
This change is masked out now BUT i'm leaving it in for reference.
I was debugging a multi-actor fault where the primary source actor was
an infected-aio-subactor (`brokerd.ib`) and it seemed like the REPL was only
entering on the `trio` side (at a `.open_channel_from()`) and not
eventually breaking in the `asyncio.Task`. But, since (changing
something?) it seems to be working now, it's just that the `trio` side
seems to sometimes handle before the (source/causing and more
child-ish) `asyncio`-task, which is a bit odd and not expected..
We could likely refine (maybe with an inter-loop-task REPL lock?) this
at some point and ensure a child-`asyncio` task which errors always
grabs the REPL **first**?
Lowlevel deats/further-todos,
- add (masked) `maybe_open_crash_handler()` block around
`asyncio.Task` execution with notes about weird parent-addr
delivery bug in `test_sync_pause_from_aio_task`
* yeah dunno what that's about but made a bug; seems to be IPC
serialization of the `TCPAddress` struct somewhere??
- add inter-loop lock TODO for avoiding aio-task clobbering
trio-tasks when both crash in debug-mode
Also,
- change import from `tractor.devx.debug` to `tractor.devx`
- adjust `get_logger()` call to use new implicit mod-name detection
added to `.log.get_logger()`, i.e. sin `name=__name__`.
- some teensie refinements to `open_channel_from()`:
* swap return type annotation for to `tuple[LinkedTaskChannel, Any]`
(was `Any`).
* update doc-string to clarify started-value delivery
* add err-log before `.pause()` in what should be an unreachable path.
* add todo to swap the `(first, chan)` pair to match that of ctx..
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
With methods to comms similar to those that exist for the `trio` side,
- `.get()` which proxies verbatim to the `._to_aio: asyncio.Queue`,
- `.send_nowait()` which thin-wraps to `._to_trio: trio.MemorySendChannel`.
Obviously the more correct design is to break up the channel type into
a pair of handle types, one for each "side's" task in each event-loop,
that's hopefully coming shortly in a follow up patch B)
Also,
- fill in some missing doc strings, tweak some explanation comments and
update todos.
- adjust the `test_aio_errors_and_channel_propagates_and_closes()` suite
to use the new `chan` fn-sig-API with `.open_channel_from()` including
the new methods for msg comms; ensures everything added here works e2e.
Skip fields starting with `_` in pretty-printed struct output
to avoid cluttering displays with internal/private state (and/or accessing
private properties which have errors Bp).
Deats,
- add `if k[0] == '_': continue` check to skip private fields
- change nested `if isinstance(v, Struct)` to `elif` since we
now have early-continue for private fields
- mv `else:` comment to clarify it handles top-level fields
- fix indentation of `yield` statement to only output
non-private, non-nested fields
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Per the questionable `copilot` review which is detailed for follow up in
https://github.com/goodboy/tractor/issues/418. These constants are
directly linked from the kernel sources fwiw.
Though it was a good (vibed) try by @dnks, the previous "fix" was not
actually adding unix socket support but merely sidestepping a crash due
to `get_peer_info()`'s impl never going to work on MacOS (and it was
never intended to).
This patch instead solves the underlying issue by implementing a new
`get_peer_pid()` helper which does in fact retrieve the peer's PID in
a more generic/cross-platform way (:fingers_crossed:); much thanks to
the linked SO answer for this solution!
Impl deats,
- add `get_peer_pid()` and call it from
`MsgpackUDSStream.get_stream_addrs()` when we detect a non-'linux'
platform, OW use the original soln: `get_stream_addrs()`.
- add a new case for the `match (peername, sockname)` with a
`case (str(), str()):` which seems to at least work on macos.
- drop all the `LOCAL_PEERCRED` dynamic import branching since it was
never needed and was never going to work.
Same problem as for the `ShmArray` tokens, so tweak and reuse
the `_shorten_key_for_macos()` helper and call it from
`open_shm_list()` similarly.
Some tweaks/updates to the various helpers,
- support `prefix/suffix` inputs and if provided take their lengths and
subtract them from the known *macOS shm_open() has a 31 char limit
(PSHMNAMLEN)* when generating and using the `hashlib.sha256()` value
which overrides (for now..) wtv `key` is passed by the caller.
- pass the appropriate `suffix='_first/_last'` values for the `ShmArray`
token generators case.
- add a `prefix: str = 'shml_'` param to `open_shm_list()`.
- better log formatting with `!r` to report any key shortening.
Adapt the `PSHMNAMLEN` fix from `piker.data._sharedmem` (orig commit
96fb79ec thx @dnks!) to `tractor.ipc._shm` accounting for the
module-local differences:
- Add `hashlib` import for sha256 key hashing
- Add `key: str|None` field to `NDToken` for storing
the original descriptive key separate from the
(possibly shortened) OS-level `shm_name`
- Add `__eq__()`/`__hash__()` to `NDToken` excluding
the `key` field from identity comparison
- Add `_shorten_key_for_macos()` using `t_` prefix
(vs piker's `p_`) with 16 hex chars of sha256
- Use `platform.system() == 'Darwin'` in `_make_token()`
(tractor already imports the `platform` module vs
piker's `sys.platform`)
- Wrap `shm_unlink()` in `ShmArray.destroy()` with
`try/except FileNotFoundError` for teardown races
(was already done in `SharedInt.destroy()`)
- Move token creation before `SharedMemory()` alloc in
`open_shm_ndarray()` so `token.shm_name` is used
as the OS-level name
- Use `lookup_key` pattern in `attach_shm_ndarray()`
to decouple `_known_tokens` dict key from OS name
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Move the multi-platorm-supporting conditional/dynamic `socket` constant
imports to *after* the main cross-platform ones.
Also add constant typing and reformat comments a bit for the macOS case.
Make socket credential imports platform-conditional in `.ipc._uds`.
- Linux: use `SO_PASSCRED`/`SO_PEERCRED` from socket module
- macOS: use `LOCAL_PEERCRED` (0x0001) instead, no need for `SO_PASSCRED`
- Conditionally call `setsockopt(SO_PASSCRED)` only on Linux
Fixes AttributeError on macOS where SO_PASSCRED doesn't exist.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Thanks to the `tox`-dev community for such a lovely pkg which seems to
solves all the current cross-platform user-dir problems B)
Also this,
- now passes `platformdirs.user_runtime_dir(appname='tractor')`
and allows caller to pass an optional `subdir` under `tractor/`
if desired.
- drops the `.config._rtdir: Path` mod var.
- bumps the lock file with the new dep.
Expand and clarify the comment for the default `case _`
block in the `.send()` error matcher, noting that we
console-error and raise-thru for unexpected disconnect
conditions.
(this patch was suggested by copilot in,
https://github.com/goodboy/tractor/pull/411)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Improve the `spawn` fixture teardown logic in
`tests/devx/conftest.py` fixing the while-else bug, and fix
`test_advanced_faults` genexp for `TransportClosed` exc type
checking.
Deats,
- replace broken `while-else` pattern with direct
`if ptyproc.isalive()` check after the SIGINT loop.
- fix undefined `spawned` ref -> `ptyproc.isalive()` in
while condition.
- improve walrus expr formatting in timeout check (multiline
style).
Also fix `test_ipc_channel_break_during_stream()` assertion,
- wrap genexp in `all()` call so it actually checks all excs
are `TransportClosed` instead of just creating an unused
generator.
(this patch was suggested by copilot in,
https://github.com/goodboy/tractor/pull/411)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Brief descriptions for both fns in `._discovery` clarifying
what each delivers and under what conditions.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Flesh out missing method doc-strings, improve log msg formatting and
assert -> `RuntimeError` for un-inited tpt layer.
Deats,
- add doc-string to `.send()` noting `TransportClosed` raise
on comms failures.
- add doc-string to `.recv()`.
- expand `._aiter_msgs()` doc-string, line-len reflow.
- add doc-string to `.connected()`.
- convert `assert self._transport` -> `RuntimeError` raise
in `._aiter_msgs()` for more explicit crashing.
- expand `_connect_chan()` doc-string, note it's lowlevel
and suggest `.open_portal()` to user instead.
- factor out `src_exc_str` in `TransportClosed` log handler
to avoid double-call
- use multiline style for `.connected()` return expr.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Remove the `trio.ClosedResourceError` and `trio.BrokenResourceError`
handling that should now be subsumed by `TransportClosed` re-raising out
of the `.ipc` stack.
Deats,
- drop CRE and BRE from `._streaming.MsgStream.aclose()/.send()` blocks.
- similarly rm from `._context.open_context_from_portal()`.
- also from `._portal.Portal.cancel_actor()` and drop the
(now-completed-todo) comment about this exact thing.
Also add comment in `._rpc.try_ship_error_to_remote()` noting the
remaining `trio` catches there are bc the `.ipc` layers *should* be
wrapping them; thus `log.critical()` use is warranted.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `TransportClosed` to except clauses where `trio`'s own
resource-closed errors are already caught, ensuring our
higher-level tpt exc is also tolerated in those same spots.
Likely i will follow up with a removal of the `trio` variants since most
*should be* caught and re-raised as tpt-closed out of the `.ipc` stack
now?
Add `TransportClosed` to various handler blocks,
- `._streaming.MsgStream.aclose()/.send()` except blocks.
- the broken-channel except in `._context.open_context_from_portal()`.
- obvi import it where necessary in those ^ mods.
Adjust `test_advanced_faults` suite + exs-script to match,
- update `ipc_failure_during_stream.py` example to catch
`TransportClosed` alongside `trio.ClosedResourceError`
in both the break and send-check paths.
- shield the `trio.sleep(0.01)` after tpt close in example to avoid
taskc-raise/masking on that checkpoint since we want to simulate
waiting for a user to send a KBI.
- loosen `ExceptionGroup` assertion to `len(excs) <= 2` and ensure all
excs are `TransportClosed`.
- improve multi-line formatting, minor style/formatting fixes in
condition expressions.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Refine tpt-error reporting to include closure attribution (`'locally'`
vs `'by peer'`), tighten match conditions and reduce needless newlines
in exc reprs.
Deats,
- factor out `trans_err_msg: str` and `by_whom: str` into a `dict`
lookup before the `match:` block to pair specific err msgs to closure
attribution strings.
- use `by_whom` directly as `CRE` case guard condition
(truthy when msg matches known underlying CRE msg content).
- conveniently include `by_whom!r` in `TransportClosed` message.
- fix `'locally ?'` -> `'locally?'` in send-side `CRE`
handler (drop errant space).
- add masked `maybe_pause_bp()` calls at both `CRE` sites (from when
i was tracing a test harness issue where the UDS socket path wasn't
being cleaned up on teardown).
- drop trailing `\n` from `body=` args to `TransportClosed`.
- reuse `trans_err_msg` for the `BRE`/broken-pipe guard.
Also adjust testing, namely `test_ctxep_pauses_n_maybe_ipc_breaks`'s
expected patts-set for new msg formats to be raised out of
`.ipc._transport`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Internal helper which falls back to sync `pdb` when the
child actor can't reach root to acquire the TTY lock.
Useful when debugging tpt layer failures (intentional or
otherwise) where a sub-actor can no longer IPC-contact the
root to coordinate REPL access; root uses `.pause()` as
normal while non-root falls back to `mk_pdb().set_trace()`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Unmask the CRE case block for peer-closed socket errors which already
had a TODO about reproducing the condition. It appears this case can
happen during inter-actor comms teardowns in `piker`, but i haven't been
able to figure out exactly what reproduces it yet..
So activate the block again for that 'socket already closed'-msg case,
and add a TODO questioning how to reproduce it.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
A partial revert of commit c05d08e426 since it seem we already
suppress tpt-closed errors lower down in `.ipc.Channel.send()`; given
that i'm pretty sure this new handler code should basically never run?
Left in a todo to remove the masked content once i'm done more
thoroughly testing under `piker`.
For IPC-disconnects-during-teardown edge cases, augment some `._rpc`
machinery,
- in `._invoke()` around the `await chan.send(return_msg)` where we
suppress if the underlying `Channel` already disconnected.
- add a disjoint handler in `_errors_relayed_via_ipc()` which just
reports-n-reraises the exc (same as prior behaviour).
* originally i thought it needed to be handled specially (to avoid
being crash handled) but turns out that isn't necessary?
* hence the also-added-bu-masked-out `debug_filter` / guard expression
around the `await debug._maybe_enter_pm()` line.
- show the `._invoke()` frame for the moment.
Move `_root_addrs` assignment to after `async_main()` unblocks (via
`.started()`) which now delivers the bind addrs , ensuring correct
`UnwrappedAddress` propagation into `._state._runtime_vars` for
non-registar root actors..
Previously for non-registrar root actors the `._state._runtime_vars`
entries were being set as `Address` values which ofc IPC serialize
incorrectly rn vs. the unwrapped versions, (well until we add a msgspec
for their structs anyway) and thus are passed in incorrect form to
children/subactors during spawning..
This fixes the issue by waiting for the `.ipc.*` stack to
bind-and-resolve any randomly allocated addrs (by the OS) until after
the initial `Actor` startup is complete.
Deats,
- primarily, mv `_root_addrs` assignment from before `root_tn.start()`
to after, using started(-ed) `accept_addrs` now delivered from
`._runtime.async_main()`..
- update `task_status` type hints to match.
- unpack and set the `(accept_addrs, reg_addrs)` tuple from
`root_tn.start()` call into `._state._runtime_vars` entries.
- improve and embolden comments distinguishing registrar vs non-registrar
init paths, ensure typing reflects wrapped vs. unwrapped addrs.
Also,
- add a masked `mk_pdb().set_trace()` for debugging `raddrs` values
being "off".
- add TODO about using UDS on linux for root mailbox
- rename `trans_bind_addrs` -> `tpt_bind_addrs` for clarity.
- expand comment about random port allocation for
non-registrar case
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use new implicit module-name detection throughout codebase to simplify
logger creation and leverage auto-naming from caller mod .
Main changes,
- drop `name=__name__` arg from all `get_logger()` calls
(across 29 modules).
- update `get_console_log()` calls to include `name='tractor'` for
enabling root logger in test harness and entry points; this ensures
logic in `get_logger()` triggers so that **all** `tractor`-internal
logging emits to console.
- add info log msg in test `conftest.py` showing test-harness
log level
Also,
- fix `.actor.uid` ref to `.actor.aid.uid` in `._trace`.
- adjust a `._context` log msg formatting for clarity.
- add TODO comments in `._addr`, `._uds` for when we mv to
using `multiaddr`.
- add todo for `RuntimeVars` type hint TODO in `.msg.types` (once we
eventually get that all going obvi!)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Overhaul of the automatic-calling-module-name detection and sub-log
creation logic to avoid (at least warn) on duplication(s) and still
handle the common usage of a call with `name=__name__` from a mod's top
level scope.
Main `get_logger()` changes,
- refactor auto-naming logic for implicit `name=None` case such that we
handle at least `tractor` internal "bare" calls from internal submods.
- factor out the `get_caller_mod()` closure (still inside
`get_logger()`)for introspecting caller's module with configurable
frame depth.
- use `.removeprefix()` instead of `.lstrip()` for stripping pkg-name
from mod paths
- mv root-logger creation before sub-logger name processing
- improve duplicate detection for `pkg_name` in `name`
- add `_strict_debug=True`-only-emitted warnings for duplicate
pkg/leaf-mod names.
- use `print()` fallback for warnings when no actor runtime is up at
call time.
Surrounding tweaks,
- add `.level` property to `StackLevelAdapter` for getting
current emit level as lowercase `str`.
- mv `_proj_name` def to just above `get_logger()`
- use `_curr_actor_no_exc` partial in `_conc_name_getters`
to avoid runtime errors
- improve comments/doc-strings throughout
- keep some masked `breakpoint()` calls for future debugging
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
That is when no `name` is passed to `get_logger()`, try to introspect
the caller's `module.__name__` and use it to infer/get the "namespace
path" to that module the same as if using `name=__name__` as in the most
common usage.
Further, change the `_root_name` to be `pkg_name: str`, a public and
more obvious param name, and deprecate the former. This obviously adds
the necessary impl to make the new
`test_sys_log::test_implicit_mod_name_applied_for_child` test pass.
Impl detalles for `get_logger()`,
- add `pkg_name` and deprecate `_root_name`, include failover logic
and a warning.
- implement calling module introspection using
`inspect.stack()/getmodule()` to get both the `.__name__` and
`.__package__` info alongside adjusted logic to set the `name`
when not provided but only when a new `mk_sublog: bool` is set.
- tweak the `name` processing for implicitly set case,
- rename `sub_name` -> `pkg_path: str` which is the path
to the calling module minus that module's name component.
- only partition `name` if `pkg_name` is `in` it.
- use the `_root_log` for `pkg_name` duplication warnings.
Other/related,
- add types to various public mod vars missing them.
- rename `.log.log` -> `.log._root_log`.
To start ensuring that when `name=__name__` is passed we try to
de-duplicate the `_root_name` and any `leaf_mod: str` since it's already
included in the headers as `{filename}`.
Deats,
- heavily document the de-duplication `str.partition()`s in
`.log.get_logger()` and provide the end fix by changing the predicate,
`if rname == 'tractor':` -> `if rname == _root_name`.
* also toss in some warnings for when we still detect duplicates.
- add todo comments around logging "filters" (vs. our "adapter").
- create the new `test_log_sys.test_root_pkg_not_duplicated()` which
runs green with the fixes from ^.
- add a ton of test-suite todos both for existing and anticipated
logging sys feats in the new mod.
Such that we are able to (finally) detect when we should
`Context._scope.cancel()` specifically when the `.parent_task` is
**not** blocking on receiving from the underlying `._rx_chan`, since if
the task is blocking on `.receive()` it will call `.cancel()`
implicitly.
This is a lot to explain with very little code actually needed for the
implementation (are we like `trio` yet anyone?? XD) but the main jist is
that `Context._maybe_cancel_and_set_remote_error()` needed the
additional case of calling `._scope.cancel()` whenever we know that
a remote-error/ctxc won't be immediately handled, bc user code is doing
non `Context`-API things, and result in a similar outcome as if that
task was waiting on `Context.wait_for_result()` or `.__aexite__()`.
Impl details,
- add a new `._is_blocked_on_rx_chan()` method which predicates whether
the (new) `.parent_task` is blocking on `._rx_chan.receive()`.
* see various stipulations about the current impl and how we might
need to adjust for the future given `trio`'s commitment to the
`Task.custom_sleep_data` attr..
- add `.parent_task`, a pub wrapper for `._task`.
- check for `not ._is_blocked_on_rx_chan()` before manually cancelling
the local `.parent_task`
- minimize the surrounding branch case expressions.
Other,
- tweak a couple logs.
- add a new `.cancel()` pre-started msg.
- mask the `.cancel_called` setter, it's only (been) used for tracing.
- todos around maybe moving the `._nursery` allocation "around" the
`.start_remote_task()` call and various subsequent tweaks therein.
Since it turns out there's even case(s) in `trio` core that are guilty
(of implementing things like checkpoints in exc handlers), this adds
facility for ignoring explicit cases via `inspect.FrameInfo` field
matching from the unmasked `exc_ctx` within
`maybe_raise_from_masking_exc()`.
Impl deats,
- use `inspect.getinnerframes()/getmodule()` to extract the equivalent
"guilty place in code" which raised the masked error which we'd like
to ignore and **not unmask**.
- start a `_mask_cases: dict` which describes the entries to ignore
by matching against a specific `FrameInfo`'s fields from indexed
from `getinnerframes()`.
- describe in that table the case i hit with `trio.WouldBlock` being
always masked by a `Cancelled` due to way `trio.Lock.acquire()`
implements the blocking case in the would-block handler..
- always call into a new `is_expected_masking_case()` predicate (from
`maybe_raise_from_masking_exc()`) on matching `exc_ctx` types.
The correct ordering is to de-alloc the surrounding `service_n`
+ `trio.Event` **after** the `mng` teardown ensuring the
`mng.__aexit__()` never can hit a ref-error if it touches either (like
if a `tn` is passed to `maybe_open_context()`!
Such that key->value pairs can be defined which should *never be*
unmasked where values of
- the keys are exc-types which might be masked, and
- the values are exc-types which masked the equivalent key.
For example, the default includes:
- KBI->taskc: a kbi should never be unmasked from its masking
`trio.Cancelled`.
For the impl, a new `do_warn: bool` in the fn-body determines the
primary guard for whether a warning or re-raising is necessary.
Including all caller usage throughout. Moving to a non-`except*` impl
means it's never needed as a signal from the caller - we can just catch
the beg outright (like we should have always been doing)..
That is from `maybe_raise_from_masking_exc()` thus minimizing us to
a single `except BaseException` block with logic branching for the beg
vs. `unmask_from` exc cases.
Also,
- raise val-err when `unmask_from` is not a `tuple`.
- tweak the exc-note warning format.
- drop all pausing from dev work.
Turns out we weren't despite the optional `stream_handler_nursery` input
to `Server.listen_on()`; fail over to the `Server._stream_handler_tn`
allocated during server setup in those cases.
Turns out I didn't read my own internals docs/comments and despite it
not being used previously, this adds the real use case: a root,
per-actor, scope which ensures parent comms are the last conc-thing to
be cancelled.
Also, the impl changes here make the test from 6410e45 (or wtv
it's rebased to) pass, i.e. we can support crash handling in the root
actor despite the root-tn having been (self) cancelled.
Superficial adjustments,
- rename `Actor._service_n` -> `._service_tn` everywhere.
- add asserts to `._runtime.async_main()` which ensure that the any
`.trionics.maybe_open_nursery()` calls against optionally passed
`._[root/service]_tn` are allocated-if-not-provided (the
`._service_tn`-case being an i-guess-prep-for-the-future-anti-pattern
Bp).
- obvi adjust all internal usage to match new naming.
Serious/real-use-case changes,
- add (back) a `Actor._root_tn` which sits a scope "above" the
service-tn and is either,
+ assigned in `._runtime.async_main()` for sub-actors OR,
+ assigned in `._root.open_root_actor()` for the root actor.
**THE primary reason** to keep this "upper" tn is that during
a full-`Actor`-cancellation condition (more details below) we want to
ensure that the IPC connection with a sub-actor's parent is **the last
thing to be cancelled**; this is most simply implemented by ensuring
that the `Actor._parent_chan: .ipc.Channel` is handled in an upper
scope in `_rpc.process_messages()`-subtask-terms.
- for the root actor this `root_tn` is allocated in `.open_root_actor()`
body and assigned as such.
- extend `Actor.cancel_soon()` to be cohesive with this entire teardown
"policy" by scheduling a task in the `._root_tn` which,
* waits for the `._service_tn` to complete and then,
* cancels the `._root_tn.cancel_scope`,
* includes "sclangy" console logging throughout.
I masked it bc it doesn't seem to actually work for the case I was
testing (`emsd` clobbering a `paperboi` in `piker`..) but figured I'd
leave it as a reminder for solving this problem more generally (#320)
since this is likely the place in the code for a soln.
When i tested it in my case it just resulted in a hang around the `with
debug.acquire_debug_lock()` for some reason? Can't remember if the child
ended up being able to REPL without issue though..
Such that we handle them despite a cancellation condition. This is
almost always the case, that `root_tn.cancel_scope.cancel_called` is
set, by the time the `debug._maybe_enter_pm()` hits. Previous I guess we
just weren't actually ever REPL-debugging such cases?
TODO, still needs a test obvi!
That is the `target` can declare a `chan: LinkedTaskChannel` instead of
`to_trio`/`from_aio`.
To support it,
- change `.started()` -> the more appropriate `.started_nowait()` which
can be called sync from the aio child task.
- adjust the `provide_channels` assert to accept either fn sig
declaration (for now).
Still needs test(s) obvi..
Makes the newly added `test_aio_side_raises_before_started` test pass by
ensuring errors raised by any `.to_asyncio.open_channel_from()` spawned
child-`asyncio.Task` are relayed by any caught `trio.EndOfChannel` by
checking for a new `LinkedTaskChannel._closed_by_aio_task: bool`.
Impl deats,
- obvi add `LinkedTaskChannel._closed_by_aio_task: bool = False`
- in `translate_aio_errors()` always check for the new flag on EOC
conditions and in such cases set `chan._trio_to_raise = aio_err` such
that the `trio`-parent-task always raises the child's exception
directly, OW keep original EoC passthrough in place.
- include *very* detailed per-case comments around the extended handler.
- adjust re-raising logic with a new `raise_from` where we only give the
`aio_err` priority if it's not already set as to `trio_to_raise`.
Also,
- hide the `_run_asyncio_task()` frame by def.
Been meaning to do this forever and a recent test hang finally drove me
to it Bp
Like it sounds, adopt the "cancel-status" properties on `ActorNursery`
use already on our `Context` and derived from `trio.CancelScope`:
- add new private `._cancel_called` (set in the head of `.cancel()`)
& `._cancelled_caught` (set in the tail) instance vars with matching
read-only `@properties`.
- drop the instance-var and instead delegate a `.cancelled: bool`
property to `._cancel_called` and add a usage deprecation warning
(since removing it breaks a buncha tests).
So i need to either adjust the tests or figure out if/why this is needed
to avoid the crashing in `pikerd` i found when killin the chart during
a long backfill with `binance` backend..
Just like in the BRE case (for UDS) it seems when a peer closes the
(UDS?) socket `trio` instead raises a `ClosedResourceError` which we now
catch and re-raise as a `TransportClosed`. This again results in
`tpt.send()` calls from the rpc-runtime **not** raising when it's known
that the IPC channel is disconnected.
Shift around comments and expressions for better reading, assign
`tpt_closed` for easier introspection from REPL during debug oh and fix
the `MsgpackTransport.pformat()` to render '|_peers: 1' .. XD
Similar to what was just changed for `Context.repr_state`, when the
child task is cancelled but by a different "layer" of the runtime (i.e.
a `Portal.cancel_actor()` / `SIGINT`-to-process canceller) we don't
dump a traceback instead just `log.cancel()` emit.
Such that if the local task hasn't resolved but is `trio.Cancelled` and
a `.canceller` was set, we report a `'actor-cancelled'` from
`.repr_state: str`. Bit of formatting to avoid needless newlines too!
Since it's merely a local-file-sys subdirectory and there should be no
reason file creation conflicts with other bind spaces.
Also add 2 test suites to match,
- `tests/ipc/test_each_tpt::test_uds_bindspace_created_implicitly` to
verify the dir creation when DNE.
- `..test_uds_double_listen_raises_connerr` to ensure a double bind
raises a `ConnectionError` from the src `OSError`.
Using `.get_stream_addrs()` such that we always (*can*) assign the peer
end's PID in the `._raddr`.
Also factor common `ConnectionError` re-raising into
a `_reraise_as_connerr()`-@cm.
Such that the caller can be responsible for their own (nursery) scoping
as needed and, for the latter fn's case with
a `trio.Nursery.CancelStatus.encloses()` check to ensure the `tn` is
a valid parent-ish.
Some deats,
- in `gather_contexts()`, mv the `try/finally` outside the nursery block
to ensure we always do the `parent_exit`.
- for `maybe_open_context()` we do a naive task-tree hierarchy audit to
ensure the provided scope is not *too* child-ish (with what APIs `trio`
gives us, see above), OW go with the old approach of using the actor's
private service nursery.
Also,
* better report `trio.Cancelled` around the cache-miss `yield`
cases and ensure we **never** unmask triggering key-errors.
* report on any stale-state with the mutex in the `finally` block.
Since the `await service_n.start()` on key-err can be cancel-masked
(checkpoint interrupted before `_Cache.run_ctx` completes), we need to
always `lock.release()` in to avoid lock-owner-state corruption and/or
inf-hangs in peer cache-hitting tasks.
Deats,
- add a `try/except/finally` around the key-err triggered cache-miss
`service_n.start(_Cache.run_ctx, ..)` call, reporting on any taskc
and always `finally` unlocking.
- fill out some log msg content and use `.debug()` level.
As mentioned in prior testing commit, it can cause the worst kind of
hangs, the SIGINT ignoring kind.. Pretty sure there was never any reason
outside some esoteric multi-actor debugging case, and pretty sure that
already was solved?
Namely `test_empty_mngrs_input_raises()` was failing due to
lazy-iterator use as input to `mngrs` which i guess i added support for
a while back (by it doing a `list(mngrs)` internally)? So just change it
to `gather_contexts(mngrs=())` and also tweak the `trio.fail_after(3)`
since it appears that the prior 1sec was causing
too-fast-of-a-cancellation (before the cluster fully spawned) and thus
the expected `ValueError` never to show..
Also, mask the `tractor.trionics.collapse_eg()` usage (again?) in
`open_actor_cluster()` since it seems unnecessary.
Seems that the way the actor-nursery interacts with the
`.trionics.gather_contexts()` API on cancellation makes our
`.trionics.collapse_eg()` not work as intended?
I need to dig into how `ActorNursery.cancel()` and `.__aexit__()` might
be causing this discrepancy..
Consider this a commit-of-my-index type save for rn.
It was originally this way; I forgot to flip it back when discarding the
`except*` handler impl..
Specially handle the `exc.__cause__` case where we raise from any
detected underlying cause and OW `from None` to suppress the eg's tb.
Since it turns out the semantics are basically inverse of normal
`except` (particularly for re-raising) which is hard to get right, and
bc it's a lot easier to just delegate to what `trio` already has behind
the `strict_exception_groups=False` setting, Bp
I added a rant here which will get removed shortly likely, but i think
going forward recommending against use of `except*` is prudent for
anything low level enough in the runtime (like trying to filter begs).
Dirty deats,
- copy `trio._core._run.collapse_exception_group()` to here with only
a slight mod to remove the notes check and tb concatting for the
collapse case.
- rename `maybe_collapse_eg()` - > `get_collapsed_eg()` and delegate it
directly to the former `trio` fn; return `None` when it returns the
same beg without collapse.
- simplify our own `collapse_eg()` to either raise the collapsed `exc`
or original `beg`.
I dunno what exactly I was thinking but we definitely don't want to
**ever** raise from the original exc-group, instead always raise from
any original `.__cause__` to be consistent with the embedded src-error's
context.
Also, adjust `maybe_collapse_eg()` to return `False` in the non-single
`.exceptions` case, again don't know what I was trying to do but this
simplifies caller logic and the prior return-semantic had no real
value..
This fixes some final usage in the runtime (namely top level nursery
usage in `._root`/`._runtime`) which was previously causing test suite
failures prior to this fix.
Seems to cause the following test suites to fail however..
- 'test_advanced_faults.py::test_ipc_channel_break_during_stream'
- 'test_advanced_faults.py::test_ipc_channel_break_during_stream'
- 'test_clustering.py::test_empty_mngrs_input_raises'
Also tweak some ctxc request logging content.
That is just throughout the core library, not the tests yet. Again, we
simply change over to using our (nearly equivalent?)
`.trionics.collapse_eg()` in place of the already deprecated
`strict_exception_groups=False` flag in the following internals,
- the conc-fan-out tn use in `._discovery.find_actor()`.
- `._portal.open_portal()`'s internal tn used to spawn a bg rpc-msg-loop
task.
- the daemon and "run-in-actor" layered tn pair allocated in
`._supervise._open_and_supervise_one_cancels_all_nursery()`.
The remaining loose-eg usage in `._root` and `._runtime` seem to be
necessary to keep the test suite green?? For the moment these are left
out.
- drop the (never/un)used `.get_chans()`.
- add #TODO for factoring many methods into a new `.rpc`-subsys/pkg
primitive, like an `RPCMngr/Server` type eventually.
- add todo to maybe mv `.get_parent()` elsewhere?
- move masked `._hard_mofo_kill()` to bottom.
Namely inside various bootup-sequences in `._root` and `._runtime`
particularly in the root actor to support both better tpt-address
denoting in our logging and as part of clarifying logic around setting
the root's registry addresses which is soon to be much better factored
out of the core and into an explicit subsystem + API.
Some `_root.open_root_actor()` deats,
- set `registry_addrs` to a new `uw_reg_addrs` (uw: unwrapped) to be
more explicit about wrapped addr types thoughout.
- instead ensure `registry_addrs` are the wrapped types and pass down
into the root `Actor` singleton-instance.
- factor the root-actor check + rt-vars update (updating the `'_root_addrs'`)
out of `._runtime.async_main()` into this fn.
- as previous, set `trans_bind_addrs = uw_reg_addrs` in unwrapped form since it will
be passed down both through rt-vars as `'_root_addrs'` and to
`._runtim.async_main()` as `accept_addrs` (which is then passed to the
IPC server).
- adjust/simplify much logging.
- shield the `await actor.cancel(None) # self cancel` to avoid any
finally-footguns.
- as mentioned convert the
For `_runtime.async_main()` tweaks,
- expect `registry_addrs: list[Address]|None = None` with appropriate
unwrapping prior to setting both `.reg_addrs` and the equiv rt-var.
- add a new `.registry_addrs` prop for the wrapped form.
- convert a final loose-eg for the `service_nursery` to use
`collapse_eg()`.
- simplify teardown report logging.
Mostly just multi-line code styling again: always putting standalone
`'f\n'` on separate LOC so it reads like it renders to console. Oh and
and a level drop to `.runtime()` for rx-msg reports.
Again using `Channel.aid.reprol()`, `.devx.pformat.nest_from_op()` and
converting to multi-line code style an ' for str-report-contents. Tweak
some imports to sub-mod level as well.
Again for nicer console logging. Also fix a double `req_chan` arg bug
when passed to `_invoke` in the `self.cancel()` rt-ep; don't update the
`kwargs: dict` just merge in `req_chan` input at call time.
Including a new one-line `_shutdown_msg: str` which we mod-var-set for
testing usage and some denoising at `.info()` level. Adjust `Actor()`
instantiating input to the new `.registry_addrs` wrapped addrs property.
Mostly adjusting indentation, noise level, and clarity via `.pformat()`
tweaks more general use of `.devx.pformat.nest_from_op()`.
Specific impl deats,
- use `pformat.ppfmt()/`nest_from_op()` more seriously throughout
`._server`.
- add a `._server.Endpoint.pformat()`.
- add `._server.Server.len_peers()` and `.repr_state()`.
- polish `Server.pformat()`.
- drop some redundant `log.runtime()`s from `._serve_ipc_eps()` instead
leaving-them-only/putting-them in the caller pub meth.
- `._tcp.start_listener()` log the bound addr, not the input (which may
be the 0-port.
- only generate a repr in `.from_addr()` when log level is >= 'runtime'.
|_ add a todo about supporting this optimization more generally on our
adapter.
- fix `Channel.pformat()` to show unknown peer field line fmt correctly.
- add a `Channel.maddr: str` which just delegates directly to the
`._transport` like other pass-thru property fields.
To start trying out,
- using in the `Start`-msg handler-block to repr the msg coming
*from* a `repr(Channel)` using '<=)` sclang op.
- for a completed RPC task in `_invoke_non_context()`.
- for the msg loop task's termination report.
Such as the `MsgTransport.stream` and `.drain` attrs since they're
rarely that important at the chan level. Also start adopting
a `.<attr>=` style for actual attrs of the type versus a `<name>:
` style for meta-field info lines.
To simplify `.pformat()` output when the new `privates: bool` is unset
(the default) this adds new public attrs to wrap an actor's
cancellation status as well as provide a `.repr_state: str` (similar to
our equiv on `Context`). Rework `.pformat()` to render a much simplified
repr using all these new refinements.
Further, port the `.cancel()` method to use `.msg.types.Aid` for all
internal `requesting_uid` refs (now renamed with `_aid`) and in all
called downstream methods.
New cancel-state iface deats,
- rename `._cancel_called_by_remote` -> `._cancel_called_by` and expect
it to be set as an `Aid`.
- add `.cancel_complete: bool` which flags whether `.cancel()` ran to
completion.
- add `.cancel_called: bool` which just wraps `._cancel_called` (and
which likely will just be dropped since we already have
`._cancel_called_by`).
- add `.cancel_caller: Aid|None` which wraps `._cancel_called_by`.
In terms of using `Aid` in cancel methods,
- rename vars with `_aid` suffix in `.cancel()` (and wherever else).
- change `.cancel_rpc_tasks()` input param to `req_aid: msgtypes.Aid`.
- do the same for `._cancel_task()` and (for now until we adjust its
internals as well) use the `Aid.uid` remap property when assigning
`Context._canceller`.
- adjust all log msg refs to match obvi.
Providing the legacy `.uid -> tuple` style id (since still used for the
`Actor._contexts` table) and a `repr-one-line` method `.reprol() -> str`
for rendering a compact unique actor ID summary (useful in
logging/.pformat()s at the least).
Since it's for beg filtering, the current impl should be renamed anyway;
it's not just for filtering cancelled excs.
Deats,
- added a real doc string, links to official eg docs and fixed the
return typing.
- adjust all internal imports to match.
To resolve the recently added and failing
`test_remote_exc_relay::test_unmasked_remote_exc`: never allow
`trio.Cancelled` to mask an underlying user-code exception, ever.
Our first real-world (runtime internal) use case for the new
`.trionics.maybe_raise_from_masking_exc()` such that the failing
test now passes with an properly relayed remote RTE unmasking B)
Details,
- flip the `Context._scope_nursery` to the default strict-eg behaviour
and instead stack its outer scope with a `.trionics.collapse_eg()`.
- wrap the inner-most scope (after `msgops.maybe_limit_plds()`) with
a `maybe_raise_from_masking_exc()` to ensure user-code errors are
never masked by `trio.Cancelled`s.
Some err-reporting refinement,
- always capture any `scope_err` from the entire block for debug
purposes; report it in the `finally` block's log.
- always capture any suppressed `maybe_re`, output from
`ctx.maybe_raise()`, and `log.cancel()` report it.
To handle captured non-egs (when the now optional `tn` isn't provided)
as well as yield up a `BoxedMaybeException` which contains any detected
and un-masked `exc_ctx` as its `.value`.
Also add some additional tooling,
- a `raise_unmasked: bool` toggle for when the caller just wants to
report the masked exc and not raise-it-in-place of the masker.
- `extra_note: str` which by default is tuned to the default
`unmask_from = (trio.Cancelled,)` but which can be used to deliver
custom exception msg content.
- `always_warn_on: tuple[BaseException]` which will always emit
a warning log of what would have been the raised-in-place-of
`ctx_exc`'s msg for special cases where you want to report
a masking case that might not be otherwise noticed by the runtime
(cough like a `Cancelled` masking another `Cancelled) but which
you'd still like to warn the caller about.
- factor out the masked-`ext_ctx` predicate logic into
a `find_masked_excs()` and also use it for non-eg cases.
Still maybe todo?
- rewrapping multiple masked sub-excs in an eg back into an eg? left in
#TODOs and a pause-point where applicable.
Turns out there were some subtle internal bugs discovered by the just
added `tests/devx/test_tooling::test_crash_handler_cms` suite.
So this fixes,
- a mis-ordering around `rt_repl_fixture :=` in the logic of
`DebugStatus.maybe_enter_repl_fixture()`.
- `.devx.debug._post_mortem._post_mortem()` ensuring we now **always**
call `DebugStatus.release()`, and thus unwind the exist-stack managing
the `repl_fixture` exit/teardown, **even in the case** where
`yield False` is delivered from the user-fixture-fn (meaning
`dnter_repl=False`) thus triggering an early `return` (as is done in
the new test suite).
Opting for performance over broad multi-actor "debug-ability" from
sync-function-contexts when `debug_mode=True` is set;
IOW prefer no behind-the-scenes `greenlet` perf impact over being
able to use an actor-safe `breakpoint()` wherever as per,
https://greenback.readthedocs.io/en/latest/principle.html#performance
Adjust the breakpoint restore ex script to match.
Dropping the `_maybe_open_repl_fixture()` approach and instead using
a `DebugStatus._fixture_stack = ExitStack()` which provides for much
simpler support around both sync and async pausing APIs thanks to only
invoking `repl_fixture.__exit__()` on actual `PdbREPL` interaction being
complete!
Deats,
- all `repl_fixture` detection logic still happens in one place (the new
method) but we aren't limited to closing it via an immediate post REPL
`.__exit__()` call which instead is triggered by,
- `DebugStatus.release()` which now calls `._fixture_stack.close()` and
thus only invokes `repl_fixture.__exit__()` when user REPL-ing is
**actually complete** an arbitrary amount of debugging time later.
- include the notes for `@acm` support above the new method, though not
sure if they're as relevant any more?
Benefits,
- we can drop the previously added indent levels from
`_enter_repl_sync()` and `_post_mortem()`.
- now we automatically have support for the `.pause_from_sync()` API
since `_enter_repl_sync()` doesn't close the prior
`_maybe_open_repl_fixture()` immediately when `debug_func=None`; the
user's `__exit__()` is only ever called once `.release()` is.
Other,
- add big 'CASE' comments around the various blocks in
`.pause_from_sync()`, i was having trouble figuring out which i was
using from a `breakpoint()` in a dependent app..