Duplicated channel on Actor._peers causes hang on portal.cancel_actor() #25

Merged
guille merged 1 commit from discovery_dedup into main 2025-04-13 20:53:24 +00:00
Collaborator

While working on py-leap's state history ws reader code, I was getting hangs on exit even after all contexts between actors were terminated. We tracked the source of the hang down to this line:

https://pikers.dev/goodboy/tractor/src/commit/fde681fa193d2ea76d65a223027eff18fae4f57f/tractor/_runtime.py#L1990

Further inspection revealed a duplicated channel reference on actor._peers.
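
Why a duplicate entry would stall shutdown is not spelled out here. One plausible mechanism, sketched below in plain Python, is teardown bookkeeping that drops one reference per disconnect and only proceeds once the peer table is empty. The table layout, the on_disconnect helper, and the ids are all hypothetical stand-ins, not tractor's actual code.

```python
from collections import defaultdict

# Hypothetical stand-in for a peer table: actor id -> list of open channels.
peers: defaultdict[tuple[str, str], list[str]] = defaultdict(list)
uid = ("parent", "uuid-1234")  # made-up actor id

peers[uid].append("chan-A")
peers[uid].append("chan-A")  # the accidental duplicate reference


def on_disconnect(uid: tuple[str, str], chan: str) -> bool:
    """Drop one reference to `chan`; return True once no peers remain."""
    chans = peers[uid]
    chans.remove(chan)  # list.remove() deletes only the first match
    if not chans:
        peers.pop(uid)
    return not peers


# The channel disconnects exactly once, but one reference is left behind,
# so anything waiting for "no more peers" would wait forever.
all_gone = on_disconnect(uid, "chan-A")
leftover = list(peers.get(uid, []))
```

Under these assumptions the single real disconnect can never empty the list, which matches the symptom of a hang on exit.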

The duplication came from this function, used in the new ringbuf pub-sub actor module on branch one_ring_to_rule_them_all:

@acm
async def open_pub_channel_at(
    actor_name: str,
    token: RBToken,
    cleanup: bool = True,
):
    async with (
        tractor.find_actor(actor_name) as portal,

        portal.open_context(
            _add_pub_channel,
            token=token
        ) as (ctx, _)
    ):
        ...

    try:
        yield

    except trio.Cancelled:
        log.exception(
            'open_pub_channel_at got cancelled!\n'
            f'\tactor_name = {actor_name}\n'
            f'\ttoken = {token}\n'
        )
        raise

    finally:
        if not cleanup:
            return

        async with tractor.find_actor(actor_name) as portal:
            if portal:
                async with portal.open_context(
                    _remove_pub_channel,
                    ring_name=token.shm_name
                ) as (ctx, _):
                    ...

In particular, the duplicate is introduced by the find_actor call, inside the get_peer_by_name helper:

https://pikers.dev/goodboy/tractor/src/commit/fde681fa193d2ea76d65a223027eff18fae4f57f/tractor/_discovery.py#L119

The solution is to not append the parent channel to the to_scan data structure.
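
A minimal sketch of how such an append can leak back into the live peer table, assuming to_scan is built as a shallow copy of actor._peers (an assumption; the description only names the append). The functions scan_buggy and scan_fixed and the ids are hypothetical, not the actual helper.

```python
from collections import defaultdict

PeerTable = defaultdict[tuple[str, str], list[str]]


def scan_buggy(peers: PeerTable, parent_uid: tuple[str, str], parent_chan: str) -> PeerTable:
    # A shallow copy duplicates the mapping but shares the inner lists,
    # so this append also lands in peers[parent_uid].
    to_scan = peers.copy()
    to_scan[parent_uid].append(parent_chan)
    return to_scan


def scan_fixed(peers: PeerTable, parent_uid: tuple[str, str], parent_chan: str) -> PeerTable:
    # Assumed shape of the fix: scan the copy without appending the parent channel.
    return peers.copy()


uid = ("parent", "uuid-1234")  # made-up actor id

buggy_peers: PeerTable = defaultdict(list)
buggy_peers[uid].append("parent-chan")
scan_buggy(buggy_peers, uid, "parent-chan")  # each lookup adds another reference

fixed_peers: PeerTable = defaultdict(list)
fixed_peers[uid].append("parent-chan")
scan_fixed(fixed_peers, uid, "parent-chan")  # live table is left untouched
```

With the shallow copy, every lookup that goes through the buggy path grows the shared list by one entry, while the fixed path leaves the table exactly as it was.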

guille added 1 commit 2025-04-13 17:14:49 +00:00
goodboy approved these changes 2025-04-13 18:00:35 +00:00
goodboy left a comment
Owner

yup! thanks for putting this up quick!

guille merged commit 4e8404bb09 into main 2025-04-13 20:53:24 +00:00
Reference: goodboy/tractor#25