Add `tractor.spawn._reap.unlink_uds_bind_addrs()`

Inside a new new `tractor.spawn._reap` submod which kicks off providing
post-mortem subactor cleanup primitives, parent-side; consider it the
"sibling" of `tractor._testing._reap` which is the test-harness-oriented
brother mod.

Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454
where `hard_kill()`'s `SIGKILL` bypasses the subactor's
`_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files..

This adds 2 cleanup paths:
- explicit `bind_addrs` (when set at spawn time),
OR
- convention-based reconstruction from `subactor.aid.name + proc.pid`
  for the random-self-assign case.

`.spawn.hard_kill()` now invokes the cleanup unconditionally
post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError`
skip.

Future work — authoritative tracking via a per-process
UDS bind-addr registry — documented in module docstring,
deferred to a follow-up PR.

Co-fix: `tractor/spawn/_trio.py::new_proc` already passes
`bind_addrs` + `subactor` to `hard_kill` via prior work
on this branch.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-05-04 10:51:01 -04:00
parent 5cf0312c78
commit e9712dcaeb
4 changed files with 243 additions and 11 deletions

View File

@ -231,7 +231,8 @@ def find_runaway_subactors(
`cpu_threshold` (default 95%) the smoking-gun
signature of a runaway tight-loop bug (e.g. a C-level
`recvfrom` loop on a closed socket that missed EOF
detection; #452-class issue).
detection see
`ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`).
`cpu_percent(interval=sample_interval)` is the
canonical psutil API for a "what %CPU is this proc
@ -737,7 +738,7 @@ def _track_orphaned_uds_per_test():
`wait_for_actor`/`find_actor` discovery probes can
accidentally hit (FileExistsError on rebind, or
epoll register on a half-closed peer-FIN'd fd → see
issue #452). Catching the leak the test that caused
issue #454). Catching the leak the test that caused
it (vs. blanket session-end sweep) makes blame
obvious + prevents cascade flakiness.
@ -813,10 +814,12 @@ def _detect_runaway_subactors_per_test():
**Does NOT kill the runaway** by design.
The point of this fixture is to make tight-loop bugs
(e.g. C-level `recvfrom` loop on a closed socket
that missed EOF detection issue #452-class) loudly
visible AT the test that triggers, while keeping
the live pid available for hands-on diagnosis. The
session-end `_reap_orphaned_subactors` fixture will
that missed EOF detection see
`ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`)
loudly visible AT the test that triggers, while
keeping the live pid available for hands-on
diagnosis. The session-end
`_reap_orphaned_subactors` fixture will
SIGINT-then-SIGKILL any survivors when the test
session completes normally; if the user Ctrl-C's
pytest mid-warning, the pid stays alive for as long
@ -892,8 +895,10 @@ def _detect_runaway_subactors_per_test():
# by THIS test that survived a normal teardown
# (i.e. parent's `hard_kill` SIGKILL didn't actually
# stop the runaway because it was in C tight-loop
# somewhere unreachable to signals — see issue #452
# forkserver-worker post-fork-close gap).
# somewhere unreachable to signals — see
# `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
# for the canonical fork-spawn forkserver-worker
# post-fork-close gap).
post_runaways: list[tuple[int, float, str]] = (
find_runaway_subactors(
parent_pid,

View File

@ -0,0 +1,183 @@
# tractor: structured concurrent "actors".
# Copyright 2018-eternity Tyler Goodlet.
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
'''
Post-mortem subactor cleanup primitives things the parent
runtime has to clean up because the dead-or-SIGKILL'd child
couldn't.
Sibling of `tractor._testing._reap` which is the test-harness
equivalent (orphan-pid + leaked-shm + leaked-UDS-sock sweeper
fixtures). This module is the spawn-layer counterpart, called
inline from `hard_kill` and the broader subactor reap path.
Today this is just `unlink_uds_bind_addrs()`. As future
post-mortem cleanup needs surface (e.g. `/dev/shm` segment
unlink for hard-crashed actors, leaked-pidfile cleanup), they
land here too.
Future-work TODO authoritative UDS bind-addr tracking
-------------------------------------------------------
`unlink_uds_bind_addrs()` currently has two cleanup paths:
1. Explicit `bind_addrs` (when parent set them at spawn time)
2. **Convention-based reconstruction**
`<XDG_RUNTIME_DIR>/tractor/<name>@<pid>.sock` for the
common case where the subactor self-assigned a random sock
via `UDSAddress.get_random()`.
Path (2) hardcodes the `<name>@<pid>.sock` convention from
`tractor.ipc._uds.UDSAddress`. If that convention ever
changes or the subactor binds to a non-default
`bindspace`/`filedir` we'll silently fail to unlink.
A more authoritative approach would be:
- Subactors register their bound UDS sockpaths in a
per-process registry inside `tractor.ipc._uds` at
`start_listener()` time.
- The subactor reports its bound sockpath(s) back to the
parent over IPC immediately post-bind (extension to
`SpawnSpec` reply / a new handshake msg).
- Parent caches the subactor's authoritative sockpaths.
- `unlink_uds_bind_addrs()` checks the cache FIRST, falls
back to convention-reconstruction if the subactor died
before reporting (which is the SIGKILL case this fn
primarily exists for).
Tracked as future work in #454 (the parent UDS-leak
issue this module addresses); a separate issue may be
filed if/when the registry impl is scoped.
See also #452 — the discovery-client `CLOSE_WAIT` TCP
fd leak. Different bug class but same broader theme of
"fork-spawn unmasked latent cleanup gaps".
'''
from __future__ import annotations
import os
from typing import TYPE_CHECKING
import trio
from tractor.discovery._addr import (
UnwrappedAddress,
wrap_address,
)
from tractor.ipc._uds import UDSAddress
from tractor.log import get_logger
if TYPE_CHECKING:
from tractor.runtime._runtime import Actor
log = get_logger('tractor')
def unlink_uds_bind_addrs(
proc: trio.Process,
*,
bind_addrs: list[UnwrappedAddress] | None = None,
subactor: Actor | None = None,
) -> None:
'''
Best-effort post-mortem cleanup of any UDS sock-files
a hard-killed subactor was bound to.
SIGKILL bypasses Python execution the subactor's
`_serve_ipc_eps` `finally:` block (which normally calls
`os.unlink(addr.sockpath)`) never runs. Without this
parent-side cleanup, the dead subactor's
`${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` file
accumulates on the filesystem (see issue #454 + the
autouse `_track_orphaned_uds_per_test` fixture).
Two cleanup paths, in order:
1. **Explicit `bind_addrs`** when the parent set the
subactor's bind addrs at spawn time, unlink each
UDS-flavored sockpath directly.
2. **Self-assigned reconstruction** when
`bind_addrs` is empty (the common case: subactor
picked its own random sock via
`UDSAddress.get_random()`), reconstruct the path
from `(subactor.aid.name, proc.pid)` using the
same `<name>@<pid>.sock` convention. We can do this
because the subactor uses its OWN `os.getpid()` at
bind time, which equals `proc.pid` from the
parent's view.
Idempotent: `FileNotFoundError` (graceful exit
already-unlinked, or sock never bound under early-
spawn cancel) is silenced; other `OSError`s log a
warning but never raise. TCP / non-UDS bind addrs are
skipped.
'''
sockpaths: list[str] = []
# path 1: explicit bind_addrs set at spawn time
for unwrapped in (bind_addrs or ()):
try:
addr = wrap_address(unwrapped)
except Exception:
log.exception(
f'Failed to wrap addr for UDS post-kill cleanup '
f'— skipping {unwrapped!r}\n'
)
continue
if isinstance(addr, UDSAddress):
sockpaths.append(str(addr.sockpath))
# path 2: reconstruct from subactor name + proc pid
# for the random-self-assign case (bind_addrs=None)
#
# TODO authoritative tracking — see module docstring.
if (
not sockpaths
and subactor is not None
and proc.pid is not None
):
sockname: str = f'{subactor.aid.name}@{proc.pid}.sock'
sockpath: str = str(
UDSAddress.def_bindspace / sockname
)
sockpaths.append(sockpath)
for sockpath in sockpaths:
try:
os.unlink(sockpath)
log.runtime(
f'Unlinked orphaned UDS sock-file post-SIGKILL\n'
f' |_{proc}\n'
f' |_{sockpath}\n'
)
except FileNotFoundError:
# raced — subactor cleaned up before SIGKILL,
# OR sockfile never bound (early-spawn cancel),
# OR transport wasn't UDS this run.
pass
except OSError as exc:
log.warning(
f'Failed to unlink subactor UDS sock-file '
f'post-SIGKILL\n'
f' |_{proc}\n'
f' |_{sockpath}\n'
f' |_{exc!r}\n'
)

View File

@ -40,7 +40,10 @@ from tractor.runtime._state import (
_runtime_vars,
)
from tractor.log import get_logger
from tractor.discovery._addr import UnwrappedAddress
from tractor.discovery._addr import (
UnwrappedAddress,
)
from ._reap import unlink_uds_bind_addrs
from tractor.runtime._portal import Portal
from tractor.runtime._runtime import Actor
from tractor.msg import types as msgtypes
@ -279,6 +282,16 @@ async def hard_kill(
# whilst also hacking on it XD
# terminate_after: int = 99999,
*,
# Subactor's bind addresses + subactor record, used
# for post-SIGKILL UDS sockpath cleanup. Optional for
# legacy callers; new call sites should pass at least
# `subactor` (which lets us reconstruct the sock path
# from `aid.name + proc.pid` when `bind_addrs` is
# empty/self-assigned). See `._reap.unlink_uds_bind_addrs()`.
bind_addrs: list[UnwrappedAddress] | None = None,
subactor: Actor | None = None,
) -> None:
'''
Un-gracefully terminate an OS level `trio.Process` after timeout.
@ -366,6 +379,21 @@ async def hard_kill(
)
proc.kill()
# Post-mortem UDS sockpath cleanup. SIGKILL bypassed
# the subactor's normal `os.unlink(addr.sockpath)` in
# `_serve_ipc_eps`'s `finally:`; the parent has the
# bind addrs (or can reconstruct from name + pid) so
# we do it here. Runs UNCONDITIONALLY (graceful-exit
# case is a no-op via `FileNotFoundError` skip in the
# helper) so the cleanup also covers the "cancelled
# during spawn" path where the subactor never reached
# its IPC server finally block.
unlink_uds_bind_addrs(
proc,
bind_addrs=bind_addrs,
subactor=subactor,
)
async def soft_kill(
proc: ProcessType,

View File

@ -39,7 +39,7 @@ from tractor.runtime._state import (
current_actor,
is_root_process,
debug_mode,
get_runtime_vars,
# get_runtime_vars,
)
from tractor.log import get_logger
from tractor.discovery._addr import UnwrappedAddress
@ -282,7 +282,23 @@ async def trio_proc(
if proc.poll() is None:
log.cancel(f"Attempting to hard kill {proc}")
await hard_kill(proc)
await hard_kill(
proc,
# NOTE, pass through so post-SIGKILL we
# can `os.unlink()` the subactor's
# orphaned UDS sock-file(s) — the
# subactor's own
# `_serve_ipc_eps`-`finally:` cleanup
# never runs under SIGKILL. `subactor`
# lets the helper reconstruct the
# sock path via `aid.name + proc.pid`
# when `bind_addrs` is the common
# self-assigned-random case
# (bind_addrs=None at spawn). See
# `_unlink_uds_bind_addrs()` in `_spawn`.
bind_addrs=bind_addrs,
subactor=subactor,
)
log.debug(f"Joined {proc}")
else: