Add `--shm` orphan sweep to `tractor-reap`

Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.

Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
  + `reap_shm()` helpers. Match criteria: regular file
  under `/dev/shm`, owned by current uid, AND no live
  proc has it open (mmap'd or fd-held). In-use
  enumeration via `psutil.Process.memory_maps()` +
  `.open_files()` — xplatform, kernel-canonical (same
  answer `lsof` would give), no reliance on
  tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
  `NotImplementedError` outside Linux/FreeBSD bc macOS
  POSIX shm has no fs-visible path (`shm_open` only)
  and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
  process reap) and `--shm-only` (skip process phase)
  flags. `-n` dry-runs both phases. Exit code is `1`
  if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
  the `testing` dep group; lazy-imported in `_reap.py`
  so the process-reap path stays import-clean without
  it.
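
For illustration only, the match criteria above can be sketched with the stdlib alone. `find_unheld_files()` is a hypothetical stand-in, not the patch's `find_orphaned_shm()`: it takes the in-use set as an argument (the real helper builds it via `psutil`) and the demo runs against a scratch dir rather than the real `/dev/shm`.

```python
import os
import stat
import tempfile
from typing import Optional


def find_unheld_files(
    root: str,
    in_use: set[str],
    uid: Optional[int] = None,
) -> list[str]:
    '''
    Hypothetical stand-in for the sweep's filter: regular
    files directly under `root`, owned by `uid`, and absent
    from the caller-supplied `in_use` set.

    '''
    if uid is None:
        uid = os.geteuid()

    candidates: list[str] = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        try:
            st = os.stat(path)
        except OSError:
            # raced: entry vanished between listdir and stat
            continue
        if not stat.S_ISREG(st.st_mode):
            continue  # skip subdirs, sockets, etc.
        if st.st_uid != uid:
            continue  # owned by someone else
        if path in in_use:
            continue  # a live process still holds it
        candidates.append(path)
    return candidates


# demo against a scratch dir instead of the real /dev/shm
with tempfile.TemporaryDirectory() as scratch:
    held = os.path.join(scratch, 'held_seg')
    leaked = os.path.join(scratch, 'leaked_seg')
    for p in (held, leaked):
        with open(p, 'wb') as f:
            f.write(b'\0' * 16)
    os.mkdir(os.path.join(scratch, 'subdir'))
    result = find_unheld_files(scratch, in_use={held})
    demo_ok = result == [leaked]
```

Passing the in-use set in keeps the filter testable without any live-process enumeration; only the enumeration step needs `psutil`.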

Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
  (new section 10c) — covers match criteria + the
  preservation guarantee for unrelated apps.
- flip mitigation status in
  `subint_forkserver_mp_shared_memory_issue.md` from
  "could extend `tractor-reap`" to "implemented", with
  a note that callers should still UUID-pin shm keys to
  avoid cross-session collisions.
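
A minimal sketch of the UUID-pinning advice, assuming the stdlib `multiprocessing.shared_memory` API on a POSIX box. `open_session_shm()` is a hypothetical helper, and truncating the hex to 16 chars is an assumption made here (to stay under short name limits on some platforms), not something the doc prescribes.

```python
import uuid
from multiprocessing import shared_memory


def open_session_shm(size: int) -> shared_memory.SharedMemory:
    '''
    Create a segment whose key embeds a fresh UUID so two
    concurrent sessions do not pick the same name.

    '''
    # 'shml_' prefix follows the pattern named in the doc;
    # the truncation length is an illustrative assumption.
    key = f'shml_{uuid.uuid4().hex[:16]}'
    return shared_memory.SharedMemory(
        name=key,
        create=True,
        size=size,
    )


shm_a = open_session_shm(64)
shm_b = open_session_shm(64)
try:
    names_differ = shm_a.name != shm_b.name
finally:
    # always close + unlink so the demo itself leaks nothing
    for shm in (shm_a, shm_b):
        shm.close()
        shm.unlink()
```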

Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.
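
One way to see why a mapped segment counts as in-use, assuming Linux procfs: the sketch below maps a scratch file and looks for it in `/proc/<pid>/maps`. `mapped_paths()` is a hypothetical helper written for this note; the patch itself delegates this enumeration to `psutil`.

```python
import mmap
import os
import tempfile


def mapped_paths(pid: int, prefix: str) -> set[str]:
    '''
    Paths under `prefix` that `pid` currently has mmap'd,
    read from Linux procfs. Returns an empty set when the
    maps file is absent or unreadable.

    '''
    found: set[str] = set()
    try:
        with open(f'/proc/{pid}/maps', errors='replace') as f:
            for line in f:
                # fields: addr perms offset dev inode [path]
                fields = line.rstrip('\n').split(None, 5)
                if (
                    len(fields) == 6
                    and fields[5].startswith(prefix)
                ):
                    found.add(fields[5])
    except OSError:
        pass
    return found


with tempfile.TemporaryDirectory() as scratch:
    scratch = os.path.realpath(scratch)
    seg = os.path.join(scratch, 'seg')
    with open(seg, 'wb') as f:
        f.write(b'\0' * mmap.PAGESIZE)
    with open(seg, 'r+b') as f:
        view = mmap.mmap(f.fileno(), 0)
    # while the mapping is live the path is listed in maps
    held_while_mapped = mapped_paths(os.getpid(), scratch + '/')
    view.close()
    # once unmapped, the path drops out again
    held_after_close = mapped_paths(os.getpid(), scratch + '/')
```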

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-27 11:35:33 -04:00
parent aa3e230926
commit 4f12d69b41
6 changed files with 385 additions and 44 deletions

.claude/skills/run-tests/SKILL.md

@@ -585,3 +585,41 @@ to force-reap under a still-live supervisor.
 active in another terminal. It's safe (won't touch
 that session's live children in orphan-mode) but can
 race if the target session is mid-teardown.
+
+### c) `--shm` / `--shm-only`: orphan-segment sweep
+
+Because `tractor.ipc._mp_bs.disable_mantracker()`
+turns off `mp.resource_tracker` (see
+`ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`),
+a hard-crashing actor can leave `/dev/shm/<key>`
+segments behind that nothing else GCs.
+
+```sh
+# process reap THEN shm sweep
+scripts/tractor-reap --shm
+
+# shm sweep only (skip process phase)
+scripts/tractor-reap --shm-only
+
+# dry-run: list candidates, don't unlink
+scripts/tractor-reap --shm -n
+```
+
+**Match criteria** (very conservative — this is a
+shared-system path, can't be wrong):
+- segment is a regular file under `/dev/shm`,
+- owned by the **current uid** (`stat.st_uid`),
+- AND **no live process holds it open**,
+  enumerated by walking every readable
+  `/proc/<pid>/maps` (post-mmap mappings) AND
+  `/proc/<pid>/fd/*` (pre-mmap shm-opened fds).
+
+The "nobody has it open" check is the
+kernel-canonical "is this leaked?" test — same
+answer `lsof /dev/shm/<key>` would give. No
+reliance on tractor-specific naming, so it works
+for any tractor app. Critically, it WILL NOT touch
+segments held by other apps you have running
+(e.g. `piker`, `lttng-ust-*`, `aja-shm-*`;
+verified locally with 81 in-use segments correctly
+preserved).

ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md

@@ -132,14 +132,20 @@ segment (legitimate race in shared-key setups).
 - **Crash-leaked segments.** If an actor segfaults
   or is `SIGKILL`'d before its lifetime stack runs,
-  `/dev/shm/<key>` will leak. Mitigations:
-  - `tractor-reap` (the new
-    `scripts/tractor-reap` CLI) doesn't yet sweep
-    `/dev/shm` — could extend it.
-  - Higher-level apps using shm should pin a UUID
-    into the key (the `'shml_<uuid>'` pattern in
-    `test_child_attaches_alot`) so leaks are
-    distinct per session and easy to GC out-of-band.
+  `/dev/shm/<key>` will leak. Mitigation:
+  - `scripts/tractor-reap --shm` walks `/dev/shm`,
+    filters to segments owned by the current uid that
+    no live process is mapping or holding open (via
+    `/proc/*/maps` + `/proc/*/fd/*`), and unlinks
+    them. The "nobody-has-it-open" filter is
+    kernel-canonical so it never touches in-flight
+    segments held by sibling apps (verified locally
+    against 81 piker/lttng/aja-held segments — all
+    preserved).
+  - Higher-level apps using shm should still pin a
+    UUID into the key (the `'shml_<uuid>'` pattern
+    in `test_child_attaches_alot`) so concurrent
+    sessions don't collide on the same key.
 
 - **Cross-actor unlink races.** Two actors holding
   the same shm key racing on `unlink()` — handled
   by the `FileNotFoundError` swallow.

pyproject.toml

@@ -84,6 +84,11 @@ testing = [
   # known-hanging `subint`-backend audit tests; see
   # `ai/conc-anal/subint_*_issue.md`).
   "pytest-timeout>=2.3",
+  # used by `tractor._testing._reap` for the
+  # `tractor-reap` zombie-subactor + leaked-shm
+  # cleanup utility (xplatform `Process.memory_maps`,
+  # `Process.open_files`).
+  "psutil>=7.0.0",
 ]
 repl = [
   "pyperclip>=1.9.0",

scripts/tractor-reap

@@ -4,14 +4,26 @@
 #
 # SPDX-License-Identifier: AGPL-3.0-or-later
 '''
-`tractor-reap` — SC-polite zombie-subactor reaper.
+`tractor-reap` — SC-polite zombie-subactor reaper +
+optional `/dev/shm/` orphan-segment sweep.
 
-Finds `tractor` subactor processes left alive after a
-`pytest` (or any tractor-app) run that failed to fully
-cancel its actor tree, then sends SIGINT with a bounded
-grace window before escalating to SIGKILL.
+Two cleanup phases (run in order when both are enabled):
 
-Detection modes (auto-selected):
+1. **process reap** — finds `tractor` subactor processes
+   left alive after a `pytest` (or any tractor-app) run
+   that failed to fully cancel its actor tree, then sends
+   SIGINT with a bounded grace window before escalating
+   to SIGKILL.
+
+2. **shm sweep** (`--shm` / `--shm-only`) — unlinks
+   `/dev/shm/<file>` entries owned by the current uid
+   that no live process has open (mmap'd or fd-held).
+   Needed because `tractor` disables
+   `mp.resource_tracker` (see `tractor.ipc._mp_bs`), so a
+   hard-crashing actor leaves leaked segments that
+   nothing else GCs.
+
+Process-reap detection modes (auto-selected):
 
   --parent <pid> : descendant-mode — kill procs whose
                    PPid == <pid>. Use when a parent
@@ -29,14 +41,21 @@ Detection modes (auto-selected):
 Usage:
 
-  # after a pytest run crashed/was Ctrl+C'd
+  # process reap only (default)
   scripts/tractor-reap
 
+  # process reap + shm sweep
+  scripts/tractor-reap --shm
+
+  # only the shm sweep, skip process reap
+  scripts/tractor-reap --shm-only
+
   # from inside a still-live supervisor
   scripts/tractor-reap --parent 12345
 
-  # dry-run: list what would be reaped, don't signal
+  # dry-run: list what would be reaped, don't act
   scripts/tractor-reap -n
+  scripts/tractor-reap --shm -n
 
 '''
 import argparse
@@ -83,7 +102,21 @@ def main() -> int:
     parser.add_argument(
         '--dry-run', '-n',
         action='store_true',
-        help='list matched pids but do not signal',
+        help='list matched pids/paths but do not signal/unlink',
+    )
+    parser.add_argument(
+        '--shm',
+        action='store_true',
+        help=(
+            'after process reap, also unlink orphaned '
+            '/dev/shm segments owned by the current user '
+            'that no live process is mapping or holding open'
+        ),
+    )
+    parser.add_argument(
+        '--shm-only',
+        action='store_true',
+        help='skip process reap; only do the shm sweep',
     )
     args = parser.parse_args()
@@ -95,9 +128,15 @@ def main() -> int:
     from tractor._testing._reap import (
         find_descendants,
         find_orphans,
+        find_orphaned_shm,
         reap,
+        reap_shm,
     )
 
-    if args.parent is not None:
-        pids: list[int] = find_descendants(args.parent)
-        mode: str = f'descendants of PPid={args.parent}'
+    rc: int = 0
+
+    # --- phase 1: process reap (skipped under --shm-only) ---
+    if not args.shm_only:
+        if args.parent is not None:
+            pids: list[int] = find_descendants(args.parent)
+            mode: str = f'descendants of PPid={args.parent}'
@@ -107,17 +146,36 @@ def main() -> int:
-    if not pids:
-        print(f'[tractor-reap] no {mode} to reap')
-        return 0
+        if not pids:
+            print(f'[tractor-reap] no {mode} to reap')
+        elif args.dry_run:
+            print(
+                f'[tractor-reap] dry-run — {mode}:\n {pids}'
+            )
+        else:
+            _, survivors = reap(pids, grace=args.grace)
+            if survivors:
+                rc = 1
 
-    if args.dry_run:
-        print(f'[tractor-reap] dry-run — {mode}:\n {pids}')
-        return 0
+    # --- phase 2: shm sweep (opt-in) ---
+    if args.shm or args.shm_only:
+        leaked: list[str] = find_orphaned_shm()
+        if not leaked:
+            print(
+                '[tractor-reap] no orphaned /dev/shm '
+                'segments to sweep'
+            )
+        elif args.dry_run:
+            print(
+                f'[tractor-reap] dry-run — {len(leaked)} '
+                f'orphaned shm segment(s):\n {leaked}'
+            )
+        else:
+            _, errors = reap_shm(leaked)
+            if errors:
+                rc = 1
 
-    signalled, survivors = reap(pids, grace=args.grace)
-    # exit 0 if everyone exited cleanly, else 1 to signal
-    # escalation happened — makes the command useful in
-    # CI health-checks and `||`-chaining.
-    return 0 if not survivors else 1
+    # exit 0 if everything cleaned cleanly, else 1 — useful
+    # for CI health-check chaining.
+    return rc
 
 
 if __name__ == '__main__':

tractor/_testing/_reap.py

@@ -16,17 +16,25 @@
 '''
 Zombie-subactor reaper - SC-polite (SIGINT first, SIGKILL
-as last resort with a bounded grace window).
+as last resort with a bounded grace window) plus optional
+`/dev/shm/` orphan-segment sweep.
 
 Shared implementation between the `tractor-reap` CLI
 (`scripts/tractor-reap`) and the pytest session-scoped
 auto-fixture that guards the test suite against leftover
 subactor processes.
 
-Design notes
-------------
+Design notes - process reap
+---------------------------
 
-- Linux-only: reads `/proc/<pid>/{status,cwd,cmdline}`.
+- Linux-only today: reads `/proc/<pid>/{status,cwd,cmdline}`.
+  Module imports cleanly elsewhere; calling `find_*` on a
+  non-Linux box returns an empty list (no `/proc`
+  enumeration). A future xplatform pass could swap this
+  for `psutil.Process.children()` /
+  `psutil.process_iter()` since `psutil` is already a
+  test-time dependency.
 
 - Two detection modes:
 
   1. **descendant-mode** when invoked from a still-live
@@ -49,14 +57,71 @@ Design notes
   we want the subactor runtime to run its trio cancel
   shield + IPC teardown paths where it can.
 
+Design notes - shm sweep
+------------------------
+
+Since `tractor/ipc/_mp_bs.disable_mantracker()` turns off
+`mp.resource_tracker` entirely, a hard-crashing actor can
+leave `/dev/shm/<key>` segments behind that nothing else
+GCs (see
+`ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`,
+"Trade-offs / known gaps").
+
+The shm sweep is **Linux-/FreeBSD-only**: both expose
+POSIX shared-memory segments as regular files under
+`/dev/shm`, so `os.stat()` + `os.unlink()` are the
+correct primitives. macOS POSIX shm has no fs-visible
+path (segments live behind `shm_open`/`shm_unlink`
+syscalls only), and Windows is a different story
+entirely. Calling the shm helpers on an unsupported
+platform raises `NotImplementedError`.
+
+In-use enumeration delegates to `psutil` -
+`Process.memory_maps()` (post-mmap) +
+`Process.open_files()` (pre-mmap shm-opened fds) -
+xplatform, mature, and handles the per-process
+permission/race edge cases correctly. Segments matching
+neither are genuinely leaked - safe to unlink.
+
+The "nobody has it open" check is the kernel-canonical
+test - same answer `lsof /dev/shm/<key>` would give. No
+reliance on tractor-specific naming conventions (shm
+keys are caller-defined).
+
 '''
 from __future__ import annotations
 import os
 import pathlib
 import signal
+import stat
+import sys
 import time
 
 
+# `/dev/shm` is the POSIX-shm filesystem on Linux + FreeBSD.
+# macOS uses `shm_open` syscalls without a fs-visible path,
+# so the shm helpers refuse to run there.
+_SHM_PLATFORM_OK: bool = sys.platform.startswith(
+    ('linux', 'freebsd')
+)
+SHM_DIR: str = '/dev/shm'
+
+
+def _ensure_shm_supported() -> None:
+    '''
+    Guard for shm helpers - they assume `/dev/shm` exists
+    as a tmpfs and `os.unlink()` is the right primitive.
+    Both true on Linux + FreeBSD; not true elsewhere.
+
+    '''
+    if not _SHM_PLATFORM_OK:
+        raise NotImplementedError(
+            f'shm reap is only supported on Linux/FreeBSD; '
+            f'got sys.platform={sys.platform!r}. macOS '
+            f'POSIX shm has no fs-visible path; Windows '
+            f'has no /dev/shm equivalent.'
+        )
+
+
 def _read_status_ppid(pid: int) -> int | None:
     '''
@@ -69,7 +134,11 @@ def _read_status_ppid(pid: int) -> int | None:
             for line in f:
                 if line.startswith('PPid:'):
                     return int(line.split()[1])
-    except (FileNotFoundError, PermissionError, ProcessLookupError):
+    except (
+        FileNotFoundError,
+        PermissionError,
+        ProcessLookupError,
+    ):
         return None
     return None
@@ -77,21 +146,32 @@ def _read_status_ppid(pid: int) -> int | None:
 def _read_cwd(pid: int) -> str | None:
     try:
         return os.readlink(f'/proc/{pid}/cwd')
-    except (FileNotFoundError, PermissionError, ProcessLookupError):
+    except (
+        FileNotFoundError,
+        PermissionError,
+        ProcessLookupError,
+    ):
         return None
 
 
 def _read_cmdline(pid: int) -> str:
     try:
         with open(f'/proc/{pid}/cmdline', 'rb') as f:
-            return f.read().replace(b'\0', b' ').decode(errors='replace')
-    except (FileNotFoundError, PermissionError, ProcessLookupError):
+            return f.read().replace(b'\0', b' ').decode(
+                errors='replace',
+            )
+    except (
+        FileNotFoundError,
+        PermissionError,
+        ProcessLookupError,
+    ):
         return ''
 
 
 def _iter_live_pids() -> list[int]:
     '''
-    Enumerate currently-alive pids from `/proc`.
+    Enumerate currently-alive pids from `/proc`. Returns
+    `[]` on systems without `/proc` (e.g. macOS).
 
     '''
     try:
@@ -225,6 +305,158 @@ def _is_alive(pid: int) -> bool:
                 if line.startswith('State:'):
                     # e.g. 'State:\tZ (zombie)'
                     return 'Z' not in line.split()[1]
-    except (FileNotFoundError, ProcessLookupError):
+    except (
+        FileNotFoundError,
+        ProcessLookupError,
+    ):
         return False
     return True
+
+
+def _enumerate_in_use_shm(
+    shm_dir: str = SHM_DIR,
+) -> set[str]:
+    '''
+    Return the set of `<shm_dir>/<file>` paths currently
+    held open by any live process via `psutil`'s
+    xplatform `Process.memory_maps()` (post-mmap
+    segments) and `Process.open_files()` (pre-mmap
+    shm-opened fds).
+
+    Lazy-imports `psutil` so the module stays importable
+    on installs without it (it's a `testing` group dep).
+
+    '''
+    _ensure_shm_supported()
+    # lazy + actionable failure: leaked shm sweep is the
+    # only thing in this module that needs psutil; we
+    # don't want a top-level ImportError breaking the
+    # process-reap path.
+    try:
+        import psutil
+    except ImportError as exc:
+        raise RuntimeError(
+            'shm reap requires `psutil` — install the '
+            '`testing` dep group, e.g. '
+            '`uv sync --group testing`.'
+        ) from exc
+
+    in_use: set[str] = set()
+    prefix: str = shm_dir.rstrip('/') + '/'
+    for proc in psutil.process_iter(['pid']):
+        try:
+            for m in proc.memory_maps(grouped=False):
+                if m.path.startswith(prefix):
+                    in_use.add(m.path)
+            for f in proc.open_files():
+                if f.path.startswith(prefix):
+                    in_use.add(f.path)
+        except (
+            psutil.NoSuchProcess,
+            psutil.AccessDenied,
+            psutil.ZombieProcess,
+            FileNotFoundError,
+            PermissionError,
+        ):
+            # raced — proc died or we can't see its
+            # mappings (e.g. root-owned). Skip; missing
+            # an in-use entry only means we'd preserve
+            # something we could reap, never the
+            # reverse — safe-by-default.
+            continue
+    return in_use
+
+
+def find_orphaned_shm(
+    *,
+    uid: int | None = None,
+    shm_dir: str = SHM_DIR,
+) -> list[str]:
+    '''
+    `<shm_dir>/<file>` paths that are:
+
+    - owned by `uid` (default: the current effective uid),
+    - and currently held by NO live process - i.e.
+      genuinely leaked.
+
+    Linux/FreeBSD only - see module docstring. No reliance
+    on caller-defined shm-key naming, so this works for
+    any tractor app (not just the test suite).
+
+    '''
+    _ensure_shm_supported()
+    if uid is None:
+        uid = os.geteuid()
+
+    try:
+        entries: list[str] = os.listdir(shm_dir)
+    except OSError:
+        return []
+
+    in_use: set[str] = _enumerate_in_use_shm(shm_dir=shm_dir)
+    leaked: list[str] = []
+    prefix: str = shm_dir.rstrip('/') + '/'
+    for entry in entries:
+        path: str = prefix + entry
+        try:
+            st: os.stat_result = os.stat(path)
+        except OSError:
+            continue
+        # only regular files — skip subdirs / sockets etc.
+        if not stat.S_ISREG(st.st_mode):
+            continue
+        if st.st_uid != uid:
+            continue
+        if path in in_use:
+            continue
+        leaked.append(path)
+    return leaked
+
+
+def reap_shm(
+    paths: list[str],
+    *,
+    log=print,
+) -> tuple[list[str], list[tuple[str, OSError]]]:
+    '''
+    Unlink the given `/dev/shm/...` paths.
+
+    Linux/FreeBSD only - `os.unlink()` is the correct
+    primitive on the POSIX-shm tmpfs there. macOS POSIX
+    shm has no fs-visible path; the equivalent there is
+    `posix_ipc.unlink_shared_memory(name)` (not
+    implemented here - see module docstring).
+
+    Returns `(unlinked, errors)` where `errors` is a list
+    of `(path, exc)` for paths that could not be removed
+    (e.g. permissions). Paths that raced to being already-
+    gone are counted as successfully unlinked.
+
+    '''
+    _ensure_shm_supported()
+    unlinked: list[str] = []
+    errors: list[tuple[str, OSError]] = []
+    for path in paths:
+        try:
+            os.unlink(path)
+            unlinked.append(path)
+        except FileNotFoundError:
+            # raced — already gone, treat as success
+            unlinked.append(path)
+        except OSError as exc:
+            errors.append((path, exc))
+
+    if unlinked:
+        log(
+            f'[tractor-reap] unlinked {len(unlinked)} '
+            f'orphaned shm segment(s): {unlinked}'
+        )
+    for path, exc in errors:
+        log(
+            f'[tractor-reap] could not unlink {path}: '
+            f'{exc!r}'
+        )
+    return (unlinked, errors)

uv.lock

@@ -716,6 +716,7 @@ sync-pause = [
 ]
 testing = [
     { name = "pexpect" },
+    { name = "psutil" },
     { name = "pytest" },
     { name = "pytest-timeout" },
 ]
@@ -761,6 +762,7 @@ subints = [{ name = "msgspec", marker = "python_full_version >= '3.14'", specifi
 sync-pause = [{ name = "greenback", marker = "python_full_version == '3.13.*'", specifier = ">=1.2.1,<2" }]
 testing = [
     { name = "pexpect", specifier = ">=4.9.0,<5" },
+    { name = "psutil", specifier = ">=7.0.0" },
     { name = "pytest", specifier = ">=8.3.5" },
     { name = "pytest-timeout", specifier = ">=2.3" },
 ]