Since `tractor.ipc._mp_bs.disable_mantracker()` turns off
`mp.resource_tracker` entirely (see the conc-anal doc
`subint_forkserver_mp_shared_memory_issue.md`), a
hard-crashing actor can leave `/dev/shm/<key>` segments
that nothing else GCs. New `tractor-reap` phase 2 sweeps
them.
Deats,
- `tractor/_testing/_reap.py`: add `find_orphaned_shm()`
+ `reap_shm()` helpers. Match criteria: regular file
under `/dev/shm`, owned by current uid, AND no live
proc has it open (mmap'd or fd-held). In-use
enumeration via `psutil.Process.memory_maps()` +
`.open_files()` — xplatform, kernel-canonical (same
answer `lsof` would give), no reliance on
tractor-specific shm-key naming.
- `_ensure_shm_supported()` guard: helpers raise
`NotImplementedError` outside Linux/FreeBSD bc macOS
POSIX shm has no fs-visible path (`shm_open` only)
and Windows is a different story.
- `scripts/tractor-reap`: new `--shm` (run after
process reap) and `--shm-only` (skip process phase)
flags. `-n` dry-runs both phases. Exit code is `1`
if either phase had survivors/errors.
- `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to
the `testing` dep group; lazy-imported in `_reap.py`
so the process-reap path stays import-clean without
it.
Also,
- doc `--shm` in `.claude/skills/run-tests/SKILL.md`
(new section 10c) — covers match criteria + the
preservation guarantee for unrelated apps.
- flip mitigation status in
`subint_forkserver_mp_shared_memory_issue.md` from
"could extend `tractor-reap`" to "implemented", with
a note that callers should still UUID-pin shm keys to
avoid cross-session collisions.
Verified locally vs 81 in-use segments held by `piker`,
`lttng-ust-*`, `aja-shm-*` — all preserved; only the
genuinely-orphaned tractor segments got unlinked.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:
- orphan-mode (default): finds PPid==1 procs
with cwd matching repo root + `python` in
cmdline.
- descendant-mode (`--parent <pid>`): scoped
sweep under a still-live supervisor.
SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).
Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.
Deats,
- `.claude/skills/conc-anal/SKILL.md`:
- new "Unbounded waits in cleanup paths"
section — rule: bound every `await X.wait()`
in cleanup paths with `trio.move_on_after()`
unless the setter is unconditionally
reachable. Recent example:
`ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (was unbounded,
deadlocked when any peer handler stuck)
- new "The capture-pipe-fill hang pattern"
section — mechanism, grep-pointers to the
existing `conftest.py` guards (`tests/conftest
.py:258`, `:316`), cross-ref to the full
post-mortem doc, and the grep-note: "if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
"Section 9: The pytest-capture hang pattern
(CHECK THIS FIRST)" with symptom / cause /
pre-existing guards to grep / three-step debug
recipe (try `-s`, lower loglevel, redirect
stdout/stderr) / signature of this bug vs. a
real code hang / historical reference
Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).
Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
`pgrep` to poll for exit. Loop breaks early on
clean exit
- step 3: `pkill -9` only if graceful timed out, w/
a logged escalation msg so it's obvious when SC
teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
`:1616`-holding zombie that doesn't match the
cmdline pattern (find PID via `ss -tlnp`, then
`kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
unchanged
Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Fork-based backends (esp. `subint_forkserver`) can
leak child actor processes on cancelled / SIGINT'd
test runs; the zombies keep the tractor default
registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`)
bound, so every subsequent session can't bind and
50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document
the pre-flight + post-cancel check as a mandatory
step 4.
Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a
bound TCP registry listener — the authoritative
check since :1616 is unique to our runtime
- `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*
_actor_child_main|subint-forkserv` for leftover
actor/forkserver procs — scoped deliberately so we
don't false-flag legit long-running tractor-
embedding apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep
using the same `$(pwd)/py*` pattern, UDS `rm -f`,
re-verify) plus a fallback for when a zombie holds
:1616 but doesn't match the pattern: `ss -tlnp` →
kill by PID
- explicit false-positive warning calling out the
`piker` case (`~/repos/piker/py*/bin/python3 -m
tractor._child ...`) so a bare `pgrep` doesn't lead
to nuking unrelated apps
Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID
from a prior session, without collateral damage to
other tractor-embedding projects on the same box.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rework section 3 from a worktree-only check into a
structured 3-step flow: detect active venv, interpret
results (Case A: active, B: none, C: worktree), then
run import + collection checks.
Deats,
- Case B prompts via `AskUserQuestion` when no venv
is detected, offering `uv sync` or manual activate
- add `uv run` fallback section for envs where venv
activation isn't practical
- new allowed-tools: `uv run python`, `uv run pytest`,
`uv pip show`, `AskUserQuestion`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New "Inspect last failures" section reads the pytest
`lastfailed` cache JSON directly — instant, no
collection overhead, and filters to `tests/`-prefixed
entries to avoid stale junk paths.
Also,
- add `jq` tool permission for `.pytest_cache/` files
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
All tests are registrar-actor integration scenarios
sharing intertwined helpers + `enable_modules=[__name__]`
task fns, so keep as one mod but rename to reflect
content. Now lives alongside `test_multiaddr.py` in
the new `tests/discovery/` subpkg.
Also,
- update 5 refs in `/run-tests` SKILL.md to match
the new path
- add `discovery/` subdir to the test directory
layout tree
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Expand `run-tests` skill `allowed-tools` to cover
the documented pre-flight workflow: `git rev-parse`
for worktree detection, `python --version`, and
`UV_PROJECT_ENVIRONMENT=py* uv sync` for venv
setup. Also dedup `gh api`/`gh pr` entries in
`settings.local.json` and widen `py313` → `py*`
so non-3.13 setups aren't blocked.
Review: PR #440 (copilot-pull-request-reviewer)
https://github.com/goodboy/tractor/pull/440
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code