tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	4f12d69b41	Add `--shm` orphan sweep to `tractor-reap` Since `tractor.ipc._mp_bs.disable_mantracker()` turns off `mp.resource_tracker` entirely (see the conc-anal doc `subint_forkserver_mp_shared_memory_issue.md`), a hard-crashing actor can leave `/dev/shm/<key>` segments that nothing else GCs. New `tractor-reap` phase 2 sweeps them. Deats, - `tractor/_testing/_reap.py`: add `find_orphaned_shm()` + `reap_shm()` helpers. Match criteria: regular file under `/dev/shm`, owned by current uid, AND no live proc has it open (mmap'd or fd-held). In-use enumeration via `psutil.Process.memory_maps()` + `.open_files()` — xplatform, kernel-canonical (same answer `lsof` would give), no reliance on tractor-specific shm-key naming. - `_ensure_shm_supported()` guard: helpers raise `NotImplementedError` outside Linux/FreeBSD bc macOS POSIX shm has no fs-visible path (`shm_open` only) and Windows is a different story. - `scripts/tractor-reap`: new `--shm` (run after process reap) and `--shm-only` (skip process phase) flags. `-n` dry-runs both phases. Exit code is `1` if either phase had survivors/errors. - `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to the `testing` dep group; lazy-imported in `_reap.py` so the process-reap path stays import-clean without it. Also, - doc `--shm` in `.claude/skills/run-tests/SKILL.md` (new section 10c) — covers match criteria + the preservation guarantee for unrelated apps. - flip mitigation status in `subint_forkserver_mp_shared_memory_issue.md` from "could extend `tractor-reap`" to "implemented", with a note that callers should still UUID-pin shm keys to avoid cross-session collisions. Verified locally vs 81 in-use segments held by `piker`, `lttng-ust-`, `aja-shm-` — all preserved; only the genuinely-orphaned tractor segments got unlinked. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-27 11:35:33 -04:00
Gud Boi	6d76b60404	Add `tractor-reap` CLI + document auto-reap New `scripts/tractor-reap` CLI wraps the `_testing._reap` mod for manual zombie-subactor cleanup after crashed pytest sessions. Two modes: - orphan-mode (default): finds PPid==1 procs with cwd matching repo root + `python` in cmdline. - descendant-mode (`--parent <pid>`): scoped sweep under a still-live supervisor. SC-polite: SIGINT with bounded grace window (default 3s) before escalating to SIGKILL. Exit code signals whether escalation was needed (useful for CI health-checks). Also, document both the auto-reap fixture and the CLI in `/run-tests` SKILL.md (section 10). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-26 18:04:40 -04:00
Gud Boi	4106ba73ea	Codify capture-pipe hang lesson in skills Encode the hard-won lesson from the forkserver cancel-cascade investigation into two skill docs so future sessions grep-find it before spelunking into trio internals. Deats, - `.claude/skills/conc-anal/SKILL.md`: - new "Unbounded waits in cleanup paths" section — rule: bound every `await X.wait()` in cleanup paths with `trio.move_on_after()` unless the setter is unconditionally reachable. Recent example: `ipc_server.wait_for_no_more_peers()` in `async_main`'s finally (was unbounded, deadlocked when any peer handler stuck) - new "The capture-pipe-fill hang pattern" section — mechanism, grep-pointers to the existing `conftest.py` guards (`tests/conftest.py:258`, `:316`), cross-ref to the full post-mortem doc, and the grep-note: "if a multi-subproc tractor test hangs, `pytest -s` first, conc-anal second" - `.claude/skills/run-tests/SKILL.md`: new "Section 9: The pytest-capture hang pattern (CHECK THIS FIRST)" with symptom / cause / pre-existing guards to grep / three-step debug recipe (try `-s`, lower loglevel, redirect stdout/stderr) / signature of this bug vs. a real code hang / historical reference Cost several investigation sessions before the capture-pipe issue surfaced — it was masked by deeper cascade deadlocks. Once the cascades were fixed, the tree tore down enough to generate pipe-filling log volume. Lesson: grep this pattern first when any multi-subproc tractor test hangs under default pytest but passes with `-s`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 23:22:40 -04:00
Gud Boi	70d58c4bd2	Use SIGINT-first ladder in `run-tests` cleanup The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test-cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	d093c31979	Add zombie-actor check to `run-tests` skill Fork-based backends (esp. `subint_forkserver`) can leak child actor processes on cancelled / SIGINT'd test runs; the zombies keep the tractor default registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`) bound, so every subsequent session can't bind and 50+ unrelated tests fail with the same `TooSlowError` / "address in use" signature. Document the pre-flight + post-cancel check as a mandatory step 4. Deats, - primary signal: `ss -tlnp \| grep ':1616'` for a bound TCP registry listener — the authoritative check since :1616 is unique to our runtime - `pgrep -af` scoped to `$(pwd)/py[0-9]/bin/python. _actor_child_main\|subint-forkserv` for leftover actor/forkserver procs — scoped deliberately so we don't false-flag legit long-running tractor-embedding apps like `piker` - `ls /tmp/registry@.sock` for stale UDS sockets - scoped cleanup recipe (SIGTERM + SIGKILL sweep using the same `$(pwd)/py` pattern, UDS `rm -f`, re-verify) plus a fallback for when a zombie holds :1616 but doesn't match the pattern: `ss -tlnp` → kill by PID - explicit false-positive warning calling out the `piker` case (`~/repos/piker/py*/bin/python3 -m tractor._child ...`) so a bare `pgrep` doesn't lead to nuking unrelated apps Goal: short-circuit the "spelunking into test code" rabbit-hole when the real cause is just a leaked PID from a prior session, without collateral damage to other tractor-embedding projects on the same box. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	b1a0753a3f	Expand `/run-tests` venv pre-flight to cover all cases Rework section 3 from a worktree-only check into a structured 3-step flow: detect active venv, interpret results (Case A: active, B: none, C: worktree), then run import + collection checks. Deats, - Case B prompts via `AskUserQuestion` when no venv is detected, offering `uv sync` or manual activate - add `uv run` fallback section for envs where venv activation isn't practical - new allowed-tools: `uv run python`, `uv run pytest`, `uv pip show`, `AskUserQuestion` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:36 -04:00
Gud Boi	ba86d482e3	Add `lastfailed` cache inspection to `/run-tests` skill New "Inspect last failures" section reads the pytest `lastfailed` cache JSON directly — instant, no collection overhead, and filters to `tests/`-prefixed entries to avoid stale junk paths. Also, - add `jq` tool permission for `.pytest_cache/` files (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:36 -04:00
Gud Boi	1f1e09a786	Move `test_discovery` to `tests/discovery/test_registrar` All tests are registrar-actor integration scenarios sharing intertwined helpers + `enable_modules=[__name__]` task fns, so keep as one mod but rename to reflect content. Now lives alongside `test_multiaddr.py` in the new `tests/discovery/` subpkg. Also, - update 5 refs in `/run-tests` SKILL.md to match the new path - add `discovery/` subdir to the test directory layout tree (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-14 19:54:14 -04:00
Gud Boi	6b04650187	Widen `allowed-tools` and dedup `settings.local` Expand `run-tests` skill `allowed-tools` to cover the documented pre-flight workflow: `git rev-parse` for worktree detection, `python --version`, and `UV_PROJECT_ENVIRONMENT=py* uv sync` for venv setup. Also dedup `gh api`/`gh pr` entries in `settings.local.json` and widen `py313` → `py*` so non-3.13 setups aren't blocked. Review: PR #440 (copilot-pull-request-reviewer) https://github.com/goodboy/tractor/pull/440 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-10 18:21:45 -04:00
Gud Boi	0286d36ed7	Add repo-local `claude` skills + settings + gitignore Add `/run-tests`, `/conc-anal` skill definitions and `/pr-msg` `format-reference.md` that live in-repo (not symlinked from `ai.skillz`). - `/run-tests`: `pytest` suite runner with dev-workflow helpers, never-auto-commit rule. - `/conc-anal`: concurrency analysis skill. - `/pr-msg` `format-reference.md`: canonical PR description structure + cross-service ref-links. - `ai_notes/docs_todos.md`: `literalinclude` idea. - `settings.local.json`: permission rules for `gh`, `git`, `python3`, `cat`, skill invocations. - `.gitignore`: ignore commit-msg/pr-msg `msgs/`, `LATEST` files, review ctx, session conf, claude worktrees. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-10 16:37:34 -04:00
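
The SIGINT-first ladder described in commits 70d58c4bd2 and 6d76b60404 (signal, bounded grace window, escalate only on timeout) can be sketched roughly as below. This is an illustration only, not the `tractor-reap` implementation: the function name `sigint_then_kill` is hypothetical, and it assumes the caller holds a `subprocess.Popen` handle, whereas the commits describe matching processes by cmdline or PID.

```python
import signal
import subprocess


def sigint_then_kill(
    proc: subprocess.Popen,
    grace: float = 3.0,
) -> bool:
    """
    Ask `proc` to exit via SIGINT and escalate to a hard kill only
    if it is still alive after `grace` seconds.

    Returns True when escalation was needed (hypothetical helper,
    not part of tractor).
    """
    if proc.poll() is not None:
        return False  # already exited, nothing to signal

    # Step 1: the polite request, which exercises graceful teardown.
    proc.send_signal(signal.SIGINT)
    try:
        # Step 2: bounded wait, returns early on a clean exit.
        proc.wait(timeout=grace)
        return False
    except subprocess.TimeoutExpired:
        # Step 3: graceful teardown did not finish in time.
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so no zombie entry lingers
        return True
```

A `True` return plays the same role as the non-zero exit code mentioned in the commit message: it flags that graceful teardown did not complete and is worth investigating.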

10 Commits (38ffb875bdcb8ecc10149143349a3fdf5b9a7f9e)
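
The `--shm` match criteria from commit 4f12d69b41 (regular file, owned by the current uid, not mmap'd or fd-held by any live process) might look something like the following. It is a minimal sketch under stated assumptions: `paths_in_use()` and `find_orphans()` are hypothetical names, not the `find_orphaned_shm()` / `reap_shm()` helpers in `tractor/_testing/_reap.py`, and the sketch only lists candidates without unlinking anything. The `psutil` import is kept lazy, as the commit message describes.

```python
import os
import stat
from pathlib import Path


def paths_in_use() -> set[str]:
    """Paths mmap'd or fd-held by any live process (needs `psutil`)."""
    import psutil  # lazy import: optional third-party dependency

    held: set[str] = set()
    for proc in psutil.process_iter():
        try:
            held.update(m.path for m in proc.memory_maps())
            held.update(f.path for f in proc.open_files())
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or is not ours to inspect
    return held


def find_orphans(shm_dir: Path, in_use: set[str]) -> list[Path]:
    """Regular files in `shm_dir` owned by us and held by nobody."""
    uid = os.getuid()
    orphans: list[Path] = []
    for entry in sorted(shm_dir.iterdir()):
        try:
            st = entry.lstat()
        except FileNotFoundError:
            continue  # removed while we were scanning
        if (
            stat.S_ISREG(st.st_mode)
            and st.st_uid == uid
            and str(entry) not in in_use
        ):
            orphans.append(entry)
    return orphans
```

On Linux a dry run would then be `find_orphans(Path("/dev/shm"), paths_in_use())`. Passing the in-use set as an argument keeps the filesystem scan usable without `psutil` installed.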