From 489dc6d0ccb9c820803fb0078a04cf54189647ae Mon Sep 17 00:00:00 2001
From: goodboy
Date: Mon, 20 Apr 2026 16:04:19 -0400
Subject: [PATCH] Add prompt-io log for `subint` hang-class docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc
`subint` backend hang classes + arm `dump_on_hang`"). Substantive bc
the two new `ai/conc-anal/` docs were jointly authored — user framed
the two-class split + set candidate-fix ordering for the class-2
(Ctrl-C-able) hang; claude drafted the prose and the test-side
cross-linking comments.

`.raw.md` is in diff-ref mode — per-file pointers via
`git diff e92e3cd2~1..e92e3cd2 -- ` rather than re-embedding
content that already lives in `git log -p`.

Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
---
 .../20260420T192739Z_5e8cd8b2_prompt_io.md    | 111 ++++++++++
 ...20260420T192739Z_5e8cd8b2_prompt_io.raw.md | 198 ++++++++++++++++++
 2 files changed, 309 insertions(+)
 create mode 100644 ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md
 create mode 100644 ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.raw.md

diff --git a/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md b/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md
new file mode 100644
index 00000000..32478dd7
--- /dev/null
+++ b/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md
@@ -0,0 +1,111 @@
+---
+model: claude-opus-4-7[1m]
+service: claude
+session: subint-phase-b-hang-classification
+timestamp: 2026-04-20T19:27:39Z
+git_ref: HEAD (pre-commit; on branch `subint_spawner_backend`)
+scope: docs
+substantive: true
+raw_file: 20260420T192739Z_5e8cd8b2_prompt_io.raw.md
+---
+
+## Prompt
+
+Collab between user (`goodboy`) and `claude` to
+disambiguate two distinct hang modes hit during
+Phase B subint-spawn-backend bringup (issue #379).
+The user ran the failing suites, captured `strace`
+evidence on hung pytest pids, and set the framing:
+
+> "These cannot be the same bug. Different fix
+> paths. Write them up separately or we'll keep
+> conflating them."
+
+Follow-on asks:
+- Cross-link each doc from its reproducer test so
+  a future triager lands on the analysis without
+  needing to grep `ai/conc-anal/`.
+- On `test_stale_entry_is_deleted`: arm
+  `dump_on_hang(seconds=20, path=...)` so a
+  regression captures a stack dump (pytest stderr
+  capture otherwise eats `faulthandler` output).
+- Keep `test_stale_entry_is_deleted` un-`skip`ped
+  so the dump file remains inspectable.
+
+Defer the actual fix for the class-2 hang
+(Ctrl-C-able, tractor-side bug) to a later commit
+per the incremental-commits preference.
+
+## Response summary
+
+Wrote two new `ai/conc-anal/` classification docs
+and cross-linked them from the reproducer tests:
+
+1. **`subint_sigint_starvation_issue.md`** —
+   class 1: abandoned-legacy-subint thread +
+   shared GIL starves main trio loop →
+   signal-wakeup-fd pipe fills → SIGINT silently
+   dropped (`write() = EAGAIN`). Pytest process
+   un-Ctrl-C-able. Structurally a CPython limit;
+   blocked on `msgspec` PEP 684 support
+   (jcrist/msgspec#563). Reproducer:
+   `test_stale_entry_is_deleted[subint]`.
+
+2. **`subint_cancel_delivery_hang_issue.md`** —
+   class 2: parent-side trio task parks on an
+   orphaned IPC channel after subint teardown;
+   no clean EOF delivered to waiting receiver.
+   Ctrl-C-able (main trio loop iterating fine).
+   OUR bug to fix. Candidate fix: explicit
+   parent-side channel abort in `subint_proc`'s
+   hard-kill teardown. Reproducer:
+   `test_subint_non_checkpointing_child`.
+
+Test-side cross-links:
+- `tests/discovery/test_registrar.py`:
+  `test_stale_entry_is_deleted` → `trio.run(main)`
+  wrapped in `dump_on_hang(seconds=20,
+  path=)`; long inline comment
+  summarizes `strace` evidence + root-cause chain
+  and points at both docs.
+- `tests/test_subint_cancellation.py`:
+  `test_subint_non_checkpointing_child` docstring
+  extended with "KNOWN ISSUE (Ctrl-C-able hang)"
+  section pointing at the class-2 doc + noting
+  the class-1 doc is NOT what this test hits.
+
+## Files changed
+
+- `ai/conc-anal/subint_sigint_starvation_issue.md`
+  — new, 205 LOC
+- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
+  — new, 161 LOC
+- `tests/discovery/test_registrar.py` — +52/-1
+  (arm `dump_on_hang`, inline-comment cross-link)
+- `tests/test_subint_cancellation.py` — +26
+  (docstring "KNOWN ISSUE" block)
+
+## Human edits
+
+Substantive collab — prose was jointly iterated:
+
+- User framed the two-doc split, set the
+  classification criteria (Ctrl-C-able vs not),
+  and provided the `strace` evidence.
+- User decided to keep `test_stale_entry_is_deleted`
+  un-`skip`ped (my initial suggestion was
+  `pytestmark.skipif(spawn_backend=='subint')`).
+- User chose the candidate fix ordering for
+  class 2 and marked "explicit parent-side channel
+  abort" as the surgical preferred fix.
+- User picked the file naming convention
+  (`subint__issue.md`) over my initial
+  `hang_class_{1,2}.md`.
+- Assistant drafted the prose, aggregated prior-
+  session root-cause findings from Phase B.2/B.3
+  bringup, and wrote the test-side cross-linking
+  comments.
+
+No further mechanical edits expected before
+commit; user may still rewrap via
+`scripts/rewrap.py` if preferred.
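
The `dump_on_hang(seconds=20, path=...)` arming described in the log above is a test helper whose implementation is not part of this patch. As a rough illustration of the idea only, a watchdog with the same shape can be built on the standard library's `faulthandler`. The name `dump_on_hang_sketch` and its signature are assumptions made for this sketch, not the project's actual helper.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    """
    Hypothetical stand-in for the real helper: if the wrapped
    block runs longer than `seconds`, dump every thread's stack
    to the file at `path`.
    """
    # Write to a real file because pytest's stderr capture would
    # otherwise swallow the `faulthandler` output.
    with open(path, "w") as dump_file:
        faulthandler.dump_traceback_later(
            seconds,
            repeat=False,
            file=dump_file,
            exit=False,  # only dump; leave the process running
        )
        try:
            yield dump_file
        finally:
            # Disarm the watchdog once the block has returned,
            # before the dump file is closed.
            faulthandler.cancel_dump_traceback_later()
```

A test would wrap its blocking entry point, e.g. `with dump_on_hang_sketch(20, "/tmp/hang.dump"): trio.run(main)`, and the dump file stays empty unless the timeout actually fires.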
diff --git a/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.raw.md b/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.raw.md
new file mode 100644
index 00000000..9c70fa2c
--- /dev/null
+++ b/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.raw.md
@@ -0,0 +1,198 @@
+---
+model: claude-opus-4-7[1m]
+service: claude
+timestamp: 2026-04-20T19:27:39Z
+git_ref: HEAD (pre-commit; will land on branch `subint_spawner_backend`)
+diff_cmd: git diff HEAD~1..HEAD
+---
+
+Collab between `goodboy` (user) and `claude` (this
+assistant) spanning multiple test-run iterations on
+branch `subint_spawner_backend`. The user ran the
+failing suites, captured `strace` evidence on the
+hung pytest pids, and set the direction ("these are
+two different hangs — write them up separately so
+we don't re-confuse ourselves later"). The assistant
+aggregated prior-session findings (Phase B.2/B.3
+bringup) into two classification docs + test-side
+cross-links. All prose was jointly iterated; the
+user had final say on framing and decided which
+candidate fix directions to list.
+
+## Per-file generated content
+
+### `ai/conc-anal/subint_sigint_starvation_issue.md` (new, 205 LOC)
+
+> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md`
+
+Writes up the "abandoned-legacy-subint thread wedges
+the parent trio loop" class. Key sections:
+
+- **Symptom** — `test_stale_entry_is_deleted[subint]`
+  hangs indefinitely AND is un-Ctrl-C-able.
+- **Evidence** — annotated `strace` excerpt showing
+  SIGINT delivered to pytest, C-level signal handler
+  tries to write to the signal-wakeup-fd pipe, gets
+  `write() = -1 EAGAIN (Resource temporarily
+  unavailable)`. Pipe is full because main trio loop
+  isn't iterating often enough to drain it.
+- **Root-cause chain** — our hard-kill abandons the
+  `daemon=True` driver OS thread after
+  `_HARD_KILL_TIMEOUT`; the subint *inside* that
+  thread is still running `trio.run()`;
+  `_interpreters.destroy()` cannot force-stop a
+  running subint (raises `InterpreterError`); legacy
+  subints share the main GIL → abandoned subint
+  starves main trio loop → wakeup-fd fills → SIGINT
+  silently dropped.
+- **Why it's structurally a CPython limit** — no
+  public force-destroy primitive for a running
+  subint; the only escape is per-interpreter GIL
+  isolation, gated on msgspec PEP 684 adoption
+  (jcrist/msgspec#563).
+- **Current escape hatch** — harness-side SIGINT
+  loop in the `daemon` fixture teardown that kills
+  the bg registrar subproc, eventually unblocking
+  a parent-side recv enough for the main loop to
+  drain the wakeup pipe.
+
+### `ai/conc-anal/subint_cancel_delivery_hang_issue.md` (new, 161 LOC)
+
+> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md`
+
+Writes up the *sibling* hang class — same subint
+backend, distinct root cause:
+
+- **TL;DR** — Ctrl-C-able, so NOT the SIGINT-
+  starvation class; main trio loop iterates fine;
+  ours to fix.
+- **Symptom** — `test_subint_non_checkpointing_child`
+  hangs past the expected `_HARD_KILL_TIMEOUT`
+  budget even after the subint is torn down.
+- **Diagnosis** — a parent-side trio task (likely
+  a `chan.recv()` in `process_messages`) parks on
+  an orphaned IPC channel; channel was torn down
+  without emitting a clean EOF /
+  `BrokenResourceError` to the waiting receiver.
+- **Candidate fix directions** — listed in rough
+  order of preference:
+  1. Explicit parent-side channel abort in
+     `subint_proc`'s hard-kill teardown (surgical;
+     most likely).
+  2. Audit `process_messages` to add a timeout or
+     cancel-scope protection that catches the
+     orphaned-recv state.
+  3. Wrap subint IPC channel construction in a
+     sentinel that can force-close from the parent
+     side regardless of subint liveness.
+
+### `tests/discovery/test_registrar.py` (modified, +52/-1 LOC)
+
+> `git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py`
+
+Wraps the `trio.run(main)` call at the bottom of
+`test_stale_entry_is_deleted` in
+`dump_on_hang(seconds=20, path=)`.
+Adds a long inline comment that:
+- Enumerates variant-by-variant status
+  (`[trio]`/`[mp_*]` = clean; `[subint]` = hangs +
+  un-Ctrl-C-able)
+- Summarizes the `strace` evidence and root-cause
+  chain inline (so a future reader hitting this
+  test doesn't need to cross-ref the doc to
+  understand the hang shape)
+- Points at
+  `ai/conc-anal/subint_sigint_starvation_issue.md`
+  for full analysis
+- Cross-links to the *sibling*
+  `subint_cancel_delivery_hang_issue.md` so
+  readers can tell the two classes apart
+- Explains why it's kept un-`skip`ped: the dump
+  file is useful if the hang ever returns after
+  a refactor. pytest stderr capture would
+  otherwise eat `faulthandler` output, hence the
+  file path.
+
+### `tests/test_subint_cancellation.py` (modified, +26 LOC)
+
+> `git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py`
+
+Extends the docstring of
+`test_subint_non_checkpointing_child` with a
+"KNOWN ISSUE (Ctrl-C-able hang)" block:
+- Describes the current hang: parent-side orphaned
+  IPC recv after hard-kill; distinct from the
+  SIGINT-starvation sibling class.
+- Cites `strace` distinguishing signal: wakeup-fd
+  `write() = 1` (not `EAGAIN`) — i.e. main loop
+  iterating.
+- Points at
+  `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
+  for full analysis + candidate fix directions.
+- Clarifies that the *other* sibling doc
+  (SIGINT-starvation) is NOT what this test hits.
+
+## Non-code output
+
+### Classification reasoning (why two docs, not one)
+
+The user and I converged on the two-doc split after
+running the suites and noticing two *qualitatively
+different* hang symptoms:
+
+1. `test_stale_entry_is_deleted[subint]` — pytest
+   process un-Ctrl-C-able. Ctrl-C at the terminal
+   does nothing. Must kill-9 from another shell.
+2. `test_subint_non_checkpointing_child` — pytest
+   process Ctrl-C-able. One Ctrl-C at the prompt
+   unblocks cleanly and the test reports a hang
+   via pytest-timeout.
+
+From the user: "These cannot be the same bug.
+Different fix paths. Write them up separately or
+we'll keep conflating them."
+
+`strace` on the `[subint]` hang gave the decisive
+signal for the first class:
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
+```
+
+fd 5 is Python's signal-wakeup-fd pipe. `EAGAIN`
+on a `write()` of 1 byte to a pipe means the pipe
+buffer is full → reader side (main Python thread
+inside `trio.run()`) isn't consuming. That's the
+GIL-hostage signature.
+
+The second class's `strace` showed `write(5, "\2",
+1) = 1` — clean drain — so the main trio loop was
+iterating and the hang had to be on the application
+side of things, not the kernel-↔-Python signal
+boundary.
+
+### Why the candidate fix for class 2 is "explicit parent-side channel abort"
+
+The second hang class has the trio loop alive. A
+parked `chan.recv()` that will never get bytes is
+fundamentally a tractor-side resource-lifetime bug
+— the IPC channel was torn down (subint destroyed)
+but no one explicitly raised
+`BrokenResourceError` at the parent-side receiver.
+The `subint_proc` hard-kill path is the natural
+place to add that notification, because it already
+knows the subint is unreachable at that point.
+
+Alternative fix paths (blanket timeouts on
+`process_messages`, sentinel-wrapped channels) are
+less surgical and risk masking unrelated bugs —
+hence the preference ordering in the doc.
+
+### Why we're not just patching the code now
+
+The user explicitly deferred the fix to a later
+commit: "Document both classes now, land the fix
+for class 2 separately so the diff reviews clean."
+This matches the incremental-commits preference
+from memory.
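
The `EAGAIN` signature used in the classification reasoning (a 1-byte `write()` failing because nobody drains the pipe) can be reproduced in isolation. The following is a minimal sketch assuming a POSIX system; it uses a plain `os.pipe()` as a stand-in for the signal-wakeup fd, and the function name `fill_pipe_until_eagain` is hypothetical. It does not involve trio, subinterpreters, or real signal delivery.

```python
import errno
import os


def fill_pipe_until_eagain() -> tuple[int, int]:
    """
    Fill a non-blocking pipe that nobody reads from and return
    `(bytes_buffered, errno_of_the_failed_write)`.
    """
    read_fd, write_fd = os.pipe()
    # The wakeup fd must be non-blocking, so mirror that here.
    os.set_blocking(write_fd, False)
    buffered = 0
    try:
        while True:
            try:
                # Same 1-byte write shape as in the strace excerpt
                # (byte value 2 == SIGINT's signal number).
                buffered += os.write(write_fd, b"\x02")
            except BlockingIOError as err:
                # Buffer is full and the reader never consumed:
                # this is the condition reported as EAGAIN.
                return buffered, err.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    count, code = fill_pipe_until_eagain()
    print(f"buffered {count} bytes, then got {errno.errorcode[code]}")
```

The number of bytes accepted before the failure depends on the platform's pipe capacity, so the sketch reports it rather than assuming a value.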