tractor/ai/conc-anal/subint_sigint_starvation_is...

# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)

Follow-up to the Phase B subint spawn-backend PR (see
`tractor.spawn._subint`, issue #379). The hard-kill escape
hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
`daemon=True` driver-thread abandonment) handles *most*
stuck-subint scenarios cleanly, but there's one class of
hang that can't be fully escaped from within tractor: a
still-running abandoned sub-interpreter can starve the
**parent's** trio event loop to the point where **SIGINT is
effectively dropped by the kernel ↔ Python boundary** —
making the pytest process un-Ctrl-C-able.

## Symptom

Running `test_stale_entry_is_deleted[subint]` under
`--spawn-backend=subint`:

1. Test spawns a subactor (`transport_fails_actor`) which
   kills its own IPC server and then
   `trio.sleep_forever()`.
2. Parent tries `Portal.cancel_actor()` → channel
   disconnected → fast return.
3. Nursery teardown triggers our `subint_proc` cancel path.
   Portal-cancel fails (dead channel),
   `_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
   (`daemon=True`), `_interpreters.destroy(interp_id)`
   raises `InterpreterError` (because the subint is still
   running).
4. Test appears to hang indefinitely at the *outer*
   `async with tractor.open_nursery() as an:` exit.
5. `Ctrl-C` at the terminal does nothing. The pytest
   process is un-interruptable.

## Evidence

### `strace` on the hung pytest process

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140585542325792
```

Translated:

- Kernel delivers `SIGINT` to pytest.
- CPython's C-level signal handler fires and tries to
  write the signal number byte (`0x02` = SIGINT) to fd 37
  — the **Python signal-wakeup fd** (set via
  `signal.set_wakeup_fd()`, which trio uses to wake its
  event loop on signals).
- Write returns `EAGAIN` — **the pipe is full**. Nothing
  is draining it.
- `rt_sigreturn` with the signal masked off — signal is
  "handled" from the kernel's perspective but the actual
  Python-level handler (and therefore trio's
  `KeyboardInterrupt` delivery) never runs.

### Stack dump (via `tractor.devx.dump_on_hang`)

At 20s into the hang, only the **main thread** is visible:

```
Thread 0x...7fdca0191780 [python] (most recent call first):
  File ".../trio/_core/_io_epoll.py", line 245 in get_events
  File ".../trio/_core/_run.py", line 2415 in run
  File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
  ...
```

No driver thread shows up. The abandoned-legacy-subint
thread still exists from the OS's POV (it's still running
inside `_interpreters.exec()` driving the subint's
`trio.run()` on `trio.sleep_forever()`) but the **main
interp's faulthandler can't see threads currently executing
inside a sub-interpreter's tstate**. Concretely: the thread
is alive, holding state we can't introspect from here.

## Root cause analysis

The most consistent explanation for both observations:

1. **Legacy-config subinterpreters share the main GIL.**
   PEP 734's public `concurrent.interpreters.create()`
   defaults to `'isolated'` (per-interp GIL), but tractor
   uses `_interpreters.create('legacy')` as a workaround
   for C extensions that don't yet support PEP 684
   (notably `msgspec`, see
   [jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
   Legacy-mode subints share process-global state
   including the GIL.

2. **Our abandoned subint thread never exits.** After our
   hard-kill timeout, `driver_thread.join()` is abandoned
   via `abandon_on_cancel=True` and the thread is
   `daemon=True` so proc-exit won't block on it — but the
   thread *itself* is still alive inside
   `_interpreters.exec()`, driving a `trio.run()` that
   will never return (the subint actor is in
   `trio.sleep_forever()`).

3. **`_interpreters.destroy()` cannot force-stop a running
   subint.** It raises `InterpreterError` on any
   still-running subinterpreter; there is no public
   CPython API to force-destroy one.

4. **Shared-GIL + non-terminating subint thread → main
   trio loop starvation.** Under enough load (the subint's
   trio event loop iterating in the background, IPC-layer
   tasks still in the subint, etc.) the main trio event
   loop can fail to iterate frequently enough to drain its
   wakeup pipe. Once that pipe fills, `SIGINT` writes from
   the C signal handler return `EAGAIN` and signals are
   silently dropped — exactly what `strace` shows.

The shielded
`await actor_nursery._join_procs.wait()` at the top of
`subint_proc` (inherited unchanged from the `trio_proc`
pattern) is structurally involved too: if main trio *does*
get a schedule slice, it'd find the `subint_proc` task
parked on `_join_procs` under shield — which traps whatever
`Cancelled` arrives. But that's a second-order effect; the
signal-pipe-full condition is the primary "Ctrl-C doesn't
work" cause.

## Why we can't fix this from inside tractor

- **No force-destroy API.** CPython provides neither a
  `_interpreters.force_destroy()` nor a thread-
  cancellation primitive (`pthread_cancel` is actively
  discouraged and unavailable on Windows). A subint stuck
  in pure-Python loops (or worse, C code that doesn't poll
  for signals) is structurally unreachable from outside.
- **Shared GIL is the root scheduling issue.** As long as
  we're forced into legacy-mode subints for `msgspec`
  compatibility, the abandoned-thread scenario is
  fundamentally a process-global GIL-starvation window.
- **`signal.set_wakeup_fd()` is process-global.** Even if
  we wanted to put our own drainer on the wakeup pipe,
  only one party owns it at a time.

## Current workaround

- **Fixture-side SIGINT loop on the `daemon` subproc** (in
  this test's `daemon: subprocess.Popen` fixture in
  `tests/conftest.py`). The daemon dying closes its end of
  the registry IPC, which unblocks a pending recv in main
  trio's IPC-server task, which lets the event loop
  iterate, which drains the wakeup pipe, which finally
  delivers the test-harness SIGINT.
- **Module-level skip on py3.13**
  (`pytest.importorskip('concurrent.interpreters')`) — the
  private `_interpreters` C module exists on 3.13 but the
  multi-trio-task interaction hangs silently there
  independently of this issue.

## Path forward

1. **Primary**: upstream `msgspec` PEP 684 adoption
   ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
   Unlocks `concurrent.interpreters.create()` isolated
   mode → per-interp GIL → abandoned subint threads no
   longer starve the parent's main trio loop. At that
   point we can flip `_subint.py` back to the public API
   (`create()` / `Interpreter.exec()` / `Interpreter.close()`)
   and drop the private `_interpreters` path.

2. **Secondary**: watch CPython for a public
   force-destroy primitive. If something like
   `Interpreter.close(force=True)` lands, we can use it as
   a hard-kill final stage and actually tear down
   abandoned subints.

3. **Harness-level**: document the fixture-side SIGINT
   loop pattern as the "known workaround" for subint-
   backend tests that can leave background state holding
   the main event loop hostage.

## References

- PEP 734 (`concurrent.interpreters`):
  <https://peps.python.org/pep-0734/>
- PEP 684 (per-interpreter GIL):
  <https://peps.python.org/pep-0684/>
- `msgspec` PEP 684 tracker:
  <https://github.com/jcrist/msgspec/issues/563>
- CPython `_interpretersmodule.c` source:
  <https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c>
- `tractor.spawn._subint` module docstring (in-tree
  explanation of the legacy-mode choice and its
  tradeoffs).

## Reproducer

```
./py314/bin/python -m pytest \
  tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
  --spawn-backend=subint \
  --tb=short --no-header -v
```

Hangs indefinitely without the fixture-side SIGINT loop;
with the loop, the test completes (albeit with the
abandoned-thread warning in logs).

## Additional known-hanging tests (same class)

All three tests below exhibit the same
signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
on the wakeup pipe after enough SIGINT attempts) and
share the same structural cause — abandoned legacy-subint
driver threads contending with the main interpreter for
the shared GIL until the main trio loop can no longer
drain its wakeup pipe fast enough to deliver signals.

They're listed separately because each exposes the class
under a different load pattern worth documenting.

### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`

Original exemplar — see the **Symptom** and **Evidence**
sections above. One abandoned subint
(`transport_fails_actor`, stuck in `trio.sleep_forever()`
after self-cancelling its IPC server) is sufficient to
tip main into starvation once the harness's `daemon`
fixture subproc keeps its half of the registry IPC alive.

### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`

Cancel a grandchild that's in sync Python sleep from 2
nurseries up. The test's own docstring declares the
dependency: "its parent should issue a 'zombie reaper' to
hard kill it after sufficient timeout" — which for
`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
subproc. **Under `subint` there's no equivalent** (no
public CPython API to force-destroy a running
sub-interpreter), so the grandchild's sync-sleeping
`trio.run()` persists inside its abandoned driver thread
indefinitely. The nested actor-tree (parent → child →
grandchild, all subints) means a single cancel triggers
multiple concurrent hard-kill abandonments, each leaving
a live driver thread.

This test often only manifests the starvation under
**full-suite runs** rather than solo execution —
earlier-in-session subint tests also leave abandoned
driver threads behind, and the combined population is
what actually tips main trio into starvation. Solo runs
may stay Ctrl-C-able with fewer abandoned threads in the
mix.

### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`

Nursery-error-path throughput stress-test parametrized
for **25 concurrent subactors**. When the multierror
fires and the nursery cancels, every subactor goes
through our `subint_proc` teardown. The bounded
hard-kills run in parallel (all `subint_proc` tasks are
sibling trio tasks), so the timeout budget is ~3s total
rather than 3s × 25. After that, **25 abandoned
`daemon=True` driver threads are simultaneously alive** —
an extreme pressure multiplier on the same mechanism.

The `strace` fingerprint is striking under this load: six
or more **successful** `write(16, "\2", 1) = 1` calls
(main trio getting brief GIL slices, each long enough to
drain exactly one wakeup-pipe byte) before finally
saturating with `EAGAIN`:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = 1
rt_sigreturn({mask=[WINCH]})            = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = 1
rt_sigreturn({mask=[WINCH]})            = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = 1
rt_sigreturn({mask=[WINCH]})            = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = 1
rt_sigreturn({mask=[WINCH]})            = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = 1
rt_sigreturn({mask=[WINCH]})            = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = 1
rt_sigreturn({mask=[WINCH]})            = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]})            = 140141623162400
```

Those successful writes indicate CPython's
`sys.getswitchinterval()`-based GIL round-robin *is*
giving main brief slices — just never long enough to run
the Python-level signal handler through to the point
where trio converts the delivered SIGINT into a
`Cancelled` on the appropriate scope. Once the
accumulated write rate outpaces main's drain rate, the
pipe saturates and subsequent signals are silently
dropped.

The `pstree` below (pid `530060` = hung `pytest`) shows
the subint-driver thread population at the moment of
capture. Even with fewer than the full 25 shown (pstree
truncates thread names to `subint-driver[<interp_id>` —
interpreters `3` and `4` visible across 16 thread
entries), the GIL-contender count is more than enough to
explain the starvation:

```
>>> pstree -snapt 530060
systemd,1 --switched-root --system --deserialize=40
  └─login,1545 --
      └─bash,1872
          └─sway,2012
              └─alacritty,70471 -e xonsh
                  └─xonsh,70487 .../bin/xonsh
                      └─uv,70955 run xonsh
                          └─xonsh,70959 .../py314/bin/xonsh
                              └─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
                                  ├─{subint-driver[3},531857
                                  ├─{subint-driver[3},531860
                                  ├─{subint-driver[3},531862
                                  ├─{subint-driver[3},531866
                                  ├─{subint-driver[3},531877
                                  ├─{subint-driver[3},531882
                                  ├─{subint-driver[3},531884
                                  ├─{subint-driver[3},531945
                                  ├─{subint-driver[3},531950
                                  ├─{subint-driver[3},531952
                                  ├─{subint-driver[4},531956
                                  ├─{subint-driver[4},531959
                                  ├─{subint-driver[4},531961
                                  ├─{subint-driver[4},531965
                                  ├─{subint-driver[4},531968
                                  └─{subint-driver[4},531979
```

(`pstree` uses `{...}` to denote threads rather than
processes — these are all the **driver OS-threads** our
`subint_proc` creates with name
`f'subint-driver[{interp_id}]'`. Every one of them is
still alive, executing `_interpreters.exec()` inside a
sub-interpreter our hard-kill has abandoned. At 16+
abandoned driver threads competing for the main GIL, the
main-interpreter trio loop gets starved and signal
delivery stalls.)
-												Doc `subint` backend hang classes + arm `dump_on_hang`

Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
  + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
  → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)

- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown — no clean EOF delivered
  to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
  to fix. Candidate fix: explicit parent-side channel abort in
  `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un- skipped so the dump file is
  inspectable

- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

											
										
										
											2026-04-20 19:28:00 +00:00
+								# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)
 								Follow-up to the Phase B subint spawn-backend PR (see
 								`tractor.spawn._subint`, issue #379). The hard-kill escape
 								hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
 								`daemon=True` driver-thread abandonment) handles *most*
 								stuck-subint scenarios cleanly, but there's one class of
 								hang that can't be fully escaped from within tractor: a
 								still-running abandoned sub-interpreter can starve the
 								**parent's** trio event loop to the point where **SIGINT is
 								effectively dropped by the kernel ↔ Python boundary** —
 								making the pytest process un-Ctrl-C-able.
 								## Symptom
 								Running `test_stale_entry_is_deleted[subint]` under
 								`--spawn-backend=subint`:
 . Test spawns a subactor (`transport_fails_actor`) which
 								   kills its own IPC server and then
 								   `trio.sleep_forever()`.
 . Parent tries `Portal.cancel_actor()` → channel
 								   disconnected → fast return.
 . Nursery teardown triggers our `subint_proc` cancel path.
 								   Portal-cancel fails (dead channel),
 								   `_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
 								   (`daemon=True`), `_interpreters.destroy(interp_id)`
 								   raises `InterpreterError` (because the subint is still
 								   running).
 . Test appears to hang indefinitely at the *outer*
 								   `async with tractor.open_nursery() as an:` exit.
 . `Ctrl-C` at the terminal does nothing. The pytest
 								   process is un-interruptable.
 								## Evidence
 								### `strace` on the hung pytest process
 								```
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
 								rt_sigreturn({mask=[WINCH]}) = 140585542325792
 								```
 								Translated:
 								- Kernel delivers `SIGINT` to pytest.
 								- CPython's C-level signal handler fires and tries to
 								  write the signal number byte (`0x02` = SIGINT) to fd 37
 								  — the **Python signal-wakeup fd** (set via
 								  `signal.set_wakeup_fd()`, which trio uses to wake its
 								  event loop on signals).
 								- Write returns `EAGAIN` — **the pipe is full**. Nothing
 								  is draining it.
 								- `rt_sigreturn` with the signal masked off — signal is
 								  "handled" from the kernel's perspective but the actual
 								  Python-level handler (and therefore trio's
 								  `KeyboardInterrupt` delivery) never runs.
 								### Stack dump (via `tractor.devx.dump_on_hang`)
 								At 20s into the hang, only the **main thread** is visible:
 								```
 								Thread 0x...7fdca0191780 [python] (most recent call first):
 								  File ".../trio/_core/_io_epoll.py", line 245 in get_events
 								  File ".../trio/_core/_run.py", line 2415 in run
 								  File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
 								  ...
 								```
 								No driver thread shows up. The abandoned-legacy-subint
 								thread still exists from the OS's POV (it's still running
 								inside `_interpreters.exec()` driving the subint's
 								`trio.run()` on `trio.sleep_forever()`) but the **main
 								interp's faulthandler can't see threads currently executing
 								inside a sub-interpreter's tstate**. Concretely: the thread
 								is alive, holding state we can't introspect from here.
 								## Root cause analysis
 								The most consistent explanation for both observations:
 . **Legacy-config subinterpreters share the main GIL.**
 								   PEP 734's public `concurrent.interpreters.create()`
 								   defaults to `'isolated'` (per-interp GIL), but tractor
 								   uses `_interpreters.create('legacy')` as a workaround
 								   for C extensions that don't yet support PEP 684
 								   (notably `msgspec`, see
 								   [jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
 								   Legacy-mode subints share process-global state
 								   including the GIL.
 . **Our abandoned subint thread never exits.** After our
 								   hard-kill timeout, `driver_thread.join()` is abandoned
 								   via `abandon_on_cancel=True` and the thread is
 								   `daemon=True` so proc-exit won't block on it — but the
 								   thread *itself* is still alive inside
 								   `_interpreters.exec()`, driving a `trio.run()` that
 								   will never return (the subint actor is in
 								   `trio.sleep_forever()`).
 . **`_interpreters.destroy()` cannot force-stop a running
 								   subint.** It raises `InterpreterError` on any
 								   still-running subinterpreter; there is no public
 								   CPython API to force-destroy one.
 . **Shared-GIL + non-terminating subint thread → main
 								   trio loop starvation.** Under enough load (the subint's
 								   trio event loop iterating in the background, IPC-layer
 								   tasks still in the subint, etc.) the main trio event
 								   loop can fail to iterate frequently enough to drain its
 								   wakeup pipe. Once that pipe fills, `SIGINT` writes from
 								   the C signal handler return `EAGAIN` and signals are
 								   silently dropped — exactly what `strace` shows.
 								The shielded
 								`await actor_nursery._join_procs.wait()` at the top of
 								`subint_proc` (inherited unchanged from the `trio_proc`
 								pattern) is structurally involved too: if main trio *does*
 								get a schedule slice, it'd find the `subint_proc` task
 								parked on `_join_procs` under shield — which traps whatever
 								`Cancelled` arrives. But that's a second-order effect; the
 								signal-pipe-full condition is the primary "Ctrl-C doesn't
 								work" cause.
 								## Why we can't fix this from inside tractor
 								- **No force-destroy API.** CPython provides neither a
 								  `_interpreters.force_destroy()` nor a thread-
 								  cancellation primitive (`pthread_cancel` is actively
 								  discouraged and unavailable on Windows). A subint stuck
 								  in pure-Python loops (or worse, C code that doesn't poll
 								  for signals) is structurally unreachable from outside.
 								- **Shared GIL is the root scheduling issue.** As long as
 								  we're forced into legacy-mode subints for `msgspec`
 								  compatibility, the abandoned-thread scenario is
 								  fundamentally a process-global GIL-starvation window.
 								- **`signal.set_wakeup_fd()` is process-global.** Even if
 								  we wanted to put our own drainer on the wakeup pipe,
 								  only one party owns it at a time.
 								## Current workaround
 								- **Fixture-side SIGINT loop on the `daemon` subproc** (in
 								  this test's `daemon: subprocess.Popen` fixture in
 								  `tests/conftest.py`). The daemon dying closes its end of
 								  the registry IPC, which unblocks a pending recv in main
 								  trio's IPC-server task, which lets the event loop
 								  iterate, which drains the wakeup pipe, which finally
 								  delivers the test-harness SIGINT.
 								- **Module-level skip on py3.13**
 								  (`pytest.importorskip('concurrent.interpreters')`) — the
 								  private `_interpreters` C module exists on 3.13 but the
 								  multi-trio-task interaction hangs silently there
 								  independently of this issue.
 								## Path forward
 . **Primary**: upstream `msgspec` PEP 684 adoption
 								   ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
 								   Unlocks `concurrent.interpreters.create()` isolated
 								   mode → per-interp GIL → abandoned subint threads no
 								   longer starve the parent's main trio loop. At that
 								   point we can flip `_subint.py` back to the public API
 								   (`create()` / `Interpreter.exec()` / `Interpreter.close()`)
 								   and drop the private `_interpreters` path.
 . **Secondary**: watch CPython for a public
 								   force-destroy primitive. If something like
 								   `Interpreter.close(force=True)` lands, we can use it as
 								   a hard-kill final stage and actually tear down
 								   abandoned subints.
 . **Harness-level**: document the fixture-side SIGINT
 								   loop pattern as the "known workaround" for subint-
 								   backend tests that can leave background state holding
 								   the main event loop hostage.
 								## References
 								- PEP 734 (`concurrent.interpreters`):
 								  <https://peps.python.org/pep-0734/>
 								- PEP 684 (per-interpreter GIL):
 								  <https://peps.python.org/pep-0684/>
 								- `msgspec` PEP 684 tracker:
 								  <https://github.com/jcrist/msgspec/issues/563>
 								- CPython `_interpretersmodule.c` source:
 								  <https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c>
 								- `tractor.spawn._subint` module docstring (in-tree
 								  explanation of the legacy-mode choice and its
 								  tradeoffs).
 								## Reproducer
 								```
 								./py314/bin/python -m pytest \
 								  tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
 								  --spawn-backend=subint \
 								  --tb=short --no-header -v
 								```
 								Hangs indefinitely without the fixture-side SIGINT loop;
 								with the loop, the test completes (albeit with the
 								abandoned-thread warning in logs).
-												Expand `subint` sigint-starvation hang catalog

Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.

Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
  actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
  reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
  the grandchild persists in its abandoned driver thread. Often only
  manifests under full-suite runs (earlier tests seed the
  abandoned-thread pool).

- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
  all go through teardown on the multierror. Bounded hard-kills run in
  parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
  abandoned driver threads simultaneously alive, an extreme pressure
  multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
  (GIL round-robin IS giving main brief slices) before finally
  saturating with `EAGAIN`.

Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

											
										
										
											2026-04-21 21:42:37 +00:00
 								## Additional known-hanging tests (same class)
 								All three tests below exhibit the same
 								signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
 								on the wakeup pipe after enough SIGINT attempts) and
 								share the same structural cause — abandoned legacy-subint
 								driver threads contending with the main interpreter for
 								the shared GIL until the main trio loop can no longer
 								drain its wakeup pipe fast enough to deliver signals.
 								They're listed separately because each exposes the class
 								under a different load pattern worth documenting.
 								### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`
 								Original exemplar — see the **Symptom** and **Evidence**
 								sections above. One abandoned subint
 								(`transport_fails_actor`, stuck in `trio.sleep_forever()`
 								after self-cancelling its IPC server) is sufficient to
 								tip main into starvation once the harness's `daemon`
 								fixture subproc keeps its half of the registry IPC alive.
 								### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`
 								Cancel a grandchild that's in sync Python sleep from 2
 								nurseries up. The test's own docstring declares the
 								dependency: "its parent should issue a 'zombie reaper' to
 								hard kill it after sufficient timeout" — which for
 								`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
 								subproc. **Under `subint` there's no equivalent** (no
 								public CPython API to force-destroy a running
 								sub-interpreter), so the grandchild's sync-sleeping
 								`trio.run()` persists inside its abandoned driver thread
 								indefinitely. The nested actor-tree (parent → child →
 								grandchild, all subints) means a single cancel triggers
 								multiple concurrent hard-kill abandonments, each leaving
 								a live driver thread.
 								This test often only manifests the starvation under
 								**full-suite runs** rather than solo execution —
 								earlier-in-session subint tests also leave abandoned
 								driver threads behind, and the combined population is
 								what actually tips main trio into starvation. Solo runs
 								may stay Ctrl-C-able with fewer abandoned threads in the
 								mix.
 								### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`
 								Nursery-error-path throughput stress-test parametrized
 								for **25 concurrent subactors**. When the multierror
 								fires and the nursery cancels, every subactor goes
 								through our `subint_proc` teardown. The bounded
 								hard-kills run in parallel (all `subint_proc` tasks are
 								sibling trio tasks), so the timeout budget is ~3s total
 								rather than 3s × 25. After that, **25 abandoned
 								`daemon=True` driver threads are simultaneously alive** —
 								an extreme pressure multiplier on the same mechanism.
 								The `strace` fingerprint is striking under this load: six
 								or more **successful** `write(16, "\2", 1) = 1` calls
 								(main trio getting brief GIL slices, each long enough to
 								drain exactly one wakeup-pipe byte) before finally
 								saturating with `EAGAIN`:
 								```
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = 1
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = 1
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = 1
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = 1
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = 1
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = 1
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 								write(16, "\2", 1)                      = -1 EAGAIN (Resource temporarily unavailable)
 								rt_sigreturn({mask=[WINCH]})            = 140141623162400
 								```
 								Those successful writes indicate CPython's
 								`sys.getswitchinterval()`-based GIL round-robin *is*
 								giving main brief slices — just never long enough to run
 								the Python-level signal handler through to the point
 								where trio converts the delivered SIGINT into a
 								`Cancelled` on the appropriate scope. Once the
 								accumulated write rate outpaces main's drain rate, the
 								pipe saturates and subsequent signals are silently
 								dropped.
 								The `pstree` below (pid `530060` = hung `pytest`) shows
 								the subint-driver thread population at the moment of
 								capture. Even with fewer than the full 25 shown (pstree
 								truncates thread names to `subint-driver[<interp_id>` —
 								interpreters `3` and `4` visible across 16 thread
 								entries), the GIL-contender count is more than enough to
 								explain the starvation:
 								```
 								>>> pstree -snapt 530060
 								systemd,1 --switched-root --system --deserialize=40
 								  └─login,1545 --
 								      └─bash,1872
 								          └─sway,2012
 								              └─alacritty,70471 -e xonsh
 								                  └─xonsh,70487 .../bin/xonsh
 								                      └─uv,70955 run xonsh
 								                          └─xonsh,70959 .../py314/bin/xonsh
 								                              └─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
 								                                  ├─{subint-driver[3},531857
 								                                  ├─{subint-driver[3},531860
 								                                  ├─{subint-driver[3},531862
 								                                  ├─{subint-driver[3},531866
 								                                  ├─{subint-driver[3},531877
 								                                  ├─{subint-driver[3},531882
 								                                  ├─{subint-driver[3},531884
 								                                  ├─{subint-driver[3},531945
 								                                  ├─{subint-driver[3},531950
 								                                  ├─{subint-driver[3},531952
 								                                  ├─{subint-driver[4},531956
 								                                  ├─{subint-driver[4},531959
 								                                  ├─{subint-driver[4},531961
 								                                  ├─{subint-driver[4},531965
 								                                  ├─{subint-driver[4},531968
 								                                  └─{subint-driver[4},531979
 								```
 								(`pstree` uses `{...}` to denote threads rather than
 								processes — these are all the **driver OS-threads** our
 								`subint_proc` creates with name
 								`f'subint-driver[{interp_id}]'`. Every one of them is
 								still alive, executing `_interpreters.exec()` inside a
 								sub-interpreter our hard-kill has abandoned. At 16+
 								abandoned driver threads competing for the main GIL, the
 								main-interpreter trio loop gets starved and signal
 								delivery stalls.)