Add posix-multithreaded-`fork()` explainer doc

Add todo for running `test_debugger` suite on forkserver spawner
Route `stackscope` SIGUSR1 onto trio loop
2026-04-29 12:50:23 -04:00 · 2026-04-29 12:49:36 -04:00 · 2026-04-29 12:01:03 -04:00
3 changed files with 385 additions and 18 deletions
--- a/ai/conc-anal/fork_thread_semantics_execution_vs_memory.md
+++ b/ai/conc-anal/fork_thread_semantics_execution_vs_memory.md
@ -0,0 +1,281 @@
 # `fork()` in a multi-threaded program — execution-side vs. memory-side of the same coin
 A reference doc for readers who've encountered one of two
 opposite-sounding framings of POSIX `fork()` semantics in a
 multi-threaded program and are confused by the other.
 This is a sibling to
 `subint_fork_blocked_by_cpython_post_fork_issue.md` — that
 doc covers a CPython-level refusal of fork-from-subint;
 this one covers the more general POSIX layer, since
 tractor's main-thread forkserver design rests on it.
 ## TL;DR
 POSIX `fork()` only preserves the *calling* thread as a
 runnable thread in the child — every other thread in the
 parent simply never executes another instruction in the
 child. trio's docs call this "leaked"; tractor's
 `_main_thread_forkserver.py` docstring calls it "gone".
 Both are correct: "gone" is the *execution* side (no
 scheduler entry, no instructions retired), "leaked" is the
 *memory* side (the dead threads' stacks and per-thread
 heap structures still ride into the child's address space
 as orphaned COW pages with no owner and no cleanup hook).
 Same POSIX reality, two halves of the same coin.
 ## The two framings
 [python-trio/trio#1614][trio-1614] (the canonical "trio +
 fork" hazards thread) puts it this way:
 > If you use `fork()` in a process with multiple threads,
 > all the other thread stacks are just leaked: there's
 > nothing else you can reasonably do with them.
 `tractor.spawn._main_thread_forkserver`'s module docstring
 (specifically the "What survives the fork? — POSIX
 semantics" section) puts it this way:
 > POSIX `fork()` only preserves the *calling* thread as a
 > runnable thread in the child. Every other thread in the
 > parent — trio's runner thread, any `to_thread` cache
 > threads, anything else — never executes another
 > instruction post-fork.
 A reader bouncing between the two can be forgiven for
 asking: well, *which* is it — leaked or gone?
 The answer is "yes". They're describing the same POSIX
 behavior from two different angles:
 - trio is talking about the **bytes** the dead threads
  leave behind — stacks, TLS slots, per-thread arena
  metadata — and the fact that nothing in the child can
  drive them forward, free them, or even safely walk
  them. That's a memory leak in the strict sense: held
  but unreachable.
 - tractor is talking about the **execution** side
  relevant to the forkserver design: which threads
  retire instructions in the child? Exactly one — the
  one that called `fork()`. Everything else, regardless
  of the bytes left behind, is dead in a scheduler
  sense.
 Neither framing is wrong; they're just answering
 different questions.
 ## POSIX `fork()` in a multi-threaded program — what actually happens
 Per POSIX (and concretely on Linux with glibc), the
 contract of `fork()` in a multi-threaded process is:
 1. The kernel creates a new process whose virtual
   address space is a COW copy of the parent's. *All*
   pages map across — code, heap, every thread's stack,
   every malloc arena, every mmap region.
 2. Of the parent's N threads, exactly **one** is
   reified in the child as a runnable kernel task: the
   thread that called `fork()`. The other N-1 threads
   have *no* corresponding kernel task in the child. They
   are never `clone()`d for the child, never scheduled
   there, and never exist as runnable entities.
 3. Their **memory artifacts** — pthread stacks, TLS,
   `pthread_t` structures, glibc per-thread arena
   bookkeeping — are still mapped in the child's address
   space, because (1) duplicates *everything* page-wise.
   They sit there as inert COW bytes.
 4. The kernel does not clean those bytes up. There is no
   "phantom-thread cleanup" pass post-fork. The kernel
   doesn't know which mapped pages "belonged to" which
   thread — at the kernel level mappings are
   process-scoped, not thread-scoped.
 5. The surviving thread (the caller of `fork()`) cannot
   safely access those leaked bytes either. Any state
   they encoded — held mutexes, in-flight syscalls,
   half-updated invariants — is frozen at whatever
   instant the parent's fork-syscall observed it. Some
   of those mutexes may even still be locked from the
   child's PoV (the canonical "fork-in-multithreaded-
   program-deadlocks" hazard; see `man pthread_atfork`).
 So: from the kernel's PoV, the child has one thread.
 From the address-space's PoV, the child has all the
 parent's bytes — including the corpses of the N-1 dead
 threads' stacks. Both true simultaneously.
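
Both views can be observed from Python. The following is a minimal sketch, assuming a POSIX system and CPython; the helper name `fork_with_background_thread` is illustrative and not part of tractor. It forks while a second thread is parked, then has the child report how many threads `threading` sees and whether a value written by the parked thread before the fork is still readable. Note that `threading.active_count()` reflects CPython's own post-fork bookkeeping, used here as a proxy for the kernel-level picture.

```python
import os
import threading
import warnings


def fork_with_background_thread() -> tuple[int, bool]:
    """Fork while a second thread is alive and report what the child sees.

    Returns (thread count seen in the child, whether the child could
    read the background thread's pre-fork write).
    """
    shared: dict[str, bool] = {}
    wrote = threading.Event()
    release = threading.Event()

    def background() -> None:
        shared['written_by_background'] = True  # memory side
        wrote.set()
        release.wait()  # stay alive across the fork

    bg = threading.Thread(target=background, daemon=True)
    bg.start()
    wrote.wait()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # CPython 3.12+ warns when a multi-threaded process forks.
        warnings.simplefilter('ignore', DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread executes here.
        os.close(read_fd)
        saw = int(shared.get('written_by_background', False))
        os.write(write_fd, f'{threading.active_count()},{saw}'.encode())
        os._exit(0)

    # Parent: collect the child's report, then let the thread finish.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        count, saw_write = pipe.read().split(',')
    os.waitpid(pid, 0)
    release.set()
    bg.join()
    return int(count), saw_write == '1'
```

The child reports a single thread (execution side) yet can still read the dictionary entry the background thread wrote (memory side).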
 ## Why trio says "leaked"
 trio's framing makes sense from the parent's
 PoV, looking at *what those threads were doing*. In a
 running `trio.run()` process you typically have:
 - The trio runner thread itself, which owns the epoll
  fd, the signal-wakeup fd, and the run-queue.
 - Threadpool worker threads (`trio.to_thread`'s cache)
  — blocked in `wait()` on the threadpool's work
  condvar.
 - Whatever other ad-hoc threads the application
  started.
 Each of those threads owns *real work-state*: epoll
 registrations, file descriptors held in
 soon-to-be-completed reads, half-released locks, posted
 but unconsumed wakeups. After fork, that state is still
 encoded in the child's memory. None of it is invalid in
 a well-formed-bytes sense. It's just that:
 - The thread that was driving it is gone.
 - Nothing else in the child knows the layout well
  enough to take over.
 - Even if it did, the kernel objects backing the work
  (epoll fd, signal-wakeup fd) have separate post-fork
  semantics that don't compose with userland trio
  state.
 So the bytes are *held* (they're in the child's
 address space, they count against RSS, they survive
 until something clobbers them), and they're
 *unreachable* in any meaningful sense — no thread can
 safely drive them forward. That is the textbook
 definition of a leak.
 trio's quote is reminding the user that `fork()` from a
 multi-threaded process is a one-way memory hazard:
 whatever those threads were doing, that work-state is
 now garbage you happen to still be carrying.
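
The "half-released locks" case is easy to reproduce with a plain `threading.Lock`. This is a minimal sketch, assuming POSIX and CPython; the helper name `child_sees_lock_held` is hypothetical. A background thread takes a lock and parks, the main thread forks, and the child probes whether the lock is still held even though its owner does not run there.

```python
import os
import threading
import warnings


def child_sees_lock_held() -> bool:
    """Return True if a lock held by a non-forking thread is still
    locked when probed from the forked child."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder() -> None:
        with lock:
            holding.set()
            release.wait()  # keep the lock held across the fork

    thread = threading.Thread(target=holder, daemon=True)
    thread.start()
    holding.wait()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # CPython 3.12+ warns when a multi-threaded process forks.
        warnings.simplefilter('ignore', DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        os.close(read_fd)
        # Non-blocking probe: a blocking acquire could never succeed,
        # because the thread that would release the lock is not here.
        got_it = lock.acquire(blocking=False)
        os.write(write_fd, b'0' if got_it else b'1')
        os._exit(0)

    os.close(write_fd)
    with os.fdopen(read_fd, 'rb') as pipe:
        still_held = pipe.read() == b'1'
    os.waitpid(pid, 0)
    release.set()
    thread.join()
    return still_held
```

The lock's bytes are well-formed in the child, but nothing there can ever release it, which is the sense in which that work-state is leaked.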
 ## Why tractor says "gone"
 tractor's `_main_thread_forkserver` framing is concerned
 with a different question: *which thread executes in the
 child, and is it safe?*
 The forkserver design rests on POSIX's "calling thread
 is the sole survivor" guarantee. We pick that calling
 thread very deliberately: a dedicated worker that has
 provably never entered trio. So the thread that *does*
 run in the child is one whose locals, TLS, and stack
 contain nothing trio-related. Trio's runner thread —
 the one that owned the epoll fd and the run-queue — is
 *gone* from the child in the execution sense. It will
 never run another instruction. The fact that its stack
 bytes still exist in the child's address space (the
 "leaked" view) is irrelevant to the forkserver, because
 nothing in the child reads or writes those pages.
 So when the docstring says "Every other thread … is
 gone the instant `fork()` returns in the child", it's
 being precise about the surface that matters for the
 backend: scheduler-level liveness. Nothing schedules
 those threads ever again. Whether their bytes are
 hanging around is a separate (and, for the design,
 non-load-bearing) fact.
 ## Cross-table
 The same tabular layout the `_main_thread_forkserver`
 docstring uses, expanded with a fourth "what handles
 it" column:
 | thread              | parent    | child (executing) | child (memory)               | what handles it             |
 |---------------------|-----------|-------------------|------------------------------|-----------------------------|
 | forkserver worker   | continues | sole survivor     | live stack                   | runs the child's bootstrap  |
 | `trio.run()` thread | continues | not running       | leaked stack (zombie bytes)  | overwritten by child's fresh `trio.run()` |
 | any other thread    | continues | not running       | leaked stack (zombie bytes)  | overwritten / GC'd / clobbered by `exec()` if used |
 The "child (executing)" column is the *execution* side
 of the coin — what tractor cares about. The "child
 (memory)" column is the *memory* side — what trio
 cares about.
 The "what handles it" column is the deliberate punchline
 of the design: nothing has to handle the leaked bytes
 *explicitly*. They get clobbered by ordinary forward
 progress in the child:
 - The fresh `trio.run()` the child boots up allocates
  its own stack, scheduler, and run-queue, which over
  time overlap and overwrite the inherited zombie
  pages.
 - Python's GC walks live objects only; the dead-thread
  Python frames aren't reachable from any
  `PyThreadState`, so they get freed at the next
  collection cycle.
 - If the child eventually `exec()`s, the entire address
  space is replaced and the leak vanishes.
 ## What this means for the forkserver design
 The crucial point is that **the design doesn't and
 *can't* prevent the leak**. There is no userland fix
 for COW thread stacks. The kernel hands the child a
 duplicated address space; that's what `fork()` *is*. No
 amount of pre-fork hookery, `pthread_atfork()`
 gymnastics, or post-fork cleanup can un-COW the dead
 threads' pages without unmapping them, and unmapping
 arbitrary regions of a duplicated address space is
 neither portable nor safe.
 What the design *does* ensure is the orthogonal
 property: the survivor thread is one that doesn't need
 any of that leaked state to function. Concretely:
 - Survivor is the forkserver worker thread.
 - That worker has provably never imported, called into,
  or held any reference to `trio`. (Enforced by keeping
  the worker's lifecycle entirely in
  `_main_thread_forkserver.py` and never letting trio
  task-state cross into it.)
 - So the leaked pages — trio runner stack, threadpool
  caches, etc. — are inert relative to the survivor.
  No code path in the child references them.
 - The child then boots its own fresh `trio.run()`,
  which allocates new state in new pages. Over the
  child's lifetime the COW'd zombie pages get
  overwritten, GC'd, or (if the child eventually
  `exec()`s) discarded wholesale.
 The "leak" is real but inert. It costs RSS until
 clobbered; it doesn't cost correctness. That's exactly
 the property the forkserver pattern is built on, and
 it's also why the design needs the "calling thread is
 trio-free" precondition to be airtight: if the survivor
 were a trio thread, it *would* try to drive the leaked
 trio state, and the leak would no longer be inert.
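
The shape of that precondition can be sketched with the standard library alone. This is an illustration under stated assumptions, not tractor's actual implementation: `forkserver_worker`, `spawn_via_worker`, and the queue-based request protocol are hypothetical names invented here, and `os.waitstatus_to_exitcode` assumes Python 3.9+. The thread that calls `os.fork()` is a dedicated worker whose only job is to fork, so it is the sole survivor in each child.

```python
import os
import queue
import threading
import warnings
from typing import Callable, Optional

Request = Optional[Callable[[], None]]


def forkserver_worker(requests: queue.Queue, replies: queue.Queue) -> None:
    """Dedicated fork thread: it only waits for requests and forks."""
    while True:
        child_main: Request = requests.get()
        if child_main is None:  # shutdown sentinel
            return
        with warnings.catch_warnings():
            # CPython 3.12+ warns when a multi-threaded process forks.
            warnings.simplefilter('ignore', DeprecationWarning)
            pid = os.fork()
        if pid == 0:
            # Child: this worker is the only thread that runs here.
            status = 0
            try:
                child_main()
            except BaseException:
                status = 1
            finally:
                os._exit(status)
        replies.put(pid)  # parent: hand the child's pid back


def spawn_via_worker(child_main: Callable[[], None]) -> int:
    """Fork one child through a fresh worker and return its exit code."""
    requests: queue.Queue = queue.Queue()
    replies: queue.Queue = queue.Queue()
    worker = threading.Thread(
        target=forkserver_worker,
        args=(requests, replies),
    )
    worker.start()
    requests.put(child_main)
    pid = replies.get()
    requests.put(None)
    worker.join()
    _, wait_status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(wait_status)
```

In tractor the survivor goes on to boot a fresh `trio.run()`; here `child_main` stands in for that bootstrap.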
 ## See also
 - `tractor/spawn/_main_thread_forkserver.py` — module
  docstring's "What survives the fork? — POSIX
  semantics" section is the in-tree, code-adjacent
  prose this doc expands on. The cross-table here is a
  fourth-column expansion of the table there.
 - [python-trio/trio#1614][trio-1614] — the trio issue
  with the "leaked" framing, and the canonical thread
  for trio + `fork()` hazards more broadly.
 - [`subint_fork_blocked_by_cpython_post_fork_issue.md`](./subint_fork_blocked_by_cpython_post_fork_issue.md)
  — sibling analysis covering CPython's *post-fork*
  hooks (`PyOS_AfterFork_Child`,
  `_PyInterpreterState_DeleteExceptMain`) and why
  fork-from-non-main-subint is a CPython-level hard
  refusal. Complementary axis: this doc is about POSIX
  semantics; that doc is about the CPython runtime
  layer that runs *after* POSIX `fork()` returns in
  the child.
 - `man pthread_atfork(3)` — canonical "fork in a
  multithreaded process is dangerous" reference.
  Especially the rationale section, which is the
  closest thing to a normative statement of "the
  surviving thread cannot safely use anything the dead
  threads were touching."
 - `man fork(2)` (Linux) — "Other than [the calling
  thread], … no other threads are replicated …"
  paragraph is the kernel-side statement of the
  execution-side framing this doc opens with.
 [trio-1614]: https://github.com/python-trio/trio/issues/1614
--- a/tests/devx/conftest.py
+++ b/tests/devx/conftest.py
@ -65,9 +65,18 @@ def spawn(
    run an `./examples/..` script by name.
    '''
-    if start_method != 'trio':
+    supported_spawners: set[str] = {
        'trio',
        # ?TODO, other spawners that will work?
        # - [ ] need to pass `start_method={spawner}` to underlying
        #      `examples/debugging/<script>.py` somehow?
        # 'main_thread_forkserver',
        # 'subint_forkserver',
    }
    if start_method not in supported_spawners:
        pytest.skip(
-            '`pexpect` based tests only supported on `trio` backend'
+            f'`pexpect` based tests NOT supported on spawning-backend: {start_method!r}\n'
            f'supported-spawners: {supported_spawners!r}'
        )
    def unset_colors():
@ -148,21 +157,38 @@ def spawn(
 def ctlc(
    request: pytest.FixtureRequest,
    ci_env: bool,
-
+    start_method: str,
 ) -> bool:
    '''
    Parametrize and optionally skip tests which handle
    ctlc-in-`pdbp`-REPL testing scenarios; certain spawners and actor-tree depths
    cope very poorly with this..
    In particular the spawning backends from `multiprocessing` are
    fragile, as can be the default `trio` spawner under certain
    conditions where SIGINT is relayed down the entire subproc tree.
    '''
    use_ctlc: bool = request.param
    node = request.node
    markers = node.own_markers
    for mark in markers:
-        if mark.name == 'has_nested_actors':
+        if (
            mark.name == 'has_nested_actors'
            and
            start_method not in {
                # TODO, any spawners we should try again?
                # - [ ] 'trio' but WITHOUT the SIGINT handler setup
                #      per subproc?
                # 'main_thread_forkserver',
            }
        ):
            pytest.skip(
                f'Test {node} has nested actors and fails with Ctrl-C.\n'
                f'The test can sometimes run fine locally but until'
                ' we solve this issue this CI test will be xfail:\n'
                'https://github.com/goodboy/tractor/issues/320'
            )
        if (
            mark.name == 'ctlcs_bish'
            and
--- a/tractor/devx/_stackscope.py
+++ b/tractor/devx/_stackscope.py
@ -47,7 +47,9 @@ from typing import (
 import trio
 from tractor.runtime import _state
 from tractor import log as logmod
-from tractor.devx import debug
+from tractor.devx import (
    debug,
 )
 log = logmod.get_logger()
@ -109,16 +111,29 @@ def dump_task_tree() -> None:
    # |_{Supervisor/Scope
    # |_[Storage/Memory/IPC-Stream/Data-Struct
    fpath: str = f'/tmp/tractor-stackscope-{os.getpid()}.log'
    from . import _pformat
    actor_repr: str = _pformat.nest_from_op(
        input_op='|_',
        text=f'{actor}',
        nest_prefix='|_',
        nest_indent=3,
    )
    full_dump: str = (
        f'Dumping `stackscope` tree for actor\n'
        f'(>: {actor.uid!r}\n'
        f' |_{mp.current_process()}\n'
        f'   |_{thr}\n'
-        f'     |_{actor}\n'
+        # TODO, use the nest_from_op
        f'{actor_repr}'
        # f'     |_{actor}'
        f'\n'
        f'{sigint_handler_report}\n'
        f'signal.getsignal(SIGINT) -> {current_sigint_handler!r}\n'
        f'\n'
        f'capture-bypass tee: {fpath}\n'
        f'(`tail -f {fpath}` to follow across signals)\n'
        f'\n'
        f'------ start-of-{actor.uid!r} ------\n'
        f'|\n'
        f'{tree_str}'
@ -131,7 +146,6 @@ def dump_task_tree() -> None:
    # `--capture=fd` swallows `log.devx()` above; the
    # following two writes guarantee the dump reaches the
    # human even when stdio is captured.
-    fpath: str = f'/tmp/tractor-stackscope-{os.getpid()}.log'
    try:
        with open(fpath, 'a') as f:
            f.write(full_dump + '\n')
@ -151,6 +165,34 @@ def dump_task_tree() -> None:
 _handler_lock = RLock()
 _tree_dumped: bool = False
 # Captured at `enable_stack_on_sig()` time when running
 # inside a trio task. `dump_tree_on_sig` uses this to
 # schedule `dump_task_tree` ON the trio loop via
 # `token.run_sync_soon` so stackscope sees a real current
 # task and can recurse into nursery children. Without
 # it (signal handler running in a non-trio stack frame),
 # `stackscope.extract` only walks the `<init>` task and
 # misses everything inside `async_main`'s nurseries.
 _trio_token: trio.lowlevel.TrioToken|None = None
 def _safe_dump_task_tree() -> None:
    '''
    `run_sync_soon`-friendly wrapper that swallows any
    exception from `dump_task_tree`. Trio prints
    + crashes on uncaught exceptions in scheduled
    callbacks; we'd rather log + keep the test running so
    the user can re-trigger the dump.
    '''
    try:
        dump_task_tree()
    except BaseException:
        log.exception(
            '`dump_task_tree()` raised (scheduled via '
            '`run_sync_soon`); continuing.\n'
        )
 def dump_tree_on_sig(
    sig: int,
@ -174,16 +216,17 @@ def dump_tree_on_sig(
            'Trying to dump `stackscope` tree..\n'
        )
        try:
-            dump_task_tree()
-            # await actor._service_n.start_soon(
-            #     partial(
-            #         trio.to_thread.run_sync,
-            #         dump_task_tree,
-            #     )
-            # )
-            # trio.lowlevel.current_trio_token().run_sync_soon(
-            #     dump_task_tree
-            # )
+            # Prefer scheduling on the trio loop: runs the
+            # dump from a real trio-task context so
+            # `stackscope.extract(recurse_child_tasks=True)`
+            # walks every nursery child instead of seeing
+            # only the `<init>` task. Falls back to a direct
+            # call when no token was captured (e.g. signal
+            # delivered outside a trio.run).
+            if _trio_token is not None:
+                _trio_token.run_sync_soon(_safe_dump_task_tree)
+            else:
+                dump_task_tree()
        except RuntimeError:
            log.exception(
@ -269,11 +312,27 @@ def enable_stack_on_sig(
        )
        return None
    # Capture the trio token if we're inside `trio.run`
    # so SIGUSR1 dispatches the dump *onto* the trio loop
    # (full task-tree visibility). When called outside trio
    # (e.g. from `pytest_configure`), token capture fails
    # silently and `dump_tree_on_sig` falls back to the
    # direct-call path.
    global _trio_token
    try:
        _trio_token = trio.lowlevel.current_trio_token()
    except RuntimeError:
        # not in a `trio.run` — leave None; runtime can
        # re-call `enable_stack_on_sig()` later from
        # inside `async_main` to capture it.
        _trio_token = None
    handler: Callable|int = getsignal(sig)
    if handler is dump_tree_on_sig:
        log.devx(
            'A `SIGUSR1` handler already exists?\n'
            f'|_ {handler!r}\n'
            f'(trio_token captured: {_trio_token is not None})\n'
        )
        return
@ -287,5 +346,6 @@ def enable_stack_on_sig(
        f'{stackscope!r}\n\n'
        f'With `SIGUSR1` handler\n'
        f'|_{dump_tree_on_sig}\n'
        f'(trio_token captured: {_trio_token is not None})\n'
    )
    return stackscope