Add posix-multithreaded-`fork()` explainer doc

2026-04-29 12:50:23 -04:00 · 2026-04-29 12:50:23 -04:00 · 532a9834f3
parent 2917b74ba4
commit 532a9834f3
1 changed files with 281 additions and 0 deletions
--- a/ai/conc-anal/fork_thread_semantics_execution_vs_memory.md
+++ b/ai/conc-anal/fork_thread_semantics_execution_vs_memory.md
@ -0,0 +1,281 @@
+# `fork()` in a multi-threaded program — execution-side vs. memory-side of the same coin
+
+A reference doc for readers who've encountered one of two
+opposite-sounding framings of POSIX `fork()` semantics in a
+multi-threaded program and are confused by the other.
+
+This is a sibling to
+`subint_fork_blocked_by_cpython_post_fork_issue.md` — that
+doc covers a CPython-level refusal of fork-from-subint;
+this one covers the more general POSIX layer, since
+tractor's main-thread forkserver design rests on it.
+
+## TL;DR
+
+POSIX `fork()` only preserves the *calling* thread as a
+runnable thread in the child — every other thread in the
+parent simply never executes another instruction in the
+child. trio's docs call this "leaked"; tractor's
+`_main_thread_forkserver.py` docstring calls it "gone".
+Both are correct: "gone" is the *execution* side (no
+scheduler entry, no instructions retired), "leaked" is the
+*memory* side (the dead threads' stacks and per-thread
+heap structures still ride into the child's address space
+as orphaned COW pages with no owner and no cleanup hook).
+Same POSIX reality, two halves of the same coin.
+
+## The two framings
+
+[python-trio/trio#1614][trio-1614] (the canonical "trio +
+fork" hazards thread) puts it this way:
+
+> If you use `fork()` in a process with multiple threads,
+> all the other thread stacks are just leaked: there's
+> nothing else you can reasonably do with them.
+
+`tractor.spawn._main_thread_forkserver`'s module docstring
+(specifically the "What survives the fork? — POSIX
+semantics" section) puts it this way:
+
+> POSIX `fork()` only preserves the *calling* thread as a
+> runnable thread in the child. Every other thread in the
+> parent — trio's runner thread, any `to_thread` cache
+> threads, anything else — never executes another
+> instruction post-fork.
+
+A reader bouncing between the two can be forgiven for
+asking: well, *which* is it — leaked or gone?
+
+The answer is "yes". They're describing the same POSIX
+behavior from two different angles:
+
+- trio is talking about the **bytes** the dead threads
+  leave behind — stacks, TLS slots, per-thread arena
+  metadata — and the fact that nothing in the child can
+  drive them forward, free them, or even safely walk
+  them. That's a memory leak in the strict sense: held
+  but unreachable.
+- tractor is talking about the **execution** side
+  relevant to the forkserver design: which threads
+  retire instructions in the child? Exactly one — the
+  one that called `fork()`. Everything else, regardless
+  of the bytes left behind, is dead in a scheduler
+  sense.
+
+Neither framing is wrong; they're just answering
+different questions.
+
+## POSIX `fork()` in a multi-threaded program — what actually happens
+
+Per POSIX (and concretely on Linux glibc), the contract
+of `fork()` in a multi-threaded process is:
+
+1. The kernel creates a new process whose virtual
+   address space is a COW copy of the parent's. *All*
+   pages map across — code, heap, every thread's stack,
+   every malloc arena, every mmap region.
+2. Of the parent's N threads, exactly **one** is
+   reified in the child as a runnable kernel task: the
+   thread that called `fork()`. The other N-1 threads
+   have *no* corresponding task in the child kernel. They
+   were never scheduled, never `clone()`d for the child,
+   never exist as runnable entities.
+3. Their **memory artifacts** — pthread stacks, TLS,
+   `pthread_t` structures, glibc per-thread arena
+   bookkeeping — are still mapped in the child's address
+   space, because (1) duplicates *everything* page-wise.
+   They sit there as inert COW bytes.
+4. The kernel does not clean those bytes up. There is no
+   "phantom-thread cleanup" pass post-fork. The kernel
+   doesn't know which mapped pages "belonged to" which
+   thread — at the kernel level mappings are
+   process-scoped, not thread-scoped.
+5. The surviving thread (the caller of `fork()`) cannot
+   safely access those leaked bytes either. Any state
+   they encoded — held mutexes, in-flight syscalls,
+   half-updated invariants — is frozen at whatever
+   instant the parent's fork-syscall observed it. Some
+   of those mutexes may even still be locked from the
+   child's POV (the canonical "fork-in-multithreaded-
+   program-deadlocks" hazard; see `man pthread_atfork`).
+
+So: from the kernel's PoV, the child has one thread.
+From the address-space's PoV, the child has all the
+parent's bytes — including the corpses of the N-1 dead
+threads' stacks. Both true simultaneously.
+
+## Why trio says "leaked"
+
+trio's framing makes sense from the parent's
+PoV, looking at *what those threads were doing*. In a
+running `trio.run()` process you typically have:
+
+- The trio runner thread itself — owns the `selectors`
+  epoll fd, the signal-wakeup-fd, the run-queue.
+- Threadpool worker threads (`trio.to_thread`'s cache)
+  — blocked in `wait()` on the threadpool's work
+  condvar.
+- Whatever other ad-hoc threads the application
+  started.
+
+Each of those threads owns *real work-state*: epoll
+registrations, file descriptors held in
+soon-to-be-completed reads, half-released locks, posted
+but unconsumed wakeups. After fork, that state is still
+encoded in the child's memory. None of it is invalid in
+a well-formed-bytes sense. It's just that:
+
+- The thread that was driving it is gone.
+- Nothing else in the child knows the layout well
+  enough to take over.
+- Even if it did, the kernel objects backing the work
+  (epoll fd, signalfd) have separate post-fork
+  semantics that don't compose with userland trio
+  state.
+
+So the bytes are *held* (they're in the child's
+address space, they count against RSS, they survive
+until something clobbers them), and they're
+*unreachable* in any meaningful sense — no thread can
+safely drive them forward. That is the textbook
+definition of a leak.
+
+trio's quote is reminding the user that `fork()` from a
+multi-threaded process is a one-way memory hazard:
+whatever those threads were doing, that work-state is
+now garbage you happen to still be carrying.
+
+## Why tractor says "gone"
+
+tractor's `_main_thread_forkserver` framing is concerned
+with a different question: *which thread executes in the
+child, and is it safe?*
+
+The forkserver design rests on POSIX's "calling thread
+is the sole survivor" guarantee. We pick that calling
+thread very deliberately: a dedicated worker that has
+provably never entered trio. So the thread that *does*
+run in the child is one whose locals, TLS, and stack
+contain nothing trio-related. Trio's runner thread —
+the one that owned the epoll fd and the run-queue — is
+*gone* from the child in the execution sense. It will
+never run another instruction. The fact that its stack
+bytes still exist in the child's address space (the
+"leaked" view) is irrelevant to the forkserver, because
+nothing in the child reads or writes those pages.
+
+So when the docstring says "Every other thread … is
+gone the instant `fork()` returns in the child", it's
+being precise about the surface that matters for the
+backend: scheduler-level liveness. Nothing schedules
+those threads ever again. Whether their bytes are
+hanging around is a separate (and, for the design,
+non-load-bearing) fact.
+
+## Cross-table
+
+The same tabular layout the `_main_thread_forkserver`
+docstring uses, expanded with a fourth "what handles
+it" column:
+
+| thread              | parent    | child (executing) | child (memory)               | what handles it             |
+|---------------------|-----------|-------------------|------------------------------|-----------------------------|
+| forkserver worker   | continues | sole survivor     | live stack                   | runs the child's bootstrap  |
+| `trio.run()` thread | continues | not running       | leaked stack (zombie bytes)  | overwritten by child's fresh `trio.run()` |
+| any other thread    | continues | not running       | leaked stack (zombie bytes)  | overwritten / GC'd / clobbered by `exec()` if used |
+
+The "child (executing)" column is the *execution* side
+of the coin — what tractor cares about. The "child
+(memory)" column is the *memory* side — what trio
+cares about.
+
+The "what handles it" column is the deliberate punchline
+of the design: nothing has to handle the leaked bytes
+*explicitly*. They get clobbered by ordinary forward
+progress in the child:
+
+- The fresh `trio.run()` the child boots up allocates
+  its own stack, scheduler, and run-queue, which over
+  time overlaps and overwrites the inherited zombie
+  pages.
+- Python's GC walks live objects only; the dead-thread
+  Python frames aren't reachable from any
+  `PyThreadState`, so they get freed at the next
+  collection cycle.
+- If the child eventually `exec()`s, the entire address
+  space is replaced and the leak vanishes.
+
+## What this means for the forkserver design
+
+The crucial point is that **the design doesn't and
+*can't* prevent the leak**. There is no userland fix
+for COW thread stacks. The kernel hands the child a
+duplicated address space; that's what `fork()` *is*. No
+amount of pre-fork hookery, `pthread_atfork()`
+gymnastics, or post-fork cleanup can un-COW the dead
+threads' pages without unmapping them, and unmapping
+arbitrary regions of a duplicated address space is
+neither portable nor safe.
+
+What the design *does* ensure is the orthogonal
+property: the survivor thread is one that doesn't need
+any of that leaked state to function. Concretely:
+
+- Survivor is the forkserver worker thread.
+- That worker has provably never imported, called into,
+  or held any reference to `trio`. (Enforced by keeping
+  the worker's lifecycle entirely in
+  `_main_thread_forkserver.py` and never letting trio
+  task-state cross into it.)
+- So the leaked pages — trio runner stack, threadpool
+  caches, etc. — are inert relative to the survivor.
+  No code path in the child references them.
+- The child then boots its own fresh `trio.run()`,
+  which allocates new state in new pages. Over the
+  child's lifetime the COW'd zombie pages get
+  overwritten, GC'd, or (if the child eventually
+  `exec()`s) discarded wholesale.
+
+The "leak" is real but inert. It costs RSS until
+clobbered; it doesn't cost correctness. That's exactly
+the property the forkserver pattern is built on, and
+it's also why the design needs the "calling thread is
+trio-free" precondition to be airtight: if the survivor
+were a trio thread, it *would* try to drive the leaked
+trio state, and the leak would no longer be inert.
+
+## See also
+
+- `tractor/spawn/_main_thread_forkserver.py` — module
+  docstring's "What survives the fork? — POSIX
+  semantics" section is the in-tree, code-adjacent
+  prose this doc expands on. The cross-table here is a
+  fourth-column expansion of the table there.
+
+- [python-trio/trio#1614][trio-1614] — the trio issue
+  with the "leaked" framing, and the canonical thread
+  for trio + `fork()` hazards more broadly.
+
+- [`subint_fork_blocked_by_cpython_post_fork_issue.md`](./subint_fork_blocked_by_cpython_post_fork_issue.md)
+  — sibling analysis covering CPython's *post-fork*
+  hooks (`PyOS_AfterFork_Child`,
+  `_PyInterpreterState_DeleteExceptMain`) and why
+  fork-from-non-main-subint is a CPython-level hard
+  refusal. Complementary axis: this doc is about POSIX
+  semantics; that doc is about the CPython runtime
+  layer that runs *after* POSIX `fork()` returns in
+  the child.
+
+- `man pthread_atfork(3)` — canonical "fork in a
+  multithreaded process is dangerous" reference.
+  Especially the rationale section, which is the
+  closest thing to a normative statement of "the
+  surviving thread cannot safely use anything the dead
+  threads were touching."
+
+- `man fork(2)` (Linux) — "Other than [the calling
+  thread], … no other threads are replicated …"
+  paragraph is the kernel-side statement of the
+  execution-side framing this doc opens with.
+
+[trio-1614]: https://github.com/python-trio/trio/issues/1614