Fix `SharedMemory` under `subint_forkserver`

Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.

Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
  `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
  now run unconditionally:
  * monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
    `ManTracker` subclass (empty `register` / `unregister`
    / `ensure_running`).
  * return `partial(SharedMemory, track=False)` for the per-allocation
    opt-out.
  * belt + suspenders: even if something dodges the wrapper, the
    singleton can't talk to the inherited (broken) parent fd.

- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
  skip of the unlink-callback; install a `try_unlink()` wrapper that
  swallows `FileNotFoundError` (sibling-already-cleaned race in
  shared-key setups). Without `mp.resource_tracker` doing it for us, we
  own the unlink — `actor. lifetime_stack` is the right place since
  tractor already controls actor lifecycle.

- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
  module-level skip- list (tests pass now). Inline comment cross-refs
  the two `_mp_bs` / `_shm` workarounds.

- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
  rewrite — flips status from "open / unresolvable in tractor" to
  "resolved, kept as decision record". Adds Resolution section, "Why
  this is the right call" rationale (mp tracker is widely criticized;
  tractor already owns lifecycle), trade-offs (crash-leaked segments,
  lost mp leak warning), verification (7 passed under both
  `subint_forkserver` and `trio` backends), and upstream issue links

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-27 10:51:28 -04:00
parent c99d475d03
commit aa3e230926
4 changed files with 181 additions and 122 deletions

View File

@ -1,24 +1,29 @@
# `subint_forkserver` × `multiprocessing.SharedMemory`: incompatible-by-mp-design # `subint_forkserver` × `multiprocessing.SharedMemory`: fork-inherited `resource_tracker` fd
Surfaced by `tests/test_shm.py` under Surfaced by `tests/test_shm.py` under
`--spawn-backend=subint_forkserver`. Both test functions `--spawn-backend=subint_forkserver`. Two distinct
fail with distinct symptoms that share one root cause: failure modes, one root cause:
**`multiprocessing.resource_tracker` is fork-without-exec **`multiprocessing.resource_tracker` is fork-without-exec
unsafe.** unsafe** (canonical CPython class — bpo-38119, bpo-45209).
**Status: resolved by `tractor/ipc/_mp_bs.py` +
`tractor/ipc/_shm.py` changes (see "Resolution" below).
This doc kept as the
post-mortem / decision record.**
## TL;DR ## TL;DR
`mp.shared_memory.SharedMemory` registers each shm `mp.shared_memory.SharedMemory` registers each shm
allocation with the per-process allocation with the per-process
`multiprocessing.resource_tracker` singleton. The `multiprocessing.resource_tracker` singleton. The
tracker is a daemon process started lazily, and the tracker is a daemon process started lazily; the
parent owns a unix-pipe-fd to it. When the parent parent owns a unix-pipe-fd to it. When the parent
forks-without-execing into a `subint_forkserver` forks-without-execing into a `subint_forkserver`
child, the child inherits that fd — but the fd refers child, the child inherits that fd — but it refers to
to the *parent's* tracker, which the child has no the *parent's* tracker, which the child has no
business writing to. business writing to.
Two manifestations: Two manifestations under the original (pre-fix) code:
1. **`test_child_attaches_alot`** — child loops 1000× 1. **`test_child_attaches_alot`** — child loops 1000×
`attach_shm_list()`. First `mp.SharedMemory` call `attach_shm_list()`. First `mp.SharedMemory` call
@ -37,89 +42,140 @@ Two manifestations:
`FileExistsError: '/shm_list'` because the leak `FileExistsError: '/shm_list'` because the leak
persists across the parametrize loop and forkserver persists across the parametrize loop and forkserver
children can't `shm_open(create=True)` an existing children can't `shm_open(create=True)` an existing
key. Trio backend doesn't surface this because key.
Trio backend (`mp_spawn`-style) doesn't surface this:
each subactor `exec`s a fresh interpreter → each subactor `exec`s a fresh interpreter →
independent resource tracker per subactor → no independent resource tracker per subactor → no
inherited-fd issue, and the test's pre-existing inherited-fd issue, and the test's pre-existing leak
leak is masked by the per-process tracker reset. gets masked by the per-process tracker reset.
## Why trio backend works Under `subint_forkserver`, the child is `os.fork()`'d
from a worker thread (no `exec`) → inherits parent's
`mp.resource_tracker._resource_tracker._fd` → EBADF
/ cross-talk on first `mp.SharedMemory` op.
Under `--spawn-backend=trio`, each subactor is born ## Resolution
via `python -m tractor._child` (full `execve`) →
fresh interpreter → fresh module-level globals →
`mp.resource_tracker._resource_tracker` is `None`
until first use → `mp.SharedMemory` constructs its
own tracker, talks to its own pipe-fd. No cross-
process fd inheritance.
Under `subint_forkserver`, the child is We side-step the broken upstream machinery entirely
`os.fork()`'d from a worker thread of the parent rather than try to make it fork-safe. Two-part fix
(no `exec`) → inherits parent's landed (commits to follow this doc):
`mp.resource_tracker._resource_tracker._fd`
EBADF / cross-talk on first `mp.SharedMemory`
operation in the child.
## Status ### 1. `tractor/ipc/_mp_bs.py::disable_mantracker()`
— unconditional disable
**Not a tractor bug.** This is the canonical The previous "3.13+ short-circuit" path used
"fork-without-exec breaks `multiprocessing` `partial(SharedMemory, track=False)` to opt-out of
internals" class — see CPython issues: registration on 3.13+. The `track=False` switch is
necessary but not sufficient under fork: the
inherited tracker fd can still be touched indirectly
(e.g. through `_ensure_running_and_write`'s
self-check path).
- https://bugs.python.org/issue38119 The fix takes both belts AND suspenders:
- https://bugs.python.org/issue45209
Pure-`fork` start method has the same incompatibility; - **Always** monkey-patch
that's why `mp` itself defaults to `spawn` on macOS `mp.resource_tracker._resource_tracker` to a
and `forkserver`/`spawn` on Linux post-3.14. no-op `ManTracker` subclass whose
`register`/`unregister`/`ensure_running` are all
empty.
- **Always** wrap `SharedMemory` with
`track=False`.
## Mitigation Result: the inherited tracker fd in the fork child
is still inherited (fd is a kernel object; we can't
un-inherit it across fork) but **nothing in the
shm code path will ever try to use it** — both the
tracker singleton and the per-allocation registration
are short-circuited.
`tests/test_shm.py` is module-marked with ### 2. `tractor/ipc/_shm.py::open_shm_list()`
`pytest.mark.skipon_spawn_backend('subint_forkserver', — own the cleanup
'subint', reason=...)` pointing at this doc.
Two longer-term options if we ever want shm tests under Without `mp.resource_tracker`, nobody else will
`subint_forkserver`: unlink leaked segments at process exit. tractor
already controls actor lifecycle, so we register
unlink on the actor's lifetime stack:
1. **Reset the inherited tracker fd in the child ```python
prelude** — def try_unlink():
`tractor/spawn/_subint_forkserver.py::_child_target` try:
already calls `_close_inherited_fds()`. We could shml.shm.unlink()
additionally explicitly clear except FileNotFoundError as fne:
`multiprocessing.resource_tracker._resource_tracker` log.exception(...) # benign sibling-already-cleaned race
so the child re-creates a fresh tracker on first
shm op. **Caveat**: this means each
forkserver-subactor spawns its own resource-tracker
daemon-process, multiplying daemon-proc count by
subactor count. mp authors deliberately avoided
this — the tracker is meant to be a per-mp-context
singleton.
2. **Stop using `multiprocessing.shared_memory`** actor.lifetime_stack.callback(try_unlink)
migrate to `posix_ipc` directly (no resource ```
tracker) or finish the `hotbaud`-based ringbuf
transport that already supersedes shm in many
`tractor` IPC paths.
Neither is in scope for the The `FileNotFoundError` swallow handles the case
`subint_forkserver`-backend-lands PR; both are tracked where a sibling actor already unlinked the same
out as future work. segment (legitimate race in shared-key setups).
## Reproducer ## Why this is the right call
- **mp's tracker is widely criticized.** The
in-tree comment "non-SC madness" predates this
fix and matches CPython upstream's own discomfort
(e.g. the per-context tracker design rework
discussions in bpo-43475).
- **tractor already owns process lifecycle.** We
have `actor.lifetime_stack`, `Portal.cancel_actor`,
and the IPC cancel cascade. Adding mp's tracker
on top buys nothing we can't do better ourselves.
- **Backend-uniform.** No special-casing per spawn
backend. trio (`mp_spawn`-style), `subint_forkserver`,
and the future `subint` all behave identically
— register-time no-op, exit-time unlink-via-
lifetime-stack.
## Trade-offs / known gaps
- **Crash-leaked segments.** If an actor segfaults
or is `SIGKILL`'d before its lifetime stack runs,
`/dev/shm/<key>` will leak. Mitigations:
- `tractor-reap` (the new
`scripts/tractor-reap` CLI) doesn't yet sweep
`/dev/shm` — could extend it.
- Higher-level apps using shm should pin a UUID
into the key (the `'shml_<uuid>'` pattern in
`test_child_attaches_alot`) so leaks are
distinct per session and easy to GC out-of-band.
- **Cross-actor unlink races.** Two actors holding
the same shm key racing on `unlink()` — handled
by the `FileNotFoundError` swallow.
- **Crashes won't show up in mp's leak warning.**
We've turned off `resource_tracker`, so the usual
`resource_tracker: There appear to be N leaked
shared_memory objects to clean up at shutdown`
warning is gone too. If we ever want it back as
a crash-detection signal, we'd need our own
equivalent (walk the actor's `_shm_list_keys` set
at root teardown, log any unfreed).
## Verification
```sh ```sh
# fail mode 1 (EBADF on resource_tracker._fd): # fixed under both backends:
./py314/bin/python -m pytest \ ./py314/bin/python -m pytest tests/test_shm.py \
tests/test_shm.py::test_child_attaches_alot \
--spawn-backend=subint_forkserver --tb=short
# fail mode 2 (FileExistsError on /shm_list):
./py314/bin/python -m pytest \
tests/test_shm.py::test_parent_writer_child_reader \
--spawn-backend=subint_forkserver --spawn-backend=subint_forkserver
# 7 passed
# baseline (passes): ./py314/bin/python -m pytest tests/test_shm.py \
./py314/bin/python -m pytest \ --spawn-backend=trio
tests/test_shm.py --spawn-backend=trio # 7 passed (regression check)
``` ```
## References
- CPython upstream issues:
- https://bugs.python.org/issue38119 (fork
+ resource_tracker fd inheritance)
- https://bugs.python.org/issue45209
(SharedMemory + resource_tracker)
- https://bugs.python.org/issue43475
(per-context tracker rework discussion)
- Long-term alternative: migrate off
`multiprocessing.shared_memory` entirely to
`posix_ipc` (no tracker) or finish the
`hotbaud`-based ringbuf transport. Not blocked on
this fix — both are independently tracked.

View File

@ -17,6 +17,10 @@ from tractor.ipc._shm import (
pytestmark = pytest.mark.skipon_spawn_backend( pytestmark = pytest.mark.skipon_spawn_backend(
'subint', 'subint',
# 'subint_forkserver', # 'subint_forkserver',
# XXX we hack around this stdlib limitation by both,
# - setting `ShareMemory(track=False)`
# - overriding the `mp.ResourceTracker` nonsense in
# `.ipc._mp_bs`.
reason=( reason=(
'subint: GIL-contention hanging class.\n' 'subint: GIL-contention hanging class.\n'
'subint_forkserver: `multiprocessing.SharedMemory` ' 'subint_forkserver: `multiprocessing.SharedMemory` '

View File

@ -17,7 +17,7 @@
Utils to tame mp non-SC madeness Utils to tame mp non-SC madeness
''' '''
import platform from functools import partial
def disable_mantracker(): def disable_mantracker():
@ -27,30 +27,13 @@ def disable_mantracker():
''' '''
from multiprocessing.shared_memory import SharedMemory from multiprocessing.shared_memory import SharedMemory
# 3.13+ only.. can pass `track=False` to disable
# all the resource tracker bs.
# https://docs.python.org/3/library/multiprocessing.shared_memory.html
if (_py_313 := (
platform.python_version_tuple()[:-1]
>=
('3', '13')
)
):
from functools import partial
return partial(
SharedMemory,
track=False,
)
# !TODO, once we drop 3.12- we can obvi remove all this!
else:
from multiprocessing import ( from multiprocessing import (
resource_tracker as mantracker, resource_tracker as mantracker,
) )
# Tell the "resource tracker" thing to fuck off. # XXX ALWAYS disable the stdlib's "resource tracker"; it prevents
# fork backends and never was useful to us since we're SC
# lifetime managing all allocations.
class ManTracker(mantracker.ResourceTracker): class ManTracker(mantracker.ResourceTracker):
def register(self, name, rtype): def register(self, name, rtype):
pass pass
@ -69,7 +52,12 @@ def disable_mantracker():
mantracker.unregister = mantracker._resource_tracker.unregister mantracker.unregister = mantracker._resource_tracker.unregister
mantracker.getfd = mantracker._resource_tracker.getfd mantracker.getfd = mantracker._resource_tracker.getfd
# use std type verbatim # 3.13+ only.. can pass `track=False` to disable
shmT = SharedMemory # all the resource tracker bs.
# https://docs.python.org/3/library/multiprocessing.shared_memory.html
shmT = partial(
SharedMemory,
track=False,
)
return shmT return shmT

View File

@ -929,15 +929,26 @@ def open_shm_list(
# "close" attached shm on actor teardown # "close" attached shm on actor teardown
try: try:
actor = tractor.current_actor() actor = tractor.current_actor()
actor.lifetime_stack.callback(shml.shm.close) actor.lifetime_stack.callback(shml.shm.close)
# XXX on 3.13+ we don't need to call this? # >XXX NOTE< on 3.13+ we need to call this AS WELL AS pass
# -> bc we pass `track=False` for `SharedMemeory` orr? # `track=False` for `mp.SharedMemeory` otherwise fork based
if ( # backends will error out due to long lived stdlib
platform.python_version_tuple()[:-1] < ('3', '13') # limitations,
): # - https://bugs.python.org/issue38119
actor.lifetime_stack.callback(shml.shm.unlink) # - https://bugs.python.org/issue45209
#
def try_unlink():
try:
shml.shm.unlink()
except FileNotFoundError as fne:
log.debug(
f'ShmList already deallocated pre-actor-shutdown.\n'
f'{fne!r}\n'
)
actor.lifetime_stack.callback(try_unlink)
except RuntimeError: except RuntimeError:
log.warning('tractor runtime not active, skipping teardown steps') log.warning('tractor runtime not active, skipping teardown steps')