Compare commits
8 Commits
d60245777e
...
9bbb6f796b
| Author | SHA1 | Date |
|---|---|---|
|
|
9bbb6f796b | |
|
|
a24600f1a7 | |
|
|
92443dc4ef | |
|
|
d3cbc92751 | |
|
|
099104e0af | |
|
|
abd3950ba6 | |
|
|
7d1e4462d4 | |
|
|
522b57570b |
|
|
@ -83,10 +83,27 @@ jobs:
|
|||
|
||||
|
||||
testing:
|
||||
name: '${{ matrix.os }} Python${{ matrix.python-version }} spawn_backend=${{ matrix.spawn_backend }} tpt_proto=${{ matrix.tpt_proto }}'
|
||||
timeout-minutes: 16
|
||||
name: '${{ matrix.os }} Python${{ matrix.python-version }} spawn_backend=${{ matrix.spawn_backend }} tpt_proto=${{ matrix.tpt_proto }} capture=${{ matrix.capture }}'
|
||||
timeout-minutes: 20
|
||||
runs-on: ${{ matrix.os }}
|
||||
|
||||
# NOTE on the matrix shape — the `capture=` mode follows
|
||||
# `spawn_backend`:
|
||||
#
|
||||
# - `trio` / `mp_*` backends use `--capture=fd` (default)
|
||||
# for per-test attribution of subactor *raw-fd* output
|
||||
# in failure reports.
|
||||
# - Fork-based backends (`main_thread_forkserver`,
|
||||
# `subint_forkserver`) REQUIRE `--capture=sys` because
|
||||
# fork-child × `--capture=fd` is a known deadlock
|
||||
# pattern. See the long NOTE in `tractor._testing.pytest`'s
|
||||
# `pytest_load_initial_conftests` for the mechanism +
|
||||
# tradeoff write-up.
|
||||
#
|
||||
# If a future matrix row adds a fork-spawn backend
|
||||
# WITHOUT setting `capture: 'sys'`, the
|
||||
# `pytest_load_initial_conftests` hook fail-fasts on `CI=1`
|
||||
# with a clear error msg. So the matrix is self-policing.
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
|
|
@ -113,6 +130,26 @@ jobs:
|
|||
'tcp',
|
||||
'uds',
|
||||
]
|
||||
capture: [
|
||||
'fd', # default for non-fork backends
|
||||
]
|
||||
|
||||
# Fork-based backends — added via `include:` so each
|
||||
# cell carries its REQUIRED `capture: 'sys'` mode.
|
||||
# Linux-only for now; macOS coverage TBD pending
|
||||
# local validation.
|
||||
include:
|
||||
- os: ubuntu-latest
|
||||
python-version: '3.13'
|
||||
spawn_backend: 'main_thread_forkserver'
|
||||
tpt_proto: 'tcp'
|
||||
capture: 'sys'
|
||||
- os: ubuntu-latest
|
||||
python-version: '3.13'
|
||||
spawn_backend: 'main_thread_forkserver'
|
||||
tpt_proto: 'uds'
|
||||
capture: 'sys'
|
||||
|
||||
# https://github.com/orgs/community/discussions/26253#discussioncomment-3250989
|
||||
exclude:
|
||||
# don't do UDS run on macOS (for now)
|
||||
|
|
@ -153,8 +190,11 @@ jobs:
|
|||
-rsx
|
||||
--spawn-backend=${{ matrix.spawn_backend }}
|
||||
--tpt-proto=${{ matrix.tpt_proto }}
|
||||
--capture=fd
|
||||
# ^XXX^ can't work with --spawn-method=main_thread_forkserver
|
||||
--capture=${{ matrix.capture }}
|
||||
# NOTE: capture mode is matrix-driven — `fd` for
|
||||
# non-fork backends (per-test fd attribution),
|
||||
# `sys` for fork-based (avoids fork-child x
|
||||
# capture-fd deadlock). See matrix-NOTE above.
|
||||
|
||||
# XXX legacy NOTE XXX
|
||||
#
|
||||
|
|
|
|||
|
|
@ -0,0 +1,142 @@
|
|||
# Spawn-time boot-death (`rc=2`) under rapid same-name spawn against a registrar
|
||||
|
||||
## Symptom
|
||||
|
||||
Spawning N (≥4) sub-actors with the **same name** in tight
|
||||
succession against a daemon registrar surfaces as
|
||||
`ActorFailure: Sub-actor (...) died during boot (rc=2)
|
||||
before completing parent-handshake`.
|
||||
|
||||
```
|
||||
tests/discovery/test_multi_program.py
|
||||
::test_dup_name_cancel_cascade_escalates_to_hard_kill[n_dups=4]
|
||||
```
|
||||
|
||||
```
|
||||
tractor._exceptions.ActorFailure:
|
||||
Sub-actor ('doggy', '<uuid>') died during boot (rc=2)
|
||||
before completing parent-handshake.
|
||||
proc: <_ForkedProc pid=<n> returncode=None>
|
||||
```
|
||||
|
||||
The `proc` repr shows `returncode=None` because the repr is
|
||||
captured before `proc.wait()` returns; the actual
|
||||
`os.WEXITSTATUS == 2` is reported via `result['died']` in the
|
||||
race-helper.
|
||||
|
||||
## When it surfaces
|
||||
|
||||
- N=2 (`n_dups=2`): **always passes**.
|
||||
- N=4 (`n_dups=4`): **consistent fail** under both `tpt-proto=tcp`
|
||||
and `tpt-proto=uds`, MTF backend.
|
||||
- N=8 (`n_dups=8`): **passes** (counter-intuitive — see "racing
|
||||
windows").
|
||||
- Non-MTF backends: not yet exercised systematically.
|
||||
|
||||
## What previously masked it
|
||||
|
||||
Pre the spawn-time `wait_for_peer_or_proc_death` race-helper
|
||||
(in `tractor.spawn._spawn`), the parent's `start_actor` flow
|
||||
ended with a bare:
|
||||
|
||||
```python
|
||||
event, chan = await ipc_server.wait_for_peer(uid)
|
||||
```
|
||||
|
||||
That awaits an unsignalled `trio.Event` on `_peer_connected[uid]`.
|
||||
If the sub-actor process **dies during boot** (before its
|
||||
runtime executes the parent-callback handshake that sets the
|
||||
event), the wait parks forever. The dead proc becomes a zombie
|
||||
because no one ever calls `proc.wait()` to reap it.
|
||||
|
||||
In test contexts the failure presented as a hang or a much
|
||||
later `trio.TooSlowError` from an outer `fail_after`. In
|
||||
production it'd present as a parent that never makes progress
|
||||
past `start_actor`. The death itself was silently masked.
|
||||
|
||||
## What surfaces it now
|
||||
|
||||
`tractor.spawn._spawn.wait_for_peer_or_proc_death` (used by
|
||||
`_main_thread_forkserver_proc`) races the handshake-wait
|
||||
against `proc.wait()`. The race-helper raises `ActorFailure`
|
||||
on death-first instead of parking, exposing the rc=2.
|
||||
|
||||
## Hypothesis: registrar-side same-name contention
|
||||
|
||||
The test spawns N actors with name `doggy` sequentially:
|
||||
|
||||
```python
|
||||
for i in range(n_dups):
|
||||
p: Portal = await an.start_actor('doggy')
|
||||
portals.append(p)
|
||||
```
|
||||
|
||||
Each spawned doggy:
|
||||
|
||||
1. Forks via the forkserver.
|
||||
2. Boots its runtime in `_actor_child_main`.
|
||||
3. Connects back to the parent for handshake.
|
||||
4. Connects to the daemon registrar to call `register_actor`.
|
||||
5. Enters its RPC msg-loop.
|
||||
|
||||
Step (4) is where the same-name contention lives. The
|
||||
registrar's `register_actor` (in
|
||||
`tractor.discovery._registry`) accepts duplicate names
|
||||
(stores `(name, uuid) -> addr`), but its internal bookkeeping
|
||||
may have a non-trivial check (e.g. `wait_for_actor` resolution,
|
||||
`_addrs2aids` map updates) that errors out under specific
|
||||
ordering between the existing entry and the incoming one.
|
||||
|
||||
`rc=2 == os.WEXITSTATUS == 2` corresponds to `sys.exit(2)`
|
||||
in the doggy process — typically reached via an unhandled
|
||||
exception that's translated to exit code 2 by Python's top-
|
||||
level (e.g. `argparse` errors use 2; `SystemExit(2)` etc.).
|
||||
So the doggy is hitting an explicit exit path during
|
||||
`register_actor` or just-after.
|
||||
|
||||
The non-monotonic shape (N=2 OK, N=4 BAD, N=8 OK) suggests a
|
||||
specific timing window — likely "the 3rd register-RPC arrives
|
||||
while the 1st-or-2nd is in some intermediate state". With
|
||||
N=8, the additional procs widen the registration spread
|
||||
enough that no two land in the conflicting window.
|
||||
|
||||
## Where to dig next
|
||||
|
||||
- Add per-actor logging in `_actor_child_main` and
|
||||
`register_actor` to surface the actual exception that
|
||||
triggers the rc=2 exit. Currently the doggy dies before
|
||||
the parent ever sees its stderr (forkserver doesn't
|
||||
marshal child stdio back).
|
||||
- Race-test the registrar's `register_actor` /
|
||||
`unregister_actor` / `wait_for_actor` against same-name
|
||||
concurrent calls in isolation (no spawn).
|
||||
- Consider whether `register_actor` should be idempotent
|
||||
under same-name re-register or should explicitly reject
|
||||
same-name (and ideally with a clear `RemoteActorError`,
|
||||
not `sys.exit(2)`).
|
||||
|
||||
## Test-suite handling
|
||||
|
||||
Currently:
|
||||
|
||||
- `tests/discovery/test_multi_program.py
|
||||
::test_dup_name_cancel_cascade_escalates_to_hard_kill[n_dups=4]`
|
||||
is `pytest.mark.xfail(strict=False, reason=...)` to keep
|
||||
the suite green while this issue is investigated.
|
||||
- `n_dups=2` and `n_dups=8` continue to validate the
|
||||
cancel-cascade hard-kill escalation.
|
||||
|
||||
Once the underlying race is understood + fixed, drop the
|
||||
xfail.
|
||||
|
||||
## Related work
|
||||
|
||||
- The cancel-cascade fix that introduced this regression
|
||||
test:
|
||||
`tractor/_exceptions.py:ActorTooSlowError`,
|
||||
`tractor/runtime/_supervise.py:_try_cancel_then_kill`,
|
||||
`tractor/runtime/_portal.py:Portal.cancel_actor(
|
||||
raise_on_timeout=...)`.
|
||||
- The spawn-time death-detection that exposed this:
|
||||
`tractor/spawn/_spawn.py:wait_for_peer_or_proc_death`,
|
||||
used by `tractor/spawn/_main_thread_forkserver.py`.
|
||||
|
|
@ -144,32 +144,37 @@ def test_register_duplicate_name(
|
|||
trio.run(main)
|
||||
|
||||
|
||||
# `n_dups` in {4, 8} both expose the SAME pre-existing race:
|
||||
# under rapid same-name spawning against a forkserver +
|
||||
# registrar, ONE of the spawned doggies `sys.exit(2)`s during
|
||||
# boot before completing parent-handshake. Surfaces now (post
|
||||
# the spawn-time `wait_for_peer_or_proc_death` fix) as
|
||||
# `ActorFailure rc=2`; previously it was silently masked by
|
||||
# the handshake-wait parking forever.
|
||||
#
|
||||
# Larger `n_dups` widens the race window so the boot-race
|
||||
# fires more often — n_dups=4 hits ~always, n_dups=8 hits
|
||||
# occasionally. Both xfail(strict=False) so the cancel-cascade
|
||||
# regression-check still passes when the boot-race happens
|
||||
# NOT to fire.
|
||||
#
|
||||
# Tracked separately in,
|
||||
# https://github.com/goodboy/tractor/issues/456
|
||||
_DOGGY_BOOT_RACE_XFAIL = pytest.mark.xfail(
|
||||
strict=False,
|
||||
reason=(
|
||||
'doggy boot-race rc=2 under rapid same-name '
|
||||
'spawn — separate bug from cancel-cascade'
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
'n_dups',
|
||||
[
|
||||
2,
|
||||
# `n_dups=4` exposes a SEPARATE pre-existing race: under
|
||||
# rapid same-name spawning against a forkserver +
|
||||
# registrar, ONE of the spawned doggies (typically the
|
||||
# 3rd) `sys.exit(2)`s during boot before completing
|
||||
# parent-handshake. Surfaces now (post the spawn-time
|
||||
# `wait_for_peer_or_proc_death` fix) as `ActorFailure
|
||||
# rc=2`; previously it was silently masked by the
|
||||
# handshake-wait parking forever.
|
||||
#
|
||||
# Tracked separately in,
|
||||
# https://github.com/goodboy/tractor/issues/456
|
||||
pytest.param(
|
||||
4,
|
||||
marks=pytest.mark.xfail(
|
||||
strict=False,
|
||||
reason=(
|
||||
'doggy boot-race rc=2 under rapid same-name '
|
||||
'spawn — separate bug from cancel-cascade'
|
||||
),
|
||||
),
|
||||
),
|
||||
8,
|
||||
pytest.param(4, marks=_DOGGY_BOOT_RACE_XFAIL),
|
||||
pytest.param(8, marks=_DOGGY_BOOT_RACE_XFAIL),
|
||||
],
|
||||
ids=lambda n: f'n_dups={n}',
|
||||
)
|
||||
|
|
|
|||
|
|
@ -22,6 +22,20 @@ from tractor.discovery._multiaddr import mk_maddr
|
|||
import trio
|
||||
|
||||
|
||||
pytestmark = pytest.mark.usefixtures(
|
||||
'reap_subactors_per_test',
|
||||
# NOTE, registrar tests stress the discovery
|
||||
# roundtrip (find_actor / wait_for_actor) which
|
||||
# historically left orphaned UDS sock-files when
|
||||
# subactor `hard_kill` SIGKILL'd, and which
|
||||
# exercises the same trio `WakeupSocketpair`
|
||||
# peer-disconnect path that triggered the
|
||||
# busy-loop bug class.
|
||||
'track_orphaned_uds_per_test',
|
||||
'detect_runaway_subactors_per_test',
|
||||
)
|
||||
|
||||
|
||||
@tractor_test
|
||||
async def test_reg_then_unreg(
|
||||
reg_addr: tuple,
|
||||
|
|
@ -106,19 +120,6 @@ async def hi():
|
|||
return the_line.format(tractor.current_actor().name)
|
||||
|
||||
|
||||
async def say_hello(
|
||||
other_actor: str,
|
||||
reg_addr: tuple[str, int],
|
||||
):
|
||||
await trio.sleep(1) # wait for other actor to spawn
|
||||
async with tractor.find_actor(
|
||||
other_actor,
|
||||
registry_addrs=[reg_addr],
|
||||
) as portal:
|
||||
assert portal is not None
|
||||
return await portal.run(__name__, 'hi')
|
||||
|
||||
|
||||
async def say_hello_use_wait(
|
||||
other_actor: str,
|
||||
reg_addr: tuple[str, int],
|
||||
|
|
@ -132,18 +133,17 @@ async def say_hello_use_wait(
|
|||
return result
|
||||
|
||||
|
||||
@pytest.mark.timeout(
|
||||
7,
|
||||
method='thread',
|
||||
@tractor_test(
|
||||
timeout=7,
|
||||
)
|
||||
@tractor_test
|
||||
@pytest.mark.parametrize(
|
||||
'func',
|
||||
[say_hello,
|
||||
say_hello_use_wait]
|
||||
'ria_fn',
|
||||
[
|
||||
say_hello_use_wait,
|
||||
]
|
||||
)
|
||||
async def test_trynamic_trio(
|
||||
func: Callable,
|
||||
ria_fn: Callable,
|
||||
start_method: str,
|
||||
reg_addr: tuple,
|
||||
):
|
||||
|
|
@ -156,13 +156,13 @@ async def test_trynamic_trio(
|
|||
print("Alright... Action!")
|
||||
|
||||
donny = await n.run_in_actor(
|
||||
func,
|
||||
ria_fn,
|
||||
other_actor='gretchen',
|
||||
reg_addr=reg_addr,
|
||||
name='donny',
|
||||
)
|
||||
gretchen = await n.run_in_actor(
|
||||
func,
|
||||
ria_fn,
|
||||
other_actor='donny',
|
||||
reg_addr=reg_addr,
|
||||
name='gretchen',
|
||||
|
|
@ -324,6 +324,14 @@ async def spawn_and_check_registry(
|
|||
assert actor.aid.uid in registry
|
||||
|
||||
|
||||
async def with_timeout(
|
||||
main: Callable,
|
||||
timeout: float = 6,
|
||||
):
|
||||
with trio.fail_after(timeout):
|
||||
await main()
|
||||
|
||||
|
||||
@pytest.mark.parametrize('use_signal', [False, True])
|
||||
@pytest.mark.parametrize('with_streaming', [False, True])
|
||||
def test_subactors_unregister_on_cancel(
|
||||
|
|
@ -340,6 +348,7 @@ def test_subactors_unregister_on_cancel(
|
|||
'''
|
||||
with pytest.raises(KeyboardInterrupt):
|
||||
trio.run(
|
||||
# with_timeout,
|
||||
partial(
|
||||
spawn_and_check_registry,
|
||||
reg_addr,
|
||||
|
|
@ -369,6 +378,7 @@ def test_subactors_unregister_on_cancel_remote_daemon(
|
|||
'''
|
||||
with pytest.raises(KeyboardInterrupt):
|
||||
trio.run(
|
||||
with_timeout,
|
||||
partial(
|
||||
spawn_and_check_registry,
|
||||
reg_addr,
|
||||
|
|
@ -547,7 +557,7 @@ async def kill_transport(
|
|||
# 'main_thread_forkserver',
|
||||
reason=(
|
||||
'XXX SUBINT HANGING TEST XXX\n'
|
||||
'See oustanding issue(s)\n'
|
||||
'See outstanding issue(s)\n'
|
||||
# TODO, put issue link!
|
||||
)
|
||||
)
|
||||
|
|
@ -556,6 +566,7 @@ def test_stale_entry_is_deleted(
|
|||
daemon: subprocess.Popen,
|
||||
start_method: str,
|
||||
reg_addr: tuple,
|
||||
# set_fork_aware_capture,
|
||||
):
|
||||
'''
|
||||
Ensure that when a stale entry is detected in the registrar's
|
||||
|
|
@ -598,11 +609,17 @@ def test_stale_entry_is_deleted(
|
|||
|
||||
# XXX, for tracing if this starts being flaky again..
|
||||
#
|
||||
# async def _timeout_main():
|
||||
# with trio.move_on_after(4) as cs:
|
||||
# await main()
|
||||
# if cs.cancel_called:
|
||||
# await tractor.pause()
|
||||
timeout: float = 4
|
||||
async def _timeout_main():
|
||||
with trio.move_on_after(timeout) as cs:
|
||||
await main()
|
||||
|
||||
if (
|
||||
cs.cancel_called
|
||||
and
|
||||
debug_mode
|
||||
):
|
||||
await tractor.pause()
|
||||
|
||||
# TODO, remove once the `[subint]` variant no longer hangs.
|
||||
#
|
||||
|
|
@ -650,8 +667,7 @@ def test_stale_entry_is_deleted(
|
|||
# `pytest`'s stderr capture eats `faulthandler` output otherwise,
|
||||
# so we route `dump_on_hang` to a file.
|
||||
with dump_on_hang(
|
||||
seconds=20,
|
||||
seconds=timeout*2,
|
||||
path=f'/tmp/test_stale_entry_is_deleted_{start_method}.dump',
|
||||
):
|
||||
trio.run(main)
|
||||
# trio.run(_timeout_main)
|
||||
trio.run(_timeout_main)
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
"""
|
||||
'''
|
||||
Streaming via the, now legacy, "async-gen API".
|
||||
|
||||
"""
|
||||
'''
|
||||
import time
|
||||
from functools import partial
|
||||
import platform
|
||||
|
|
@ -12,6 +12,11 @@ import tractor
|
|||
import pytest
|
||||
|
||||
from tractor._testing import tractor_test
|
||||
from tractor._exceptions import ActorTooSlowError
|
||||
|
||||
_non_linux: bool = (
|
||||
_sys := platform.system()
|
||||
) != 'Linux'
|
||||
|
||||
|
||||
def test_must_define_ctx():
|
||||
|
|
@ -68,8 +73,10 @@ async def stream_from_single_subactor(
|
|||
start_method,
|
||||
stream_func,
|
||||
):
|
||||
"""Verify we can spawn a daemon actor and retrieve streamed data.
|
||||
"""
|
||||
'''
|
||||
Verify we can spawn a daemon actor and retrieve streamed data.
|
||||
|
||||
'''
|
||||
# only one per host address, spawns an actor if None
|
||||
|
||||
async with tractor.open_nursery(
|
||||
|
|
@ -242,14 +249,19 @@ async def a_quadruple_example() -> list[int]:
|
|||
start = time.time()
|
||||
# the portal call returns exactly what you'd expect
|
||||
# as if the remote "aggregate" function was called locally
|
||||
result_stream = []
|
||||
result_stream: list[int] = []
|
||||
|
||||
async with portal.open_stream_from(aggregate, seed=seed) as stream:
|
||||
async with portal.open_stream_from(
|
||||
aggregate,
|
||||
seed=seed,
|
||||
) as stream:
|
||||
async for value in stream:
|
||||
result_stream.append(value)
|
||||
|
||||
print(f"STREAM TIME = {time.time() - start}")
|
||||
print(f"STREAM + SPAWN TIME = {time.time() - pre_start}")
|
||||
print(
|
||||
f"STREAM TIME = {time.time() - start}\n"
|
||||
f"STREAM + SPAWN TIME = {time.time() - pre_start}\n"
|
||||
)
|
||||
assert result_stream == list(range(seed))
|
||||
await portal.cancel_actor()
|
||||
return result_stream
|
||||
|
|
@ -258,13 +270,24 @@ async def a_quadruple_example() -> list[int]:
|
|||
async def cancel_after(
|
||||
wait: float,
|
||||
reg_addr: tuple,
|
||||
expect_cancel: bool,
|
||||
) -> list[int]:
|
||||
|
||||
async with tractor.open_root_actor(
|
||||
registry_addrs=[reg_addr],
|
||||
):
|
||||
with trio.move_on_after(wait):
|
||||
return await a_quadruple_example()
|
||||
res: list[int]|None = None
|
||||
with trio.move_on_after(wait) as cs:
|
||||
res: list[int] = await a_quadruple_example()
|
||||
return res
|
||||
|
||||
if (
|
||||
not expect_cancel
|
||||
and
|
||||
cs.cancelled_caught
|
||||
):
|
||||
assert not res
|
||||
raise ActorTooSlowError
|
||||
|
||||
|
||||
@pytest.fixture(scope='module')
|
||||
|
|
@ -272,9 +295,14 @@ def time_quad_ex(
|
|||
reg_addr: tuple,
|
||||
ci_env: bool,
|
||||
spawn_backend: str,
|
||||
is_forking_spawner: bool,
|
||||
tpt_proto: str,
|
||||
):
|
||||
non_linux: bool = (_sys := platform.system()) != 'Linux'
|
||||
if ci_env and non_linux:
|
||||
if (
|
||||
ci_env
|
||||
and
|
||||
_non_linux
|
||||
):
|
||||
pytest.skip(f'Test is too flaky on {_sys!r} in CI')
|
||||
|
||||
if spawn_backend == 'mp':
|
||||
|
|
@ -284,15 +312,36 @@ def time_quad_ex(
|
|||
'''
|
||||
pytest.skip("Test is too flaky on mp in CI")
|
||||
|
||||
timeout = 7 if non_linux else 4
|
||||
start = time.time()
|
||||
results: list[int] = trio.run(
|
||||
cancel_after,
|
||||
timeout,
|
||||
reg_addr,
|
||||
timeout: float = (
|
||||
7 if _non_linux
|
||||
else 4
|
||||
)
|
||||
|
||||
if (
|
||||
is_forking_spawner
|
||||
and
|
||||
tpt_proto in [
|
||||
'uds',
|
||||
]
|
||||
):
|
||||
timeout += 1
|
||||
|
||||
start: float = time.time()
|
||||
results: list[int] = trio.run(partial(
|
||||
cancel_after,
|
||||
wait=timeout,
|
||||
reg_addr=reg_addr,
|
||||
expect_cancel=True,
|
||||
))
|
||||
diff: float = time.time() - start
|
||||
assert results
|
||||
if results is None:
|
||||
raise ActorTooSlowError(
|
||||
f'Streaming example took longer then timeout ??\n'
|
||||
f'timeout={timeout!r}\n'
|
||||
f'diff={diff!r}\n'
|
||||
f'results={results!r}\n'
|
||||
)
|
||||
|
||||
return results, diff
|
||||
|
||||
|
||||
|
|
@ -307,11 +356,10 @@ def test_a_quadruple_example(
|
|||
given past empirical eval of this suite.
|
||||
|
||||
'''
|
||||
non_linux: bool = (_sys := platform.system()) != 'Linux'
|
||||
|
||||
this_fast_on_linux: float = 3
|
||||
this_fast = (
|
||||
6 if non_linux
|
||||
6 if _non_linux
|
||||
else this_fast_on_linux
|
||||
)
|
||||
# ^ XXX NOTE,
|
||||
|
|
@ -348,21 +396,26 @@ def test_not_fast_enough_quad(
|
|||
reg_addr: tuple,
|
||||
time_quad_ex: tuple[list[int], float],
|
||||
cancel_delay: float,
|
||||
|
||||
ci_env: bool,
|
||||
spawn_backend: str,
|
||||
is_forking_spawner: bool,
|
||||
tpt_proto: str,
|
||||
test_log: tractor.log.StackLevelAdapter,
|
||||
):
|
||||
'''
|
||||
Verify we can cancel midway through the quad example and all
|
||||
actors cancel gracefully.
|
||||
Verify we can cancel midway through `a_quadruple_example()`, at
|
||||
various delays, and all subactors cancel gracefully.
|
||||
|
||||
'''
|
||||
results, diff = time_quad_ex
|
||||
delay = max(diff - cancel_delay, 0)
|
||||
results = trio.run(
|
||||
results: list[int] = trio.run(partial(
|
||||
cancel_after,
|
||||
delay,
|
||||
reg_addr,
|
||||
)
|
||||
wait=delay,
|
||||
reg_addr=reg_addr,
|
||||
expect_cancel=True,
|
||||
))
|
||||
system: str = platform.system()
|
||||
if (
|
||||
system in ('Windows', 'Darwin')
|
||||
|
|
@ -373,6 +426,20 @@ def test_not_fast_enough_quad(
|
|||
# so just ignore these
|
||||
print(f'Woa there {system} caught your breath eh?')
|
||||
else:
|
||||
if (
|
||||
results
|
||||
and
|
||||
is_forking_spawner
|
||||
and
|
||||
tpt_proto in [
|
||||
'uds',
|
||||
]
|
||||
):
|
||||
pytest.xfail(
|
||||
f'Spawning backend + tpt-proto is too fast XD\n'
|
||||
f'{spawn_backend!r} + {tpt_proto!r}\n'
|
||||
)
|
||||
|
||||
# should be cancelled mid-streaming
|
||||
assert results is None
|
||||
|
||||
|
|
|
|||
|
|
@ -188,6 +188,86 @@ def _read_cmdline(pid: int) -> str:
|
|||
return ''
|
||||
|
||||
|
||||
def _read_comm(pid: int) -> str:
|
||||
'''
|
||||
Read `/proc/<pid>/comm` — the kernel's per-task name
|
||||
(truncated to ~15 bytes on Linux). Set by
|
||||
`setproctitle.setproctitle()` so this is one of the
|
||||
most reliable identifiers for tractor sub-actors —
|
||||
notably, **survives zombie state** (kernel preserves
|
||||
`comm` even after exit, until reaped) where
|
||||
`cmdline`/`environ` may not.
|
||||
|
||||
'''
|
||||
try:
|
||||
with open(f'/proc/{pid}/comm') as f:
|
||||
return f.read().rstrip('\n')
|
||||
except (
|
||||
FileNotFoundError,
|
||||
PermissionError,
|
||||
ProcessLookupError,
|
||||
):
|
||||
return ''
|
||||
|
||||
|
||||
# Intrinsic markers that identify a tractor sub-actor
|
||||
# regardless of cwd / venv path / launch context. Used by
|
||||
# `_is_tractor_subactor()` below.
|
||||
#
|
||||
# - cmdline `tractor[`: matches the `setproctitle`-set form
|
||||
# (`tractor[<aid.reprol()>]`) — set in
|
||||
# `_actor_child_main` for ALL backends, mutates argv via
|
||||
# libc so visible in `/proc/<pid>/cmdline`.
|
||||
# - cmdline `tractor._child`: matches the legacy
|
||||
# `python -m tractor._child --uid (...)` form. Catches
|
||||
# procs that died before `_actor_child_main` got to call
|
||||
# `setproctitle()` — argv from exec is still kernel-
|
||||
# visible at that point.
|
||||
# - comm `tractor[`: same proctitle-set form, but visible
|
||||
# via `/proc/<pid>/comm` (kernel-truncated to ~15 bytes,
|
||||
# `tractor[doggy:`). Critical for ZOMBIES — kernel
|
||||
# preserves `comm` past task-exit until parent reaps,
|
||||
# while `cmdline` for zombies often reads as empty.
|
||||
_TRACTOR_PROC_CMDLINE_MARKERS: tuple[str, ...] = (
|
||||
'tractor._child',
|
||||
'tractor[',
|
||||
)
|
||||
_TRACTOR_PROC_COMM_MARKER: str = 'tractor['
|
||||
|
||||
|
||||
def _is_tractor_subactor(pid: int) -> bool:
|
||||
'''
|
||||
Detect whether `pid` is a tractor sub-actor process
|
||||
using **intrinsic** signals — cmdline → comm — in
|
||||
priority order.
|
||||
|
||||
No filesystem-state coupling (cwd / venv path) and no
|
||||
env-var dependency: `setproctitle`-mutated argv (set
|
||||
in `_actor_child_main`) covers all live + most-zombie
|
||||
cases; legacy `python -m tractor._child` cmdline
|
||||
catches anything that died before `setproctitle` ran;
|
||||
kernel `comm` covers zombies that survived past
|
||||
`_actor_child_main` long enough to setproctitle.
|
||||
|
||||
'''
|
||||
# 1. cmdline match — catches both `setproctitle`-set
|
||||
# `tractor[<aid>]` (live) AND legacy `python -m
|
||||
# tractor._child` (any) form.
|
||||
cmdline: str = _read_cmdline(pid)
|
||||
if any(m in cmdline for m in _TRACTOR_PROC_CMDLINE_MARKERS):
|
||||
return True
|
||||
|
||||
# 2. Zombie-resilient fallback: kernel-preserved `comm`
|
||||
# (set by setproctitle). Critical for zombies whose
|
||||
# `cmdline` reads as empty post-exit but whose
|
||||
# `comm` survives to `wait()` time.
|
||||
comm: str = _read_comm(pid)
|
||||
if _TRACTOR_PROC_COMM_MARKER in comm:
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def _iter_live_pids() -> list[int]:
|
||||
'''
|
||||
Enumerate currently-alive pids from `/proc`. Returns
|
||||
|
|
@ -291,34 +371,89 @@ def find_runaway_subactors(
|
|||
return runaways
|
||||
|
||||
|
||||
def _read_status_state(pid: int) -> str | None:
|
||||
'''
|
||||
Return the single-letter task state from
|
||||
`/proc/<pid>/status` (`R`/`S`/`D`/`Z`/`T`/`X`/`I`) or
|
||||
`None` if unreadable. `Z` = zombie.
|
||||
|
||||
'''
|
||||
try:
|
||||
with open(f'/proc/{pid}/status') as f:
|
||||
for line in f:
|
||||
if line.startswith('State:'):
|
||||
# `State:\tZ (zombie)` -> 'Z'
|
||||
parts = line.split()
|
||||
if len(parts) >= 2:
|
||||
return parts[1]
|
||||
except (
|
||||
FileNotFoundError,
|
||||
PermissionError,
|
||||
ProcessLookupError,
|
||||
):
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def find_orphans(
|
||||
repo_root: pathlib.Path,
|
||||
repo_root: pathlib.Path|None = None,
|
||||
) -> list[int]:
|
||||
'''
|
||||
PIDs that are:
|
||||
PIDs that are reparented to init (`PPid == 1`) AND
|
||||
are tractor sub-actors per `_is_tractor_subactor()`'s
|
||||
intrinsic checks (env-var → cmdline → comm).
|
||||
|
||||
- reparented to init (`PPid == 1`),
|
||||
- have `cwd == <repo_root>`,
|
||||
- and have a `python` in their cmdline.
|
||||
|
||||
This is the "pytest-died-mid-session" case where the
|
||||
subactor forks got reparented. The cwd filter is the
|
||||
critical bit that keeps us from sweeping up unrelated
|
||||
init-children on the box.
|
||||
The `repo_root` arg is kept for back-compat with
|
||||
callers that previously passed it (the old impl used
|
||||
it to filter by cwd) but is no longer required —
|
||||
tractor sub-actor identity is intrinsic to the proc,
|
||||
not its launch context.
|
||||
|
||||
'''
|
||||
repo: str = str(repo_root)
|
||||
# `repo_root` kept in signature for back-compat; today
|
||||
# the intrinsic env/cmdline/comm signals identify a
|
||||
# tractor sub-actor without coincidence-of-cwd
|
||||
# matching. Suppressed-arg stays a no-op so existing
|
||||
# callers don't have to change.
|
||||
_ = repo_root # noqa
|
||||
hits: list[int] = []
|
||||
for pid in _iter_live_pids():
|
||||
if _read_status_ppid(pid) != 1:
|
||||
continue
|
||||
cwd: str | None = _read_cwd(pid)
|
||||
if cwd != repo:
|
||||
if _is_tractor_subactor(pid):
|
||||
hits.append(pid)
|
||||
return hits
|
||||
|
||||
|
||||
def find_zombies(
|
||||
parent_pid: int|None = None,
|
||||
) -> list[int]:
|
||||
'''
|
||||
PIDs in zombie state (`/proc/<pid>/status: State: Z`)
|
||||
that are tractor sub-actors per
|
||||
`_is_tractor_subactor()`.
|
||||
|
||||
When `parent_pid` is given, restricts to descendants
|
||||
of that pid (typical for pytest session-end fixture
|
||||
use). When `None`, scans all zombies on the box.
|
||||
|
||||
Detection for zombies relies primarily on
|
||||
`/proc/<pid>/comm` (kernel-preserved past zombie
|
||||
state, set by `setproctitle`) since
|
||||
`cmdline`/`environ` are usually empty post-exit.
|
||||
|
||||
'''
|
||||
hits: list[int] = []
|
||||
for pid in _iter_live_pids():
|
||||
if _read_status_state(pid) != 'Z':
|
||||
continue
|
||||
cmd: str = _read_cmdline(pid)
|
||||
if 'python' not in cmd:
|
||||
if (
|
||||
parent_pid is not None
|
||||
and _read_status_ppid(pid) != parent_pid
|
||||
):
|
||||
continue
|
||||
hits.append(pid)
|
||||
if _is_tractor_subactor(pid):
|
||||
hits.append(pid)
|
||||
return hits
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -640,7 +640,7 @@ def pytest_generate_tests(
|
|||
metafunc.parametrize(
|
||||
"start_method",
|
||||
[spawn_backend],
|
||||
scope='module',
|
||||
scope='session',
|
||||
ids=lambda item: f'start_method={spawn_backend}',
|
||||
)
|
||||
|
||||
|
|
@ -662,7 +662,7 @@ def _is_forking_spawner(
|
|||
]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def is_forking_spawner(
|
||||
start_method: str,
|
||||
) -> bool:
|
||||
|
|
|
|||
|
|
@ -83,11 +83,15 @@ SpawnMethodKey = Literal[
|
|||
# runtime, exactly like `trio_proc` but via fork instead
|
||||
# of subproc-exec. See `tractor.spawn._main_thread_forkserver`.
|
||||
'main_thread_forkserver',
|
||||
# RESERVED for the future variant-2 subint-isolated-child
|
||||
# runtime — gated on jcrist/msgspec#1026 + PEP 684. Today
|
||||
# this key aliases to `main_thread_forkserver_proc`; once
|
||||
# the upstream unblocks land it'll dispatch to the
|
||||
# subint-hosted-trio impl. See
|
||||
# Variant-2: same fork machinery as `main_thread_forkserver`
|
||||
# but the child enters a sub-interpreter to host its
|
||||
# `trio.run()`. Gated on jcrist/msgspec#1026 unblocking
|
||||
# PEP 684 isolated-mode subints upstream — until then
|
||||
# `subint_forkserver_proc` is a clean `NotImplementedError`
|
||||
# stub pointing at variant-1 (`main_thread_forkserver`) +
|
||||
# the upstream blocker. The key is reserved here (not just
|
||||
# aliased to variant-1) so once upstream lands the impl can
|
||||
# flip in-place without API churn. See
|
||||
# `tractor.spawn._subint_forkserver`.
|
||||
'subint_forkserver',
|
||||
]
|
||||
|
|
|
|||
|
|
@ -18,15 +18,18 @@
|
|||
Variant-2 (future) "subint forkserver" placeholder — reserved
|
||||
for the eventual subint-isolated-child runtime variant.
|
||||
|
||||
> **Status:** placeholder. Today
|
||||
> `--spawn-backend=subint_forkserver` aliases to
|
||||
> `main_thread_forkserver_proc` (variant 1, see
|
||||
> `tractor.spawn._main_thread_forkserver`). A follow-up commit
|
||||
> in this PR series flips the alias to a `NotImplementedError`
|
||||
> stub reserving the `'subint_forkserver'` key for the literal
|
||||
> subint-hosted-child variant once
|
||||
> [jcrist/msgspec#1026](https://github.com/jcrist/msgspec/issues/1026)
|
||||
> unblocks PEP 684 isolated-mode subints upstream.
|
||||
> **Status:** reserved key, stub impl. Today
|
||||
> `--spawn-backend=subint_forkserver` raises a clean
|
||||
> `NotImplementedError` from `subint_forkserver_proc()`
|
||||
> below, pointing at variant-1
|
||||
> (`--spawn-backend=main_thread_forkserver`, see
|
||||
> `tractor.spawn._main_thread_forkserver`) and the upstream
|
||||
> blocker
|
||||
> ([jcrist/msgspec#1026](https://github.com/jcrist/msgspec/issues/1026)).
|
||||
> The key is reserved here (not aliased to variant-1) so the
|
||||
> literal subint-hosted-child impl can flip in-place once
|
||||
> msgspec#1026 unblocks PEP 684 isolated-mode subints
|
||||
> upstream — no API churn at the call site.
|
||||
|
||||
Future arch — what subints would buy us
|
||||
---------------------------------------
|
||||
|
|
|
|||
|
|
@ -6,13 +6,16 @@ prefix-completion treats them as a sub-cmd group — type
|
|||
`acli.<TAB>` to see the full set.
|
||||
|
||||
Provides:
|
||||
- `acli.pytree <pid|pgrep-pat>` psutil-backed proc tree,
|
||||
- `acli.ptree <pid|pgrep-pat>` psutil-backed proc tree,
|
||||
live + zombies split.
|
||||
- `acli.hung_dump <pid|pat> [...]` kernel `wchan`/`stack` +
|
||||
`py-spy dump` (incl `--locals`)
|
||||
for each pid in tree.
|
||||
- `acli.bindspace_scan [<dir>]` find orphaned tractor UDS
|
||||
- `acli.bindspace_scan [<name>|<dir>]` find orphaned tractor UDS
|
||||
sock files (no live owner pid).
|
||||
bare name -> `$XDG_RUNTIME_DIR/<name>`
|
||||
(e.g. `piker`, `tractor`);
|
||||
path -> use as-is.
|
||||
default: `$XDG_RUNTIME_DIR/tractor`.
|
||||
- `acli.reap [opts]` SC-polite zombie-subactor
|
||||
reaper + optional `/dev/shm/`
|
||||
|
|
@ -28,7 +31,7 @@ Or source directly:
|
|||
Pipe-to-paste idiom (xonsh):
|
||||
hung-dump pytest |t /tmp/hung.log
|
||||
|
||||
Requires `psutil` for full functionality (`pytree` and the
|
||||
Requires `psutil` for full functionality (`ptree` and the
|
||||
`hung-dump` tree-walk). Falls back to `pgrep -P` recursion
|
||||
if missing.
|
||||
"""
|
||||
|
|
@ -44,7 +47,7 @@ except ImportError:
|
|||
psutil = None
|
||||
print(
|
||||
'[tractor-diag] `psutil` missing — '
|
||||
'acli.pytree disabled, acli.hung_dump uses pgrep fallback. '
|
||||
'acli.ptree disabled, acli.hung_dump uses pgrep fallback. '
|
||||
'`uv pip install psutil` for full functionality.'
|
||||
)
|
||||
|
||||
|
|
@ -84,7 +87,7 @@ def _walk_tree_with_depth(pid: int):
|
|||
'''
|
||||
Yield `(proc, depth)` pairs walking `pid`'s tree. `depth==0`
|
||||
is the root; `depth==1` are direct children, etc. Used by
|
||||
`pytree` to render parent/child relationships visually.
|
||||
`ptree` to render parent/child relationships visually.
|
||||
'''
|
||||
try:
|
||||
root = psutil.Process(pid)
|
||||
|
|
@ -107,6 +110,56 @@ def _walk_tree_with_depth(pid: int):
|
|||
stack.append((k, d + 1))
|
||||
|
||||
|
||||
def _which_cgroup_slice(pid: int) -> str|None:
|
||||
'''
|
||||
Return which top-level systemd cgroup slice `pid` is
|
||||
rooted in, or `None` if it's not in either:
|
||||
|
||||
- `'system'`: under `/system.slice/...` — typically
|
||||
`.service` units (long-lived daemons explicitly
|
||||
enabled via `systemctl enable`, e.g.
|
||||
`auto-cpufreq.service`, `dbus.service`,
|
||||
`systemd-journald.service`).
|
||||
|
||||
- `'user'`: under `/user.slice/user-<uid>.slice/...`
|
||||
— typically `.scope` units that systemd auto-wraps
|
||||
around desktop-launched apps + login-session
|
||||
procs (e.g. `app-firefox-<id>.scope`,
|
||||
`session-<id>.scope`).
|
||||
|
||||
- `None`: NOT in either slice — pid 1 is NOT
|
||||
managing this proc via cgroup. Combined with
|
||||
`ppid==1`, this is the genuine "leaked / parent
|
||||
died" orphan signal.
|
||||
|
||||
Both slice categories are by-design `ppid==1` (pid 1
|
||||
is actively managing them) and should NOT be flagged
|
||||
as concerning orphans, but distinguishing them is
|
||||
useful: `system.slice` is "real services on this
|
||||
box", `user.slice` is "stuff in your login session".
|
||||
|
||||
Returns `None` on any read error (proc gone, perm
|
||||
denied, non-Linux, etc.) — callers should treat that
|
||||
as "unknown, classify as plain orphan".
|
||||
|
||||
'''
|
||||
try:
|
||||
with open(f'/proc/{pid}/cgroup') as f:
|
||||
cg: str = f.read()
|
||||
except (
|
||||
FileNotFoundError,
|
||||
PermissionError,
|
||||
ProcessLookupError,
|
||||
OSError,
|
||||
):
|
||||
return None
|
||||
if '/system.slice/' in cg:
|
||||
return 'system'
|
||||
if '/user.slice/' in cg:
|
||||
return 'user'
|
||||
return None
|
||||
|
||||
|
||||
def _walk_tree_pgrep(pid: int) -> list:
|
||||
'''psutil-less fallback — recursive `pgrep -P`.'''
|
||||
out = [pid]
|
||||
|
|
@ -157,15 +210,15 @@ def _ensure_sudo_cached() -> bool:
|
|||
return rc == 0
|
||||
|
||||
|
||||
# --- pytree ---------------------------------------------------
|
||||
# --- ptree ---------------------------------------------------
|
||||
|
||||
def _pytree(args):
|
||||
def _ptree(args):
|
||||
'''
|
||||
psutil-backed proc tree; per-proc classification into
|
||||
severity-ordered buckets so leaked / defunct procs
|
||||
don't hide in the noise of normal `live` rows.
|
||||
|
||||
usage: acli.pytree [--tree|-t] <pid|pgrep-pattern> [...]
|
||||
usage: acli.ptree [--tree|-t] <pid|pgrep-pattern> [...]
|
||||
|
||||
classification (per-proc, not per-tree):
|
||||
|
||||
|
|
@ -207,10 +260,10 @@ def _pytree(args):
|
|||
pos_args.append(a)
|
||||
|
||||
if not pos_args:
|
||||
print('usage: acli.pytree [--tree|-t] <pid|pgrep-pattern> [...]')
|
||||
print('usage: acli.ptree [--tree|-t] <pid|pgrep-pattern> [...]')
|
||||
return 1
|
||||
if psutil is None:
|
||||
print('pytree requires psutil; install via `uv pip install psutil`')
|
||||
print('ptree requires psutil; install via `uv pip install psutil`')
|
||||
return 1
|
||||
|
||||
roots: list = []
|
||||
|
|
@ -233,6 +286,15 @@ def _pytree(args):
|
|||
walk_order: list = [] # [(proc, depth)] preserved walk order
|
||||
live: list = [] # [(proc, depth)]
|
||||
orphans: list = []
|
||||
# `ppid==1` AND rooted in `/system.slice/` cgroup —
|
||||
# real systemd-managed services (e.g. `auto-cpufreq`,
|
||||
# `NetworkManager`).
|
||||
system_slice: list = []
|
||||
# `ppid==1` AND rooted in `/user.slice/.../*.scope` —
|
||||
# desktop-launched apps wrapped by systemd-user in
|
||||
# transient `.scope` units (e.g. Firefox, browsers,
|
||||
# editors started from a launcher).
|
||||
user_slice: list = []
|
||||
zombies: list = []
|
||||
gone: list = []
|
||||
|
||||
|
|
@ -252,20 +314,43 @@ def _pytree(args):
|
|||
gone.append(p.pid)
|
||||
continue
|
||||
entry = (p, depth)
|
||||
# severity order: zombie > orphan > live.
|
||||
# severity order:
|
||||
# zombie > orphan > system-slice > user-slice > live
|
||||
# `ppid==1` splits into:
|
||||
# - `system-slice` (rooted in `/system.slice/` —
|
||||
# real services, by-design `ppid==1`)
|
||||
# - `user-slice` (rooted in
|
||||
# `/user.slice/.../*.scope` — desktop apps
|
||||
# wrapped by systemd-user, by-design `ppid==1`)
|
||||
# - `orphans` (everything else with `ppid==1` —
|
||||
# genuinely concerning).
|
||||
if status in defunct_statuses:
|
||||
zombies.append(entry)
|
||||
pid_to_bucket[p.pid] = 'zombies'
|
||||
elif ppid == 1:
|
||||
orphans.append(entry)
|
||||
pid_to_bucket[p.pid] = 'orphans'
|
||||
slice_kind: str|None = _which_cgroup_slice(p.pid)
|
||||
if slice_kind == 'system':
|
||||
system_slice.append(entry)
|
||||
pid_to_bucket[p.pid] = 'system-slice'
|
||||
elif slice_kind == 'user':
|
||||
user_slice.append(entry)
|
||||
pid_to_bucket[p.pid] = 'user-slice'
|
||||
else:
|
||||
orphans.append(entry)
|
||||
pid_to_bucket[p.pid] = 'orphans'
|
||||
else:
|
||||
live.append(entry)
|
||||
pid_to_bucket[p.pid] = 'live'
|
||||
walk_order.append(entry)
|
||||
|
||||
total: int = len(live) + len(orphans) + len(zombies)
|
||||
print(f'# pytree: {total} procs across roots {roots}')
|
||||
total: int = (
|
||||
len(live)
|
||||
+ len(orphans)
|
||||
+ len(system_slice)
|
||||
+ len(user_slice)
|
||||
+ len(zombies)
|
||||
)
|
||||
print(f'# ptree: {total} procs across roots {roots}')
|
||||
|
||||
hdr = ' ' + 'PID'.rjust(7) + ' ' + 'PPID'.rjust(7) + ' '
|
||||
hdr += 'STATUS'.ljust(10) + ' CMD'
|
||||
|
|
@ -370,9 +455,24 @@ def _pytree(args):
|
|||
)
|
||||
_section(
|
||||
'orphans', orphans,
|
||||
'`ppid==1`, reparented to init (leaked / parent gone)',
|
||||
'`ppid==1`, NOT in a `system.slice`/`user.slice` cgroup '
|
||||
'(likely leaked / parent gone)',
|
||||
bucket='orphans',
|
||||
)
|
||||
_section(
|
||||
'system-slice', system_slice,
|
||||
'`ppid==1`, rooted under `/system.slice/` '
|
||||
'(real systemd-managed service — daemon, login '
|
||||
'session manager, etc; not a leak)',
|
||||
bucket='system-slice',
|
||||
)
|
||||
_section(
|
||||
'user-slice', user_slice,
|
||||
'`ppid==1`, rooted under `/user.slice/.../*.scope` '
|
||||
'(desktop-launched app wrapped by systemd-user — '
|
||||
'browser, editor, etc; not a leak)',
|
||||
bucket='user-slice',
|
||||
)
|
||||
_section('live', live, bucket='live')
|
||||
|
||||
if gone:
|
||||
|
|
@ -473,17 +573,35 @@ def _bindspace_scan(args):
|
|||
(those whose embedded `<pid>` no longer corresponds to
|
||||
a live process).
|
||||
|
||||
usage: acli.bindspace_scan [<dir>]
|
||||
default: `$XDG_RUNTIME_DIR/tractor`
|
||||
(or `/run/user/<uid>/tractor`)
|
||||
usage: acli.bindspace_scan [<name>|<dir>]
|
||||
|
||||
- no arg -> `$XDG_RUNTIME_DIR/tractor`
|
||||
(or `/run/user/<uid>/tractor`)
|
||||
- bare `<name>` -> `$XDG_RUNTIME_DIR/<name>`,
|
||||
for projects like `piker` that bind
|
||||
their own sibling sub-dir alongside
|
||||
tractor's default
|
||||
- path (abs or
|
||||
containing `/`) -> use as-is
|
||||
|
||||
'''
|
||||
runtime: str = os.environ.get(
|
||||
'XDG_RUNTIME_DIR',
|
||||
f'/run/user/{os.getuid()}',
|
||||
)
|
||||
if args:
|
||||
bs_dir = Path(args[0])
|
||||
arg: str = args[0]
|
||||
if (
|
||||
arg.startswith('/')
|
||||
or
|
||||
'/' in arg
|
||||
):
|
||||
bs_dir = Path(arg)
|
||||
else:
|
||||
# bare name -> `$XDG_RUNTIME_DIR/<name>` so
|
||||
# callers can say `acli.bindspace_scan piker`
|
||||
bs_dir = Path(runtime) / arg
|
||||
else:
|
||||
runtime = os.environ.get(
|
||||
'XDG_RUNTIME_DIR',
|
||||
f'/run/user/{os.getuid()}',
|
||||
)
|
||||
bs_dir = Path(runtime) / 'tractor'
|
||||
|
||||
if not bs_dir.exists():
|
||||
|
|
@ -493,10 +611,29 @@ def _bindspace_scan(args):
|
|||
socks = sorted(bs_dir.glob('*.sock'))
|
||||
print(f'## bindspace {bs_dir} ({len(socks)} sock file(s))')
|
||||
|
||||
live: list = []
|
||||
orphans: list = []
|
||||
live_active: list = [] # PID alive AND ppid != 1
|
||||
live_orphaned: list = [] # PID alive AND ppid == 1 (init-adopted)
|
||||
dead_orphans: list = [] # PID gone, sock stale
|
||||
bogus: list = []
|
||||
|
||||
def _ppid(pid: int) -> int | None:
|
||||
'''
|
||||
Read `/proc/<pid>/stat` -> ppid. Returns None on race
|
||||
(proc died between `os.kill(pid, 0)` succeeding and this
|
||||
read), permission errors, or non-linux.
|
||||
'''
|
||||
try:
|
||||
with open(f'/proc/{pid}/stat') as f:
|
||||
# field [3] of `man 5 proc` `/proc/<pid>/stat`
|
||||
# NB: field [1] is `(comm)` which can contain
|
||||
# spaces and parens — split from the *last*
|
||||
# `)` to avoid that bullshit.
|
||||
stat: str = f.read()
|
||||
after_comm: str = stat.rsplit(')', 1)[1].strip()
|
||||
return int(after_comm.split()[1]) # state(0) ppid(1)
|
||||
except (FileNotFoundError, PermissionError, ProcessLookupError, OSError):
|
||||
return None
|
||||
|
||||
for s in socks:
|
||||
m = _UDS_SOCK_RE.match(s.name)
|
||||
if not m:
|
||||
|
|
@ -506,39 +643,94 @@ def _bindspace_scan(args):
|
|||
name = m['name']
|
||||
try:
|
||||
os.kill(pid, 0)
|
||||
live.append((s, pid, name))
|
||||
except ProcessLookupError:
|
||||
orphans.append((s, pid, name))
|
||||
dead_orphans.append((s, pid, name))
|
||||
continue
|
||||
except PermissionError:
|
||||
# exists but owned by another user
|
||||
live.append((s, pid, name))
|
||||
# exists but owned by another user — treat as live-active
|
||||
# (we can't read its /proc/<pid>/stat to check ppid)
|
||||
live_active.append((s, pid, name, None))
|
||||
continue
|
||||
|
||||
print(f'\n## live ({len(live)})')
|
||||
if not live:
|
||||
# PID is alive in our euid view; classify by ppid
|
||||
ppid: int | None = _ppid(pid)
|
||||
if ppid == 1:
|
||||
# adopted by init -> the original parent reaped
|
||||
# without cleaning up this sub. Same class as
|
||||
# what `acli.reap` detects.
|
||||
live_orphaned.append((s, pid, name, ppid))
|
||||
else:
|
||||
live_active.append((s, pid, name, ppid))
|
||||
|
||||
print(f'\n## live-active ({len(live_active)}) — PID alive, parent still own it')
|
||||
if not live_active:
|
||||
print(' (none)')
|
||||
for s, pid, name in live:
|
||||
for s, pid, name, ppid in live_active:
|
||||
row = ' ' + str(pid).rjust(7)
|
||||
row += ' ' + name.ljust(32)
|
||||
row += ' ' + s.name
|
||||
if ppid is not None:
|
||||
row += f' (ppid={ppid})'
|
||||
print(row)
|
||||
|
||||
print(f'\n## orphaned ({len(orphans)})')
|
||||
if not orphans:
|
||||
print(
|
||||
f'\n## orphaned-alive ({len(live_orphaned)}) '
|
||||
f'— PID alive but `ppid==1`, parent reaped; '
|
||||
f'`acli.reap` candidate'
|
||||
)
|
||||
if not live_orphaned:
|
||||
print(' (none)')
|
||||
for s, pid, name in orphans:
|
||||
for s, pid, name, ppid in live_orphaned:
|
||||
row = ' ' + str(pid).rjust(7)
|
||||
row += ' ' + name.ljust(32)
|
||||
row += ' ' + s.name + ' (adopted by init)'
|
||||
print(row)
|
||||
|
||||
print(f'\n## orphaned-dead ({len(dead_orphans)}) — PID gone, sock stale')
|
||||
if not dead_orphans:
|
||||
print(' (none)')
|
||||
for s, pid, name in dead_orphans:
|
||||
row = ' ' + str(pid).rjust(7)
|
||||
row += ' ' + name.ljust(32)
|
||||
row += ' ' + s.name + ' (no live proc)'
|
||||
print(row)
|
||||
|
||||
if bogus:
|
||||
print(f'\n## unparseable ({len(bogus)})')
|
||||
print(
|
||||
f'\n## non-tractor ({len(bogus)}) '
|
||||
f'— filename lacks `@<pid>` suffix, '
|
||||
f'cannot determine liveness intrinsically'
|
||||
)
|
||||
for s in bogus:
|
||||
print(f' {s.name}')
|
||||
# show a copy-pastable `ss` cmd per sock so the
|
||||
# caller can resolve listener-PID externally
|
||||
# (e.g. for piker's `chart.sock` / `pikerd.sock`
|
||||
# style flat names). `ss -lpx 'src = <path>'`
|
||||
# prints `users:(("<proc>",pid=<N>,fd=<M>))` for
|
||||
# the listening side; empty output -> nobody's
|
||||
# listening -> safe to unlink.
|
||||
print(
|
||||
'\nto check liveness manually '
|
||||
'(needs `iproute2`/`ss`):'
|
||||
)
|
||||
for s in bogus:
|
||||
print(f" ss -lpx 'src = {s}'")
|
||||
|
||||
if orphans:
|
||||
unlink_cmd = ' '.join(str(o[0]) for o in orphans)
|
||||
print(f'\nto unlink orphans:\n rm {unlink_cmd}')
|
||||
if dead_orphans or live_orphaned:
|
||||
print(
|
||||
'\nto sweep BOTH orphaned-alive subs (graceful '
|
||||
'SIGINT -> SIGKILL) AND dead-orphan socks in one shot:'
|
||||
)
|
||||
print(' acli.reap --uds')
|
||||
|
||||
if dead_orphans:
|
||||
unlink_cmd = ' '.join(str(o[0]) for o in dead_orphans)
|
||||
print(
|
||||
'\n(or to unlink dead-orphan socks manually, '
|
||||
"skipping `acli.reap`'s graceful-cancel ladder:)"
|
||||
)
|
||||
print(f' rm {unlink_cmd}')
|
||||
|
||||
|
||||
# --- acli.reap ------------------------------------------------
|
||||
|
|
@ -633,25 +825,12 @@ def _tractor_reap(args):
|
|||
ns.uds_only
|
||||
)
|
||||
|
||||
# repo-root resolution: `git rev-parse --show-toplevel`
|
||||
# first, falling back to the xontrib file's parent of
|
||||
# parent. mirrors `scripts/tractor-reap._repo_root()`.
|
||||
try:
|
||||
repo_str: str = sp.check_output(
|
||||
['git', 'rev-parse', '--show-toplevel'],
|
||||
stderr=sp.DEVNULL,
|
||||
text=True,
|
||||
).strip()
|
||||
repo: Path = Path(repo_str)
|
||||
except (sp.CalledProcessError, FileNotFoundError):
|
||||
repo: Path = Path(__file__).resolve().parent.parent
|
||||
|
||||
# lazy-import the reap helpers since the package may not
|
||||
# have been on `sys.path` at xontrib-load time (e.g. the
|
||||
# contrib was sourced before activating the venv).
|
||||
import sys
|
||||
if str(repo) not in sys.path:
|
||||
sys.path.insert(0, str(repo))
|
||||
# `tractor` is assumed to be importable in the xonsh env
|
||||
# this xontrib was sourced into (a venv with the package
|
||||
# installed). The standalone `scripts/tractor-reap` does
|
||||
# `git rev-parse --show-toplevel` + `sys.path.insert` for
|
||||
# cold-shell usability — that overhead is unnecessary
|
||||
# here since we're already inside the project's venv.
|
||||
from tractor._testing._reap import (
|
||||
find_descendants,
|
||||
find_orphans,
|
||||
|
|
@ -670,8 +849,12 @@ def _tractor_reap(args):
|
|||
pids: list = find_descendants(ns.parent)
|
||||
mode: str = f'descendants of PPid={ns.parent}'
|
||||
else:
|
||||
pids = find_orphans(repo)
|
||||
mode = f'orphans (PPid=1, cwd={repo})'
|
||||
pids = find_orphans()
|
||||
mode = (
|
||||
'orphans (PPid==1, intrinsic '
|
||||
'cmdline/comm match — `tractor[…]` or '
|
||||
'`tractor._child`)'
|
||||
)
|
||||
|
||||
if not pids:
|
||||
print(f'[acli.reap] no {mode} to reap')
|
||||
|
|
@ -730,7 +913,7 @@ def _tractor_reap(args):
|
|||
# `acli.<TAB>` and the full set is suggested. no parent
|
||||
# `acli` cmd exists — the dot is purely a naming convention.
|
||||
_TCLI_ALIASES: dict = {
|
||||
'acli.pytree': _pytree,
|
||||
'acli.ptree': _ptree,
|
||||
'acli.hung_dump': _hung_dump,
|
||||
'acli.bindspace_scan': _bindspace_scan,
|
||||
'acli.reap': _tractor_reap,
|
||||
|
|
|
|||
Loading…
Reference in New Issue