Add prompt-io log for `run_ctx` teardown analysis

Documents the diagnostic session tracing why
per-`ctx_key` locking alone doesn't close the
`_Cache.run_ctx` teardown race — the lock is
popped in the exiting caller's task, but the
resource cleanup runs in the `run_ctx` task
inside `service_tn`.

(this patch was generated in part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
maybe_open_ctx_locking
Gud Boi 2026-04-06 18:37:14 -04:00
parent 03c42e1333
commit da7e749b09
2 changed files with 83 additions and 0 deletions

@@ -0,0 +1,52 @@
---
model: claude-opus-4-6
service: claude
session: (ad-hoc, not tracked via conf.toml)
timestamp: 2026-04-06T22:26:48Z
git_ref: f086222
scope: code
substantive: true
raw_file: 20260406T222648Z_f086222_prompt_io.raw.md
---
## Prompt
User applied the per-key locking patch (`f086222d`),
removed the `@pytest.mark.xfail` from
`test_moc_reentry_during_teardown`, and found the
test STILL fails — now with `RuntimeError` instead
of `AssertionError`. Asked to diagnose why the fix
doesn't actually resolve this race.
## Response summary
Traced the exact scheduling sequence across three
concurrent tasks to show that the per-`ctx_key`
locking change doesn't close the fundamental gap
(see the sketch after this list):
- `maybe_open_context`'s finally pops the lock and
sets `no_more_users` in the **exiting caller's
task**
- But the actual resource cleanup (`values.pop`,
acm `__aexit__`, `resources.pop`) runs in the
**`run_ctx` task** inside `service_tn`
- A re-entering task creates a fresh lock (old one
popped), acquires immediately, and hits stale
`resources[ctx_key]` while `run_ctx` is still in
`__aexit__`
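A minimal sketch of that shape (hypothetical
names, heavily simplified; not the real `_Cache`
code): the exiting caller's finally only wakes
the teardown task, it never waits for that
teardown to actually finish.

```python
import trio


async def run_ctx(no_more_users: trio.Event) -> None:
    # stand-in for `_Cache.run_ctx`: runs inside the service
    # nursery, *not* in the caller's task
    await no_more_users.wait()
    print('run_ctx: the real values.pop / acm __aexit__ / resources.pop run here')


async def caller(service_tn: trio.Nursery) -> None:
    no_more_users = trio.Event()
    service_tn.start_soon(run_ctx, no_more_users)
    try:
        print('caller: using the cached value')
    finally:
        # exiting caller's task: only wakes run_ctx; in the real code
        # the per-key lock is popped right here, before teardown runs
        no_more_users.set()
        print('caller: done (lock would already be gone)')


async def main() -> None:
    async with trio.open_nursery() as service_tn:
        await caller(service_tn)
        print('main: caller returned, run_ctx teardown still has not run')


trio.run(main)
```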
Identified three possible fix strategies:
1. Let `run_ctx` own lock release/pop after
`resources.pop()`
2. Add a "teardown complete" event that new callers
wait on
3. Move `resources.pop()` into `run_ctx`'s inner
finally (before acm `__aexit__`)
## Files changed
No code changes — analysis only.
## Human edits
N/A — diagnostic output, no code generated.

@@ -0,0 +1,31 @@
---
model: claude-opus-4-6
service: claude
timestamp: 2026-04-06T22:26:48Z
git_ref: f086222
---
Here's the problem. The per-key locking doesn't fix this race because the **lock lifetime is decoupled from `run_ctx`'s lifetime**. Trace through what happens:
**`maybe_open_context` finally** (lines 468-495):
1. `users[ctx_key] -= 1` → 0
2. `no_more_users.set()` → wakes `run_ctx` (but it doesn't run yet, just becomes ready)
3. `_Cache.locks.pop(ctx_key)` → **lock is gone**
4. `maybe_open_context` exits
**`run_ctx`** now gets scheduled (lines 237-241):
5. Inner finally: `values.pop(ctx_key)` → values gone
6. `mng.__aexit__()` runs → `in_aexit.set()`, `await trio.sleep(10)` → **CHECKPOINT**
**`reenter_during_teardown`** wakes:
7. Enters `maybe_open_context` → no lock exists for `ctx_key` → creates **fresh** lock, acquires immediately
8. `values[ctx_key]` → `KeyError`
9. `resources.get(ctx_key)` → **still exists!** (step 6 hasn't reached `resources.pop()` in `run_ctx`'s outer finally yet)
10. `raise RuntimeError('Caching resources ALREADY exist?!')`
The core issue: `no_more_users.set()` (step 2) and `locks.pop()` (step 3) happen in the **exiting caller's task**, but the actual resource cleanup (`values.pop`, acm `__aexit__`, `resources.pop`) happens in the **`run_ctx` task** inside `service_tn`. There's no synchronization between "lock removed, new callers welcome" and "run_ctx has actually finished tearing down". The per-`ctx_key` lock change just made the lock more granular — it didn't close this gap.
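A runnable sketch of that interleaving (hypothetical names, heavily simplified stand-ins for the real `_Cache` machinery; `slow_teardown_acm` plays the role of the cached acm with a checkpoint in its teardown):

```python
import trio
from contextlib import asynccontextmanager

# hypothetical, flattened stand-ins for the real per-key caching state
locks: dict[str, trio.Lock] = {}
users: dict[str, int] = {}
values: dict[str, str] = {}
resources: dict[str, trio.Event] = {}   # key -> no_more_users event

in_aexit = trio.Event()


@asynccontextmanager
async def slow_teardown_acm():
    yield 'cached-value'
    # teardown contains a checkpoint, like the acm in the failing test
    in_aexit.set()
    await trio.sleep(0.1)


async def run_ctx(
    key: str,
    no_more_users: trio.Event,
    task_status=trio.TASK_STATUS_IGNORED,
) -> None:
    # runs inside the service nursery, *not* in the caller's task
    async with slow_teardown_acm() as value:
        values[key] = value
        task_status.started(value)
        await no_more_users.wait()
        values.pop(key)          # step 5
    # the acm __aexit__ ran above (step 6); only now does teardown finish
    resources.pop(key)           # too late for a re-entering caller


@asynccontextmanager
async def maybe_open_context(key: str, service_tn: trio.Nursery):
    lock = locks.setdefault(key, trio.Lock())
    async with lock:
        if key in values:
            users[key] += 1
            value = values[key]
        elif key in resources:
            # a prior run_ctx is still tearing down (steps 9-10)
            raise RuntimeError('Caching resources ALREADY exist?!')
        else:
            no_more_users = trio.Event()
            resources[key] = no_more_users
            users[key] = 1
            value = await service_tn.start(run_ctx, key, no_more_users)
    try:
        yield value
    finally:
        # all of this runs in the *exiting caller's* task
        users[key] -= 1
        if not users[key]:
            resources[key].set()  # step 2: wake run_ctx
            locks.pop(key)        # step 3: lock gone, new callers welcome


async def main() -> None:
    async with trio.open_nursery() as service_tn:
        async with maybe_open_context('ctx', service_tn):
            pass                  # first (and only) user enters then exits

        # lock already popped, but run_ctx is still inside __aexit__
        await in_aexit.wait()
        try:
            async with maybe_open_context('ctx', service_tn):
                pass
        except RuntimeError as exc:
            print(f'reproduced: {exc!r}')


trio.run(main)
```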
The fix needs to ensure a new caller can't acquire a lock for `ctx_key` until `run_ctx` has fully exited (including `resources.pop()`). Either:
- **don't pop the lock in `maybe_open_context`** — let `run_ctx` release/pop it after `resources.pop()`
- **wait on a "teardown complete" event** in the cache-miss path when stale `resources` exist
- **pop `resources` before the acm `__aexit__`** (move `resources.pop()` into the inner finally alongside `values.pop()`)
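As a hedged illustration of the second option, here are drop-in replacements for `run_ctx` and `maybe_open_context` from the sketch above (still simplified, hypothetical code, not a patch against the real module): the cache-miss path waits on a per-key `teardown_done` event whenever stale `resources` are found, and `run_ctx` sets that event only after `resources.pop()`.

```python
# per-key "teardown done" signal, set only after resources.pop()
teardown_done: dict[str, trio.Event] = {}


async def run_ctx(
    key: str,
    no_more_users: trio.Event,
    task_status=trio.TASK_STATUS_IGNORED,
) -> None:
    try:
        async with slow_teardown_acm() as value:
            values[key] = value
            task_status.started(value)
            await no_more_users.wait()
            values.pop(key)
    finally:
        resources.pop(key)
        # only now is it safe for a new caller to treat this key as a miss
        teardown_done.pop(key).set()


@asynccontextmanager
async def maybe_open_context(key: str, service_tn: trio.Nursery):
    lock = locks.setdefault(key, trio.Lock())
    async with lock:
        if key in values:
            users[key] += 1
            value = values[key]
        else:
            if key in resources:
                # a prior run_ctx is still tearing down: wait it out
                await teardown_done[key].wait()
            no_more_users = trio.Event()
            resources[key] = no_more_users
            teardown_done[key] = trio.Event()
            users[key] = 1
            value = await service_tn.start(run_ctx, key, no_more_users)
    try:
        yield value
    finally:
        users[key] -= 1
        if not users[key]:
            resources[key].set()
            locks.pop(key)
```

With these in place, re-running the `main()` from the previous sketch should exit cleanly: the re-entering caller blocks on `teardown_done` until the old `run_ctx` has fully finished, rather than hitting the stale `resources` entry.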