diff --git a/ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.md b/ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.md
new file mode 100644
index 00000000..c6a593c4
--- /dev/null
+++ b/ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.md
@@ -0,0 +1,52 @@
+---
+model: claude-opus-4-6
+service: claude
+session: (ad-hoc, not tracked via conf.toml)
+timestamp: 2026-04-06T22:26:48Z
+git_ref: f086222
+scope: code
+substantive: true
+raw_file: 20260406T222648Z_f086222_prompt_io.raw.md
+---
+
+## Prompt
+
+User applied the per-key locking patch (`f086222d`),
+removed the `@pytest.mark.xfail` from
+`test_moc_reentry_during_teardown`, and found the
+test STILL fails — now with `RuntimeError` instead
+of `AssertionError`. Asked to diagnose why the fix
+doesn't actually resolve this race.
+
+## Response summary
+
+Traced the exact scheduling sequence across three
+concurrent tasks to show that the per-`ctx_key`
+locking change doesn't close the fundamental gap:
+
+- `maybe_open_context`'s finally pops the lock and
+  sets `no_more_users` in the **exiting caller's
+  task**
+- But the actual resource cleanup (`values.pop`,
+  acm `__aexit__`, `resources.pop`) runs in the
+  **`run_ctx` task** inside `service_tn`
+- A re-entering task creates a fresh lock (old one
+  popped), acquires immediately, and hits stale
+  `resources[ctx_key]` while `run_ctx` is still in
+  `__aexit__`
+
+Identified three possible fix strategies:
+1. Let `run_ctx` own lock release/pop after
+   `resources.pop()`
+2. Add a "teardown complete" event that new callers
+   wait on
+3. Move `resources.pop()` into `run_ctx`'s inner
+   finally (before acm `__aexit__`)
+
+## Files changed
+
+No code changes — analysis only.
+
+## Human edits
+
+N/A — diagnostic output, no code generated.
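The three-task interleaving summarized above can be modeled with stdlib primitives. A minimal sketch, using `asyncio` as a stand-in for `trio` and plain dicts for the cache state — all names (`locks`, `values`, `resources`, the task functions) are illustrative, not the real implementation:

```python
import asyncio

# Illustrative stand-ins for the cache state described in the analysis;
# the real implementation's types and names differ.
locks: dict = {}
values: dict = {}
resources: dict = {}

async def run_ctx(ctx_key, no_more_users, log):
    # Teardown task: becomes ready at step 2 but only runs after the
    # exiting caller has already popped the lock (step 3).
    await no_more_users.wait()
    values.pop(ctx_key)                # step 5: cached value gone
    log.append("in __aexit__")
    await asyncio.sleep(0.05)          # step 6: checkpoint inside acm __aexit__
    resources.pop(ctx_key)             # outer finally: too late for step 9

async def exiting_caller(ctx_key, no_more_users):
    no_more_users.set()                # step 2: wakes run_ctx (ready, not running)
    locks.pop(ctx_key)                 # step 3: lock is gone
    # step 4: caller exits with teardown still pending in run_ctx

async def reenter_during_teardown(ctx_key, log):
    await asyncio.sleep(0.01)          # wakes while run_ctx sits in __aexit__
    lock = locks.setdefault(ctx_key, asyncio.Lock())  # step 7: fresh lock
    async with lock:                   # acquires immediately, nothing contends
        if ctx_key not in values and ctx_key in resources:
            # steps 8-10: cache miss, yet stale resources still present
            log.append("RuntimeError: Caching resources ALREADY exist?!")

async def main():
    log = []
    key = "svc"
    locks[key] = asyncio.Lock()
    values[key] = "cached-value"
    resources[key] = "live-acm"
    ev = asyncio.Event()
    await asyncio.gather(
        run_ctx(key, ev, log),
        exiting_caller(key, ev),
        reenter_during_teardown(key, log),
    )
    return log

result = asyncio.run(main())
print(result)
```

Running it reproduces the diagnosed ordering: the re-enterer observes the stale `resources` entry while the teardown task is still parked inside `__aexit__`.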
diff --git a/ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.raw.md b/ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.raw.md
new file mode 100644
index 00000000..95edeedf
--- /dev/null
+++ b/ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.raw.md
@@ -0,0 +1,31 @@
+---
+model: claude-opus-4-6
+service: claude
+timestamp: 2026-04-06T22:26:48Z
+git_ref: f086222
+---
+
+Here's the problem. The per-key locking doesn't fix this race because the **lock lifetime is decoupled from `run_ctx`'s lifetime**. Trace through what happens:
+
+**`maybe_open_context` finally** (lines 468-495):
+1. `users[ctx_key] -= 1` → 0
+2. `no_more_users.set()` → wakes `run_ctx` (but it doesn't run yet, just becomes ready)
+3. `_Cache.locks.pop(ctx_key)` → **lock is gone**
+4. `maybe_open_context` exits
+
+**`run_ctx`** now gets scheduled (lines 237-241):
+5. Inner finally: `values.pop(ctx_key)` → values gone
+6. `mng.__aexit__()` runs → `in_aexit.set()`, `await trio.sleep(10)` → **CHECKPOINT**
+
+**`reenter_during_teardown`** wakes:
+7. Enters `maybe_open_context` → no lock exists for `ctx_key` → creates **fresh** lock, acquires immediately
+8. `values[ctx_key]` → `KeyError`
+9. `resources.get(ctx_key)` → **still exists!** (step 6 hasn't reached `resources.pop()` in `run_ctx`'s outer finally yet)
+10. `raise RuntimeError('Caching resources ALREADY exist?!')`
+
+The core issue: `no_more_users.set()` (step 2) and `locks.pop()` (step 3) happen in the **exiting caller's task**, but the actual resource cleanup (`values.pop`, acm `__aexit__`, `resources.pop`) happens in the **`run_ctx` task** inside `service_tn`. There's no synchronization between "lock removed, new callers welcome" and "run_ctx has actually finished tearing down". The per-`ctx_key` lock change just made the lock more granular — it didn't close this gap.
+
+The fix needs to ensure a new caller can't acquire a lock for `ctx_key` until `run_ctx` has fully exited (including `resources.pop()`).
+Either:
+- **don't pop the lock in `maybe_open_context`** — let `run_ctx` release/pop it after `resources.pop()`
+- **wait on a "teardown complete" event** in the cache-miss path when stale `resources` exist
+- **pop `resources` before the acm `__aexit__`** (move `resources.pop()` into the inner finally alongside `values.pop()`)
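The first strategy can be sketched in the same stdlib-`asyncio` model: the exiting caller leaves the lock held, and `run_ctx` pops and releases it only after `resources.pop()`, so a re-entering task blocks on the old lock instead of observing stale state. All names are illustrative; note one real-world caveat this model glosses over: `trio.Lock` enforces owner-only release, so in actual trio code this handoff would likely need a `trio.Event` (strategy 2) rather than a cross-task release.

```python
import asyncio

# Illustrative cache state, as in the analysis above.
locks: dict = {}
values: dict = {}
resources: dict = {}

async def run_ctx(ctx_key, no_more_users, log):
    await no_more_users.wait()
    values.pop(ctx_key)
    await asyncio.sleep(0.05)          # acm __aexit__ checkpoint
    resources.pop(ctx_key)
    # The fix: run_ctx, not the exiting caller, pops and releases the
    # lock, and only once resources are fully gone.
    locks.pop(ctx_key).release()
    log.append("teardown complete")

async def exiting_caller(ctx_key, no_more_users):
    await locks[ctx_key].acquire()
    no_more_users.set()
    # Deliberately does NOT pop or release the lock here.

async def reenter_during_teardown(ctx_key, log):
    await asyncio.sleep(0.01)
    lock = locks.setdefault(ctx_key, asyncio.Lock())
    await lock.acquire()               # existing lock is still held: blocks
    # By the time we get here, teardown has fully completed.
    assert ctx_key not in values and ctx_key not in resources
    log.append("clean cache miss, safe to recreate")
    lock.release()

async def main():
    log = []
    key = "svc"
    locks[key] = asyncio.Lock()
    values[key] = "cached-value"
    resources[key] = "live-acm"
    ev = asyncio.Event()
    await asyncio.gather(
        run_ctx(key, ev, log),
        exiting_caller(key, ev),
        reenter_during_teardown(key, log),
    )
    return log

result = asyncio.run(main())
print(result)
```

With the lock's end of life owned by the teardown task, the re-enterer now sees a clean cache miss rather than the `RuntimeError` path.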