tractor/ai/prompt-io/claude/20260406T222648Z_f086222_pr...

Here's the problem: the per-key locking doesn't fix this race because the lock's lifetime is decoupled from `run_ctx`'s lifetime. Trace through what happens:

`maybe_open_context` finally (lines 468-495):

1. `users[ctx_key] -= 1` → 0
2. `no_more_users.set()` → wakes `run_ctx` (but it doesn't run yet, it just becomes ready)
3. `_Cache.locks.pop(ctx_key)` → lock is gone
4. `maybe_open_context` exits

`run_ctx` now gets scheduled (lines 237-241):

5. Inner finally: `values.pop(ctx_key)` → values gone
6. `mng.__aexit__()` runs → `in_aexit.set()`, `await trio.sleep(10)` → CHECKPOINT

`reenter_during_teardown` wakes:

7. Enters `maybe_open_context` → no lock exists for `ctx_key` → creates a fresh lock, acquires it immediately
8. `values[ctx_key]` → `KeyError`
9. `resources.get(ctx_key)` → still exists! (step 6 hasn't reached `resources.pop()` in `run_ctx`'s outer finally yet)
10. `raise RuntimeError('Caching resources ALREADY exist?!')`

The core issue: `no_more_users.set()` (step 2) and `locks.pop()` (step 3) happen in the exiting caller's task, but the actual resource cleanup (`values.pop()`, the acm's `__aexit__()`, `resources.pop()`) happens in the `run_ctx` task inside `service_tn`. There's no synchronization between "lock removed, new callers welcome" and "`run_ctx` has actually finished tearing down". The per-`ctx_key` lock change just made the lock more granular; it didn't close this gap.
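The interleaving can be reproduced in miniature. The sketch below is hypothetical and heavily simplified (invented dict names and timings, no nursery, and `asyncio` standing in for trio since only the event/checkpoint ordering matters); it is not the real `maybe_open_context`, just the shape of the bug:

```python
import asyncio

locks, values, resources = {}, {}, {}
errors = []

async def run_ctx(key: str, no_more_users: asyncio.Event):
    # Background task that owns the cached resource's lifetime.
    resources[key] = object()
    values[key] = 'cached'
    await no_more_users.wait()
    values.pop(key)             # "inner finally": cached value dropped
    await asyncio.sleep(0.05)   # stand-in for the acm __aexit__ checkpoint
    resources.pop(key)          # "outer finally": teardown truly complete

async def main():
    no_more_users = asyncio.Event()
    locks['k'] = asyncio.Lock()
    task = asyncio.create_task(run_ctx('k', no_more_users))
    await asyncio.sleep(0.01)   # let run_ctx populate the cache

    # Exiting caller (last user): wakes run_ctx and pops the lock,
    # but run_ctx has not actually torn anything down yet.
    no_more_users.set()         # step 2
    locks.pop('k')              # step 3

    await asyncio.sleep(0.01)   # run_ctx pops values, parks mid-__aexit__

    # Re-entering caller: no lock exists, so it creates a fresh one,
    # acquires it immediately, and observes the half-torn-down state.
    async with locks.setdefault('k', asyncio.Lock()):
        if 'k' not in values and 'k' in resources:
            errors.append('Caching resources ALREADY exist?!')
    await task

asyncio.run(main())
```

The two `sleep` calls just force the same scheduling order the trace above describes; in the real code the `trio.sleep(10)` checkpoint inside `__aexit__` plays that role.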

The fix needs to ensure a new caller can't acquire a lock for `ctx_key` until `run_ctx` has fully exited (including `resources.pop()`). Either:

- don't pop the lock in `maybe_open_context`; let `run_ctx` release/pop it after `resources.pop()`
- wait on a "teardown complete" event in the cache-miss path when stale resources exist
- pop resources before the acm `__aexit__()` (move `resources.pop()` into the inner finally alongside `values.pop()`)
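As a sketch of the second option, here is a hedged `asyncio` model (hypothetical names again; asyncio standing in for trio) in which `run_ctx` owns a per-key "teardown complete" event that a cache-miss caller awaits whenever stale resources remain, instead of raising:

```python
import asyncio

locks, values, resources, teardowns = {}, {}, {}, {}

async def run_ctx(key: str, no_more_users: asyncio.Event):
    resources[key] = object()
    values[key] = 'cached'
    teardowns[key] = asyncio.Event()
    await no_more_users.wait()
    try:
        values.pop(key)             # inner finally
        await asyncio.sleep(0.05)   # acm __aexit__ checkpoint
    finally:
        resources.pop(key)          # outer finally
        teardowns.pop(key).set()    # only now admit new cache-miss callers

async def reenter(key: str) -> str:
    async with locks.setdefault(key, asyncio.Lock()):
        if key not in values:
            # Cache miss with a stale teardown in flight: wait it out
            # rather than raising 'Caching resources ALREADY exist?!'.
            if key in resources:
                await teardowns[key].wait()
            assert key not in resources
            return 'safe to start a fresh run_ctx'
        return 'cache hit'

async def main() -> str:
    no_more_users = asyncio.Event()
    task = asyncio.create_task(run_ctx('k', no_more_users))
    await asyncio.sleep(0.01)
    no_more_users.set()             # last user leaves ...
    locks.pop('k', None)            # ... and the early lock-pop still happens
    await asyncio.sleep(0.01)       # run_ctx parks mid-__aexit__
    outcome = await reenter('k')    # blocks until teardown fully completes
    await task
    return outcome

result = asyncio.run(main())
```

Note the caller grabs the event object before `run_ctx` pops it from the dict, so the pop-then-set in the finally is safe. The first option (letting `run_ctx` pop the lock itself) is arguably cleaner since it keeps one owner for both the lock and the resource lifetime.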