fix: Action Not Found Race Condition by alekseysotnikov · Pull Request #36 · dynamic-alpha/hyper

alekseysotnikov · 2026-04-19T07:45:28Z

This fix builds on PR #34, which helped identify the root cause.
Originally, I encountered the problem when an action received CustomEvents from a third-party JS component (similar to this datastar example) "concurrently" with another action trigger.

How to reproduce

Load example.app ns and start typing chars fast until get HTTP 500.

Symptom

Intermittent errors:

ERROR hyper.server Action not found 
{:hyper/action-id "a_tab_e62e9fkbq0jyza5sepz82j50ex_1", :hyper/action-ids {}}

The :actions map was completely empty — despite actions being registered on every render.

Root Cause: The Cleanup Gap

The render loop in server.clj performed two operations sequentially:

;; server.clj -renderer-loop! (BEFORE fix)
(actions/cleanup-tab-actions! app-state* tab-id)   ; Step A: wipe ALL actions
(render/render-tab app-state* session-id tab-id)    ; Step B: re-register actions

Between Step A and Step B, :actions is {}. This gap lasts ~1-50ms depending on render complexity.

The Broken Flow (Race Condition Timeline)

Time    Render Thread                          Action Handler Thread
────    ─────────────                          ─────────────────────
T0      cleanup-tab-actions! → :actions = {}
T1                                             User clicks button
T2                                             HTTP POST arrives: action-id = "a_tab_..._1"
T3                                             (get-in @app-state* [:actions id]) → nil!
T4      render-tab starts registering actions  ERROR: "Action not found"
T5      render-tab finishes, :actions has data
T6      send! SSE payload to client

The action handler's lookup at T3 happens during the gap — the action existed at T0 (from the previous render), was wiped at T0, and hasn't been re-registered yet at T4.

Why Simpler Fixes Don't Work

Approach	Why It Fails
Remove cleanup	Old actions execute mid-render, mutating state while render reads it → inconsistent HTML (part rendered with old state, part with new)
Graceful "not found"	Silently drops the action → state mutation is lost → app logic breaks
Spin-wait loop	Wastes CPU cycles polling — violates the semaphore pattern already used for efficient thread parking
Move lookup after cleanup	Same race — the gap still exists between cleanup and re-registration

The Correct Fix: ReentrantReadWriteLock (FIFO)

Why RwLock specifically:

Read lock is shared — multiple actions execute concurrently (no throughput loss during normal operation)
Write lock is exclusive — blocks ALL actions during cleanup+render (no mid-render mutation)
FIFO mode — prevents render starvation under heavy action load (render gets its turn in queue)
Zero CPU waste — threads park while waiting (same efficiency as the existing semaphore pattern)

The lock must cover the entire action lifecycle — lookup AND execution. If the lookup happens outside the lock, the race still exists.

The Correct Flow (With RwLock)

Time    Render Thread (write lock)               Action Handler Thread (read lock)
────    ────────────────────────                 ─────────────────────────────────
T0      .lock writeLock
T1      cleanup-tab-actions! → :actions = {}     (.lock readLock) ← BLOCKED by write lock
T2      render-tab registers new actions         (still waiting...)
T3      .unlock writeLock                        (acquires read lock)
T4      send! SSE payload                        (get-in @app-state* [:actions id]) → found! ✓
T5                                               (fn client-params) → executes ✓
T6                                               .unlock readLock

The action handler blocks at T1 until the render completes at T3. By then, new actions are registered and the lookup succeeds.

Files Changed

File	Change
`hyper/server.clj`	Added `ReentrantReadWriteLock` import; created lock per tab in `-start-renderer!`; wrapped cleanup+render in write lock in `-renderer-loop!`
`hyper/actions.clj`	Acquire read lock, then perform lookup + execution inside the lock

Key Design Decisions

Release write lock after render, before send — Minimizes lock hold time; send! doesn't touch app-state anyway
FIFO fairness mode — Prevents render starvation under heavy action load
Lock covers lookup + execution — The critical fix; prevents the race at the source

Adds `compact-uuids` library and switches frontend-facing IDs from hyphen (`-`) to underscore (`_`) separator (snake case). ## Rationale - **Snake case:** allow double-click selection of entire ID in browser devtools - **Compact UUIDs:** 26 chars vs 36 chars (30% smaller), URL-safe, no ambiguous characters (0/O, 1/l/I) ## Changes 1. **Compact UUIDs:** All UUID usages replaced with compact encoding 2. **Snake case:** Only frontend-facing IDs modified - action-id and tab-id

Problem: cleanup-tab-actions! created a gap where :actions was empty between cleanup and re-registration. In-flight HTTP action requests during this gap would fail with "Action not found". Solution: Use ReentrantReadWriteLock (FIFO) per tab to serialize cleanup+render (write lock) with action execution (read lock).

rschmukler · 2026-04-19T22:20:37Z

Thanks for the detailed write up and detective work. Been traveling for the weekend but will review ASAP.

rschmukler · 2026-04-21T17:58:51Z

Hey @alekseysotnikov thanks again for finding this bug! I've fixed it in dc317a3.

Since the renderer is a dedicated virtual thread and action IDs are deterministic, the issue really only existed because we deleted actions before re-registering them. Rather than introducing locks around it, I was able to just render first (which re-registers all live actions in place), then atomically swap out any any stale IDs that weren't re-registered.

Thanks again for finding the bug!

Replace the cleanup-before-render pattern that created a gap where actions were missing between cleanup and re-registration. Instead, collect registered action IDs during render via a new *registered-action-ids* dynamic var, then atomically sweep only stale actions after render completes. No window where actions are absent, no locks needed. Closes #36

alekseysotnikov · 2026-04-22T06:36:59Z

Thanks. I have reviewed your commit, and it aligns with my initial approach. However, I later realized there's a second important point to consider: since a single action can update multiple cursors, the renderer might run mid-execution and broadcast an unintended intermediate state, ie inconsistent HTML, part rendered with old state, part with new. For example:

Action starts
Action updates cursor1
Renderer runs and sends an SSE update
Action updates cursor2
Action completes

This means the renderer could emit a partial state that wasn't designed to be exposed. Local signals or reactive handlers might then respond incorrectly to this intermediate state. The only way to prevent this would be if actions guaranteed atomic state updates, which they currently don't. Thats why locks were added to the renderer.
What do you think? Could this edge case pose a risk to real user scenarios?

rschmukler · 2026-04-22T13:15:17Z

That's a great thought. I wonder if either some sort of batching mechanism or STM could be used to help ensure consistent rendering. I will give it some more thought.

In reality due to throttling of the render loop it is less likely to be hit, but certainly could be.

That being said, a user could get atomicity guarantees by storing both pieces of state inside the cursor instead of using two cursors - and in a way this is very similar to regular Clojure as well.

alekseysotnikov · 2026-04-22T13:36:19Z

a user could get atomicity guarantees by storing both pieces of state inside the cursor instead of using two cursors

Sure, but it is a workaround in the current implementation. It is unclear from the beginning, meaning a user could get hurt here. Also it forces to have more gymnastic with keywords.

In reality due to throttling of the render loop it is less likely to be hit, but certainly could be.

I also think that in worse cases, this could have a cumulative effect that strikes too rarely and is hard to debug.

rschmukler · 2026-05-05T22:31:23Z

@alekseysotnikov I saw some of the stuff around the PR on your fork and it prompted me to think about this a bit more. I really think that globally locking ends up being the wrong choice:

Slow actions block rendering. If an action does I/O (HTTP call, DB write) while holding the read lock, the renderer can't proceed until the action completes. The user sees no UI updates for the duration.
Convoy effect under load. With a fair RwLock, once the renderer requests the write lock, all subsequent action read-lock requests queue behind it. A single slow action holds up the renderer, and the pending renderer holds up every other incoming action for that tab. Throughput collapses to serial execution.
Prevents intentional intermediate state. When an action sets :loading before an HTTP call, you want the renderer to show that state. The lock prevents this - the UI is frozen until the entire action completes.

Instead I introduced some new code in 3c6e730 that causes a consistent picture of all cursors to be presented at the time that we start the render cycle.

Let me know what your thoughts are and thanks again for your investigation in this to begin with!

alekseysotnikov · 2026-05-06T10:00:40Z

I agree with your concerns about the consequences of locking — this has been bothering me lately as well, and it’s pushing me to think in the direction of something like an optional lock-free action mode (which complicates usability).

Your approach is really good, but there’s still a case where intermediate state can appear on the client side:

Case: the renderer starts and takes a snapshot in the middle of Action A’s execution.  
The snapshot' includes mutation A1 but does not include A2.

┌─────────┐        ┌──────────┐        ┌───────────┐
│ Action A│        │ Renderer │        │  Client   │
└────┬────┘        └────┬─────┘        └─────┬─────┘
     │                   │                    │
     │  A starts         │                    │
     ├──────────────────>│                    │
     │                   │                    │
     │ update cursor A1  │                    │
     ├──────────────────>│                    │
     │                   │ snapshot' -> SSE   │
     │                   ├───────────────────>│
     │                   │                    │
     │ update cursor A2  │                    │
     ├──────────────────>│                    │
     │                   │                    │
     │ A completes       │                    │
     └──────────────────>│                    │

rschmukler · 2026-05-06T14:13:58Z

@alekseysotnikov I think the solution will be to introduce a batch macro to allow updating multiple cursors in a single atomic operation. The reason that I think we want this distinction is because there are cases (such as a progress bar) where we want an action handler to be able to be able to transmit intermediate cursor updates

alekseysotnikov · 2026-05-06T14:50:21Z

This 100% reasonable. If batch were allowed to perform side effects inside its body, that would be ideal. Then the user could simply wrap the entire body of an action in a batch and gain a guarantee of an atomic render of mutations.

superficial code:
(h/action 
  (h/batch
    ...cursor A mutation
    ...storage query...
    ...cursor B mutation
))

So this way locks are not needed for sure

rschmukler · 2026-05-06T17:59:40Z

@alekseysotnikov Just shipped h/batch Inside a batch, cursor writes accumulate in a shadow atom and flush to app-state* in a single swap! at the end, so the renderer never sees an intermediate state between mutations.

(h/action
  (reset! loading* true)        ;; immediate — renderer shows spinner
  (let [data (fetch-from-api!)]
    (h/batch                     ;; atomic — lands as one swap
      (reset! data* data)
      (reset! loading* false))))

It's opt-in, so actions that want intermediate state (progress bars, loading spinners) keep working exactly as before — just don't wrap those writes in batch. A pending batch operation doesn't block the renderer it just doesn't propagate it outside of the context until the batch resolves.

Under the hood it reuses the same overlay mechanism we already had for render snapshots, so there's no new concurrency primitive, just a dynamic binding and a flush function.

Thanks again for all of the back and forth on this. Hopefully this covers your use case!

alekseysotnikov · 2026-05-07T07:34:52Z

WOW, that’s really amazing! The project is designed incredibly well. Thank you again

alekseysotnikov force-pushed the fix/eliminate-action-not-found-race-condition-with-RwLock branch 3 times, most recently from 4884393 to 32d62c8 Compare April 19, 2026 12:23

alekseysotnikov force-pushed the fix/eliminate-action-not-found-race-condition-with-RwLock branch from 32d62c8 to 900e7bb Compare April 19, 2026 15:49

rschmukler closed this in 570464f Apr 21, 2026

This was referenced May 4, 2026

Fix/eliminate action not found race condition with rw lock #39

Closed

Fix/eliminate action not found race condition with rw lock alekseysotnikov/hyper#2

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Action Not Found Race Condition#36

fix: Action Not Found Race Condition#36
alekseysotnikov wants to merge 2 commits into
dynamic-alpha:mainfrom
alekseysotnikov:fix/eliminate-action-not-found-race-condition-with-RwLock

alekseysotnikov commented Apr 19, 2026 •

edited

Loading

Uh oh!

rschmukler commented Apr 19, 2026

Uh oh!

rschmukler commented Apr 21, 2026 •

edited

Loading

Uh oh!

alekseysotnikov commented Apr 22, 2026 •

edited

Loading

Uh oh!

rschmukler commented Apr 22, 2026

Uh oh!

alekseysotnikov commented Apr 22, 2026 •

edited

Loading

Uh oh!

rschmukler commented May 5, 2026

Uh oh!

alekseysotnikov commented May 6, 2026 •

edited

Loading

Uh oh!

rschmukler commented May 6, 2026

Uh oh!

alekseysotnikov commented May 6, 2026

Uh oh!

rschmukler commented May 6, 2026

Uh oh!

alekseysotnikov commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alekseysotnikov commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to reproduce

Symptom

Root Cause: The Cleanup Gap

The Broken Flow (Race Condition Timeline)

Why Simpler Fixes Don't Work

The Correct Fix: ReentrantReadWriteLock (FIFO)

Why RwLock specifically:

The Correct Flow (With RwLock)

Files Changed

Key Design Decisions

Uh oh!

rschmukler commented Apr 19, 2026

Uh oh!

rschmukler commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alekseysotnikov commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rschmukler commented Apr 22, 2026

Uh oh!

alekseysotnikov commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rschmukler commented May 5, 2026

Uh oh!

alekseysotnikov commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rschmukler commented May 6, 2026

Uh oh!

alekseysotnikov commented May 6, 2026

Uh oh!

rschmukler commented May 6, 2026

Uh oh!

alekseysotnikov commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alekseysotnikov commented Apr 19, 2026 •

edited

Loading

rschmukler commented Apr 21, 2026 •

edited

Loading

alekseysotnikov commented Apr 22, 2026 •

edited

Loading

alekseysotnikov commented Apr 22, 2026 •

edited

Loading

alekseysotnikov commented May 6, 2026 •

edited

Loading