Skip to content

fix: Action Not Found Race Condition#36

Closed
alekseysotnikov wants to merge 2 commits into
dynamic-alpha:mainfrom
alekseysotnikov:fix/eliminate-action-not-found-race-condition-with-RwLock
Closed

fix: Action Not Found Race Condition#36
alekseysotnikov wants to merge 2 commits into
dynamic-alpha:mainfrom
alekseysotnikov:fix/eliminate-action-not-found-race-condition-with-RwLock

Conversation

@alekseysotnikov
Copy link
Copy Markdown
Contributor

@alekseysotnikov alekseysotnikov commented Apr 19, 2026

This fix builds on PR #34, which helped identify the root cause.
Originally, I encountered the problem when an action received CustomEvents from a third-party JS component (similar to this datastar example) "concurrently" with another action trigger.

How to reproduce

Load example.app ns and start typing chars fast until get HTTP 500.
Screenshot 2026-04-19 at 15 06 26

Symptom

Intermittent errors:

ERROR hyper.server Action not found 
{:hyper/action-id "a_tab_e62e9fkbq0jyza5sepz82j50ex_1", :hyper/action-ids {}}

The :actions map was completely empty — despite actions being registered on every render.


Root Cause: The Cleanup Gap

The render loop in server.clj performed two operations sequentially:

;; server.clj -renderer-loop! (BEFORE fix)
(actions/cleanup-tab-actions! app-state* tab-id)   ; Step A: wipe ALL actions
(render/render-tab app-state* session-id tab-id)    ; Step B: re-register actions

Between Step A and Step B, :actions is {}. This gap lasts ~1-50ms depending on render complexity.


The Broken Flow (Race Condition Timeline)

Time    Render Thread                          Action Handler Thread
────    ─────────────                          ─────────────────────
T0      cleanup-tab-actions! → :actions = {}
T1                                             User clicks button
T2                                             HTTP POST arrives: action-id = "a_tab_..._1"
T3                                             (get-in @app-state* [:actions id]) → nil!
T4      render-tab starts registering actions  ERROR: "Action not found"
T5      render-tab finishes, :actions has data
T6      send! SSE payload to client

The action handler's lookup at T3 happens during the gap — the action existed at T0 (from the previous render), was wiped at T0, and hasn't been re-registered yet at T4.


Why Simpler Fixes Don't Work

Approach Why It Fails
Remove cleanup Old actions execute mid-render, mutating state while render reads it → inconsistent HTML (part rendered with old state, part with new)
Graceful "not found" Silently drops the action → state mutation is lost → app logic breaks
Spin-wait loop Wastes CPU cycles polling — violates the semaphore pattern already used for efficient thread parking
Move lookup after cleanup Same race — the gap still exists between cleanup and re-registration

The Correct Fix: ReentrantReadWriteLock (FIFO)

Why RwLock specifically:

  1. Read lock is shared — multiple actions execute concurrently (no throughput loss during normal operation)
  2. Write lock is exclusive — blocks ALL actions during cleanup+render (no mid-render mutation)
  3. FIFO mode — prevents render starvation under heavy action load (render gets its turn in queue)
  4. Zero CPU waste — threads park while waiting (same efficiency as the existing semaphore pattern)

The lock must cover the entire action lifecycle — lookup AND execution. If the lookup happens outside the lock, the race still exists.

The Correct Flow (With RwLock)

Time    Render Thread (write lock)               Action Handler Thread (read lock)
────    ────────────────────────                 ─────────────────────────────────
T0      .lock writeLock
T1      cleanup-tab-actions! → :actions = {}     (.lock readLock) ← BLOCKED by write lock
T2      render-tab registers new actions         (still waiting...)
T3      .unlock writeLock                        (acquires read lock)
T4      send! SSE payload                        (get-in @app-state* [:actions id]) → found! ✓
T5                                               (fn client-params) → executes ✓
T6                                               .unlock readLock

The action handler blocks at T1 until the render completes at T3. By then, new actions are registered and the lookup succeeds.


Files Changed

File Change
hyper/server.clj Added ReentrantReadWriteLock import; created lock per tab in -start-renderer!; wrapped cleanup+render in write lock in -renderer-loop!
hyper/actions.clj Acquire read lock, then perform lookup + execution inside the lock

Key Design Decisions

  1. Release write lock after render, before send — Minimizes lock hold time; send! doesn't touch app-state anyway
  2. FIFO fairness mode — Prevents render starvation under heavy action load
  3. Lock covers lookup + execution — The critical fix; prevents the race at the source

Adds `compact-uuids` library and switches frontend-facing IDs from hyphen (`-`) to underscore (`_`) separator (snake case).
## Rationale
- **Snake case:** allow double-click selection of entire ID in browser devtools
- **Compact UUIDs:** 26 chars vs 36 chars (30% smaller), URL-safe, no ambiguous characters (0/O, 1/l/I)
## Changes
1. **Compact UUIDs:** All UUID usages replaced with compact encoding
2. **Snake case:** Only frontend-facing IDs modified - action-id and tab-id
@alekseysotnikov alekseysotnikov force-pushed the fix/eliminate-action-not-found-race-condition-with-RwLock branch 3 times, most recently from 4884393 to 32d62c8 Compare April 19, 2026 12:23
Problem: cleanup-tab-actions! created a gap where :actions was empty
between cleanup and re-registration. In-flight HTTP action requests
during this gap would fail with "Action not found".
Solution: Use ReentrantReadWriteLock (FIFO) per tab to serialize
cleanup+render (write lock) with action execution (read lock).
@alekseysotnikov alekseysotnikov force-pushed the fix/eliminate-action-not-found-race-condition-with-RwLock branch from 32d62c8 to 900e7bb Compare April 19, 2026 15:49
@rschmukler
Copy link
Copy Markdown
Contributor

Thanks for the detailed write up and detective work. Been traveling for the weekend but will review ASAP.

@rschmukler
Copy link
Copy Markdown
Contributor

rschmukler commented Apr 21, 2026

Hey @alekseysotnikov thanks again for finding this bug! I've fixed it in dc317a3.

Since the renderer is a dedicated virtual thread and action IDs are deterministic, the issue really only existed because we deleted actions before re-registering them. Rather than introducing locks around it, I was able to just render first (which re-registers all live actions in place), then atomically swap out any any stale IDs that weren't re-registered.

Thanks again for finding the bug!

rschmukler added a commit that referenced this pull request Apr 21, 2026
Replace the cleanup-before-render pattern that created a gap where
actions were missing between cleanup and re-registration.

Instead, collect registered action IDs during render via a new
*registered-action-ids* dynamic var, then atomically sweep only
stale actions after render completes. No window where actions are
absent, no locks needed.

Closes #36
@alekseysotnikov
Copy link
Copy Markdown
Contributor Author

alekseysotnikov commented Apr 22, 2026

Thanks. I have reviewed your commit, and it aligns with my initial approach. However, I later realized there's a second important point to consider: since a single action can update multiple cursors, the renderer might run mid-execution and broadcast an unintended intermediate state, ie inconsistent HTML, part rendered with old state, part with new. For example:

  1. Action starts
  2. Action updates cursor1
  3. Renderer runs and sends an SSE update
  4. Action updates cursor2
  5. Action completes

This means the renderer could emit a partial state that wasn't designed to be exposed. Local signals or reactive handlers might then respond incorrectly to this intermediate state. The only way to prevent this would be if actions guaranteed atomic state updates, which they currently don't. Thats why locks were added to the renderer.
What do you think? Could this edge case pose a risk to real user scenarios?

@rschmukler
Copy link
Copy Markdown
Contributor

That's a great thought. I wonder if either some sort of batching mechanism or STM could be used to help ensure consistent rendering. I will give it some more thought.

In reality due to throttling of the render loop it is less likely to be hit, but certainly could be.

That being said, a user could get atomicity guarantees by storing both pieces of state inside the cursor instead of using two cursors - and in a way this is very similar to regular Clojure as well.

@alekseysotnikov
Copy link
Copy Markdown
Contributor Author

alekseysotnikov commented Apr 22, 2026

a user could get atomicity guarantees by storing both pieces of state inside the cursor instead of using two cursors

Sure, but it is a workaround in the current implementation. It is unclear from the beginning, meaning a user could get hurt here. Also it forces to have more gymnastic with keywords.

In reality due to throttling of the render loop it is less likely to be hit, but certainly could be.

I also think that in worse cases, this could have a cumulative effect that strikes too rarely and is hard to debug.

@rschmukler
Copy link
Copy Markdown
Contributor

@alekseysotnikov I saw some of the stuff around the PR on your fork and it prompted me to think about this a bit more. I really think that globally locking ends up being the wrong choice:

  1. Slow actions block rendering. If an action does I/O (HTTP call, DB write) while holding the read lock, the renderer can't proceed until the action completes. The user sees no UI updates for the duration.

  2. Convoy effect under load. With a fair RwLock, once the renderer requests the write lock, all subsequent action read-lock requests queue behind it. A single slow action holds up the renderer, and the pending renderer holds up every other incoming action for that tab. Throughput collapses to serial execution.

  3. Prevents intentional intermediate state. When an action sets :loading before an HTTP call, you want the renderer to show that state. The lock prevents this - the UI is frozen until the entire action completes.

Instead I introduced some new code in 3c6e730 that causes a consistent picture of all cursors to be presented at the time that we start the render cycle.

Let me know what your thoughts are and thanks again for your investigation in this to begin with!

@alekseysotnikov
Copy link
Copy Markdown
Contributor Author

alekseysotnikov commented May 6, 2026

I agree with your concerns about the consequences of locking — this has been bothering me lately as well, and it’s pushing me to think in the direction of something like an optional lock-free action mode (which complicates usability).

Your approach is really good, but there’s still a case where intermediate state can appear on the client side:

Case: the renderer starts and takes a snapshot in the middle of Action A’s execution.  
The snapshot' includes mutation A1 but does not include A2.

┌─────────┐        ┌──────────┐        ┌───────────┐
│ Action A│        │ Renderer │        │  Client   │
└────┬────┘        └────┬─────┘        └─────┬─────┘
     │                   │                    │
     │  A starts         │                    │
     ├──────────────────>│                    │
     │                   │                    │
     │ update cursor A1  │                    │
     ├──────────────────>│                    │
     │                   │ snapshot' -> SSE   │
     │                   ├───────────────────>│
     │                   │                    │
     │ update cursor A2  │                    │
     ├──────────────────>│                    │
     │                   │                    │
     │ A completes       │                    │
     └──────────────────>│                    │

@rschmukler
Copy link
Copy Markdown
Contributor

@alekseysotnikov I think the solution will be to introduce a batch macro to allow updating multiple cursors in a single atomic operation. The reason that I think we want this distinction is because there are cases (such as a progress bar) where we want an action handler to be able to be able to transmit intermediate cursor updates

@alekseysotnikov
Copy link
Copy Markdown
Contributor Author

This 100% reasonable. If batch were allowed to perform side effects inside its body, that would be ideal. Then the user could simply wrap the entire body of an action in a batch and gain a guarantee of an atomic render of mutations.

superficial code:
(h/action 
  (h/batch
    ...cursor A mutation
    ...storage query...
    ...cursor B mutation
))

So this way locks are not needed for sure

@rschmukler
Copy link
Copy Markdown
Contributor

@alekseysotnikov Just shipped h/batch Inside a batch, cursor writes accumulate in a shadow atom and flush to app-state* in a single swap! at the end, so the renderer never sees an intermediate state between mutations.

(h/action
  (reset! loading* true)        ;; immediate — renderer shows spinner
  (let [data (fetch-from-api!)]
    (h/batch                     ;; atomic — lands as one swap
      (reset! data* data)
      (reset! loading* false))))

It's opt-in, so actions that want intermediate state (progress bars, loading spinners) keep working exactly as before — just don't wrap those writes in batch. A pending batch operation doesn't block the renderer it just doesn't propagate it outside of the context until the batch resolves.

Under the hood it reuses the same overlay mechanism we already had for render snapshots, so there's no new concurrency primitive, just a dynamic binding and a flush function.

Thanks again for all of the back and forth on this. Hopefully this covers your use case!

@alekseysotnikov
Copy link
Copy Markdown
Contributor Author

WOW, that’s really amazing! The project is designed incredibly well. Thank you again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants