Skip to content

Fixes #39277 - Fix TopbarSweeper thread-safety race condition under Puma#10969

Open
pablomh wants to merge 1 commit intotheforeman:developfrom
pablomh:fix/topbar-sweeper-thread-safety
Open

Fixes #39277 - Fix TopbarSweeper thread-safety race condition under Puma#10969
pablomh wants to merge 1 commit intotheforeman:developfrom
pablomh:fix/topbar-sweeper-thread-safety

Conversation

@pablomh
Copy link
Copy Markdown
Contributor

@pablomh pablomh commented Apr 29, 2026

TopbarSweeper uses include Singleton with a shared controller attribute
across all Puma threads within each worker. Under concurrent requests, one
thread ensure block nils the controller while another thread is actively
using it:

Thread A                           Thread B
set_topbar_sweeper_controller:
  controller = self
  yield -> action runs...
                                   set_topbar_sweeper_controller:
                                     controller = self
                                     yield -> action runs...
                                   ensure:
                                     controller = nil  <- clobbers Thread A
  set_taxonomy -> expire_cache:
    controller.expire_fragment(...)
    -> NoMethodError: undefined method expire_fragment for nil:NilClass

The existing if controller.present? guard has a TOCTOU (time-of-check
to time-of-use) race: the check passes, then another thread nils the
value before expire_fragment is called.

This bug is hard to reproduce under typical load because the Singleton is
per-process (each Puma worker has its own), so only the threads within a
single worker compete. With the default 5 threads per worker, the race
window is extremely narrow. It becomes observable under high concurrency
(many simultaneous registrations) or with increased threads per worker,
where more threads share the same Singleton and the interleaving becomes
likely. When it does trigger, it produces a single 500 error that is
easily dismissed as transient noise in production logs.

Discovered during concurrent host registration performance testing.
The error appears in production.log at topbar_sweeper.rb:6.

Changes

Replace the Singleton with Thread.current storage, following the same
pattern already used by ThreadSession for User.current,
Organization.current, and Location.current. Each Puma thread gets
its own controller reference - no cross-thread clobbering.

  • app/services/topbar_sweeper.rb - Remove include Singleton, store
    controller in Thread.current[:topbar_sweeper_controller], use safe
    navigation (&.) instead of TOCTOU if controller.present? guard
  • app/controllers/concerns/foreman/controller/topbar_sweeper.rb -
    .instance.controller -> .controller
  • app/models/concerns/topbar_cache_expiry.rb - same
  • test/unit/topbar_sweeper_test.rb - NEW: thread isolation and nil
    safety tests

How to test

Prerequisites

Foreman environment with Puma in multi-threaded mode (default).

Steps

  1. Run the new unit test:
    bundle exec rake test TEST=test/unit/topbar_sweeper_test.rb
  2. Run the existing integration test:
    bundle exec rake test TEST=test/integration/top_bar_test.rb
  3. To reproduce the original race condition: register many hosts
    concurrently and grep for expire_fragment errors in production.log
    • should no longer appear.

@pablomh pablomh force-pushed the fix/topbar-sweeper-thread-safety branch from f0b2351 to e64332d Compare April 29, 2026 17:54
Copy link
Copy Markdown
Member

@chris1984 chris1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple small comments, will let someone else look it over as well. @ShimShtein

Comment thread test/unit/topbar_sweeper_test.rb
Comment thread app/models/concerns/topbar_cache_expiry.rb Outdated
TopbarSweeper uses a Singleton with a shared controller attribute
across all Puma threads. Under concurrent requests, one thread ensure
block nils the controller while another thread is still using it,
causing NoMethodError on expire_fragment.

Replace the Singleton with Thread.current storage, following the same
pattern used by ThreadSession for User.current/Organization.current.
Each thread gets its own controller reference - no cross-thread
clobbering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pablomh pablomh force-pushed the fix/topbar-sweeper-thread-safety branch from e64332d to 1ccd744 Compare May 4, 2026 20:33
Copy link
Copy Markdown
Member

@chris1984 chris1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the things I suggested and the test. Will let @ShimShtein give a final look.

@ShimShtein
Copy link
Copy Markdown
Member

Overall the PR looks good. If I understand correctly, you have moved the controller cache handling into each thread.
My only question is: since there are much more threads than processes, did you notice a memory consumption degradation? Just from the top of my head, we are creating a reference that will exist to the whole lifetime of the thread, and I am not sure what object tree spans from that reference (nothing will be GC'ed under this tree). Such long-lived objects can create a certain strain on the GC (the objects become quite permanent in memory, not allowing the GC to properly reuse the pages.

@pablomh
Copy link
Copy Markdown
Contributor Author

pablomh commented May 5, 2026

The Thread.current reference has the same lifecycle as the old Singleton attribute — set at request start (around_action), nil'd in ensure at request end. Nothing persists between requests. This is the same pattern ThreadSession uses for User.current and Organization.current.

We initially thought this was only observable under extreme concurrency, but looking more closely at the data, that's not quite right. We found it during concurrent host registration testing on a 16-CPU Satellite. At the default Puma config (24 workers × 5 threads = 120 slots), the race triggered at 760+ concurrent hosts. When we increased to 16 workers × 12 threads = 192 slots, it triggered at just 456 concurrent — more threads per worker widens the race window.

What's interesting is why this hasn't been seen before, given the Puma config hasn't changed. We think it's because our other registration performance patches (fact import reduction, Candlepin caching) made Puma significantly more efficient — requests complete faster, so threads cycle through the around_action ensure block more frequently, and more requests succeed and reach the finalize_registrationset_taxonomyexpire_cache code path where the race lives. The old inefficiency was accidentally masking the bug: requests were too slow and too many timed out to ever trigger the collision often enough to notice.

Regarding memory: comparing Puma RSS across multiple runs with and without this fix, we see a ~0.7 GB Puma RSS increase — but that's the combined effect of all patches in the batch (caching layers, connection pooling, this fix). The Thread.current change itself adds one pointer per thread that's nil between requests, so its contribution is negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants