Fixes #39277 - Fix TopbarSweeper thread-safety race condition under Puma by pablomh · Pull Request #10969 · theforeman/foreman

pablomh · 2026-04-29T17:46:23Z

TopbarSweeper uses include Singleton with a shared controller attribute
across all Puma threads within each worker. Under concurrent requests, one
thread ensure block nils the controller while another thread is actively
using it:

Thread A                           Thread B
set_topbar_sweeper_controller:
  controller = self
  yield -> action runs...
                                   set_topbar_sweeper_controller:
                                     controller = self
                                     yield -> action runs...
                                   ensure:
                                     controller = nil  <- clobbers Thread A
  set_taxonomy -> expire_cache:
    controller.expire_fragment(...)
    -> NoMethodError: undefined method expire_fragment for nil:NilClass

The existing if controller.present? guard has a TOCTOU (time-of-check
to time-of-use) race: the check passes, then another thread nils the
value before expire_fragment is called.

This bug is hard to reproduce under typical load because the Singleton is
per-process (each Puma worker has its own), so only the threads within a
single worker compete. With the default 5 threads per worker, the race
window is extremely narrow. It becomes observable under high concurrency
(many simultaneous registrations) or with increased threads per worker,
where more threads share the same Singleton and the interleaving becomes
likely. When it does trigger, it produces a single 500 error that is
easily dismissed as transient noise in production logs.

Discovered during concurrent host registration performance testing.
The error appears in production.log at topbar_sweeper.rb:6.

Changes

Replace the Singleton with Thread.current storage, following the same
pattern already used by ThreadSession for User.current,
Organization.current, and Location.current. Each Puma thread gets
its own controller reference - no cross-thread clobbering.

app/services/topbar_sweeper.rb - Remove include Singleton, store
controller in Thread.current[:topbar_sweeper_controller], use safe
navigation (&.) instead of TOCTOU if controller.present? guard
app/controllers/concerns/foreman/controller/topbar_sweeper.rb -
.instance.controller -> .controller
app/models/concerns/topbar_cache_expiry.rb - same
test/unit/topbar_sweeper_test.rb - NEW: thread isolation and nil
safety tests

How to test

Prerequisites

Foreman environment with Puma in multi-threaded mode (default).

Steps

Run the new unit test:
bundle exec rake test TEST=test/unit/topbar_sweeper_test.rb
Run the existing integration test:
bundle exec rake test TEST=test/integration/top_bar_test.rb
To reproduce the original race condition: register many hosts
concurrently and grep for expire_fragment errors in production.log
- should no longer appear.

chris1984

Couple small comments, will let someone else look it over as well. @ShimShtein

TopbarSweeper uses a Singleton with a shared controller attribute across all Puma threads. Under concurrent requests, one thread ensure block nils the controller while another thread is still using it, causing NoMethodError on expire_fragment. Replace the Singleton with Thread.current storage, following the same pattern used by ThreadSession for User.current/Organization.current. Each thread gets its own controller reference - no cross-thread clobbering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chris1984

Thanks for adding the things I suggested and the test. Will let @ShimShtein give a final look.

ShimShtein · 2026-05-05T09:30:01Z

Overall the PR looks good. If I understand correctly, you have moved the controller cache handling into each thread.
My only question is: since there are much more threads than processes, did you notice a memory consumption degradation? Just from the top of my head, we are creating a reference that will exist to the whole lifetime of the thread, and I am not sure what object tree spans from that reference (nothing will be GC'ed under this tree). Such long-lived objects can create a certain strain on the GC (the objects become quite permanent in memory, not allowing the GC to properly reuse the pages.

pablomh · 2026-05-05T16:49:40Z

The Thread.current reference has the same lifecycle as the old Singleton attribute — set at request start (around_action), nil'd in ensure at request end. Nothing persists between requests. This is the same pattern ThreadSession uses for User.current and Organization.current.

We initially thought this was only observable under extreme concurrency, but looking more closely at the data, that's not quite right. We found it during concurrent host registration testing on a 16-CPU Satellite. At the default Puma config (24 workers × 5 threads = 120 slots), the race triggered at 760+ concurrent hosts. When we increased to 16 workers × 12 threads = 192 slots, it triggered at just 456 concurrent — more threads per worker widens the race window.

What's interesting is why this hasn't been seen before, given the Puma config hasn't changed. We think it's because our other registration performance patches (fact import reduction, Candlepin caching) made Puma significantly more efficient — requests complete faster, so threads cycle through the around_action ensure block more frequently, and more requests succeed and reach the finalize_registration → set_taxonomy → expire_cache code path where the race lives. The old inefficiency was accidentally masking the bug: requests were too slow and too many timed out to ever trigger the collision often enough to notice.

Regarding memory: comparing Puma RSS across multiple runs with and without this fix, we see a ~0.7 GB Puma RSS increase — but that's the combined effect of all patches in the batch (caching layers, connection pooling, this fix). The Thread.current change itself adds one pointer per thread that's nil between requests, so its contribution is negligible.

pr-processor Bot added the Not yet reviewed label Apr 29, 2026

pablomh force-pushed the fix/topbar-sweeper-thread-safety branch from f0b2351 to e64332d Compare April 29, 2026 17:54

chris1984 reviewed May 4, 2026

View reviewed changes

Comment thread test/unit/topbar_sweeper_test.rb

Comment thread app/models/concerns/topbar_cache_expiry.rb Outdated

pr-processor Bot removed the Not yet reviewed label May 4, 2026

pablomh force-pushed the fix/topbar-sweeper-thread-safety branch from e64332d to 1ccd744 Compare May 4, 2026 20:33

chris1984 approved these changes May 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #39277 - Fix TopbarSweeper thread-safety race condition under Puma#10969

Fixes #39277 - Fix TopbarSweeper thread-safety race condition under Puma#10969
pablomh wants to merge 1 commit intotheforeman:developfrom
pablomh:fix/topbar-sweeper-thread-safety

pablomh commented Apr 29, 2026

Uh oh!

chris1984 left a comment

Uh oh!

Uh oh!

Uh oh!

chris1984 left a comment

Uh oh!

ShimShtein commented May 5, 2026

Uh oh!

pablomh commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pablomh commented Apr 29, 2026

Changes

How to test

Prerequisites

Steps

Uh oh!

chris1984 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

chris1984 left a comment

Choose a reason for hiding this comment

Uh oh!

ShimShtein commented May 5, 2026

Uh oh!

pablomh commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants