Skip to content

fix: address concurrency races in stats aggregation#16

Merged
mjfelis merged 1 commit into
mainfrom
fix/stats-concurrency
Jun 1, 2026
Merged

fix: address concurrency races in stats aggregation#16
mjfelis merged 1 commit into
mainfrom
fix/stats-concurrency

Conversation

@mjfelis

@mjfelis mjfelis commented Jun 1, 2026

Copy link
Copy Markdown
Member

Summary

Fixes several concurrency / correctness issues found during a review of the stat ingest + aggregation path. Scoped to the low-risk, high-value set; compiles cleanly (./mvnw clean compile).

Fixes

1. Lost-update race in StatBuilderService.getStat (data loss)

When two threads created the same statClass/entity/granularity/ts bucket concurrently, the race loser returned its own orphaned Stat instance (never put in the cache) instead of the winner already in the map. Its increments were silently dropped at flush. Also the } { block was an unconditional block, not an else, so the "Not CREATED" warning always fired. Now the loser reuses the cached raceStat.

2. Startup ordering NPE race in StatController.onStart

setStartedService(true) ran synchronously while statService.init(tz) (which allocates the static stats map) ran asynchronously inside startPurgeTask(). A consumer could observe startedService==true and call getStat() while stats was still null → NPE. onStart now resolves the timezone and calls init(tz) before setStartedService(true). init() is now synchronized and idempotent (only allocates if null) so the later call from the purge task is harmless.

3. NumberFormat thread-safety in StatReaderService

NumberFormat is not thread-safe and was a shared static used by concurrent /stats/view and /stats/compare requests. Replaced with a ThreadLocal<NumberFormat>.

4. Missing volatile on cross-thread lifecycle flags

startedService (StatBuilderService), startedPurge and finished (StatController) are written/read across threads outside synchronized blocks; finished is spun on during shutdown. Marked volatile to guarantee visibility (fixes a potential shutdown hang).

Not included (tracked for a separate, more careful change)

  • Shared EntityManager used across consumer/flush/reader threads.
  • Flush vs increment lock-identity desync via removeLockObj.
  • flushStats(statClass, entityId) not holding the per-stat lock.
  • Null stat NPE swallowed in flushStats.
  • StatServerResource.computeGranularity dereferencing from/to before the null checks (returns 500 instead of 400).

Test plan

  • ./mvnw clean compile passes.
  • No behavior change to the public REST contract.

- getStat: use cached race winner instead of orphaned instance (fixes silent counter loss); correct the broken if/else block
- startup: init timezone + stats map before setStartedService(true) to avoid NPE race; make init() idempotent and synchronized
- StatReaderService: use ThreadLocal NumberFormat (not thread-safe across concurrent REST threads)
- mark lifecycle flags volatile (startedService, startedPurge, finished)
@mjfelis mjfelis merged commit d440e6e into main Jun 1, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant