Feat/hll dict size optimization clean #18144
deeppatel710 wants to merge 8 commits into apache:master from
Conversation
@Jackie-Jiang can you take a look at this optimization PR and leave feedback please? Thanks
xiangfu0
left a comment
Found one high-signal correctness issue; see inline comment.
if (dictionary != null) {
  int[] dictIds = blockValSet.getDictionaryIdsSV();
  getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, length);
  if (dictionary.length() > _dictSizeThreshold) {
This dictionary-size branch is not safe because the same aggregation result holder is reused across all scanned segments. If one segment stores a DictIdsWrapper and a later segment crosses the threshold, the next getHyperLogLog() call will cast the existing holder state to the wrong type and fail with ClassCastException. DISTINCTCOUNTHLL needs a stable holder representation here, or an explicit conversion when switching between bitmap-backed and direct-HLL aggregation.
Thanks for the review, @xiangfu0! Great catch on the type-safety concern.
You're right that mixing the bitmap and HLL paths in the same result holder is unsafe. To be precise about when this can bite: within a single consuming (realtime) segment, the dictionary grows as new data is ingested. If one doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a subsequent batch on the same scan sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper to HyperLogLog and throw a ClassCastException.
The fix centralizes the defense in both getHyperLogLog() helpers — they now check instanceof DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 aggregation paths since they all funnel through these two methods. Added a regression test that simulates the dictionary-growth scenario directly.
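To illustrate the shape of that defensive check, here is a minimal, self-contained sketch. The `DictIdsWrapper` and `Hll` classes below are simplified stand-ins, not Pinot's actual types, and `getHyperLogLog()` is a toy version of the helper described above: it converts a stale wrapper in place instead of blindly casting.

```java
import java.util.BitSet;

public class HolderConversionDemo {
  // Stand-in for Pinot's DictIdsWrapper: accumulates dictionary ids.
  static final class DictIdsWrapper {
    final BitSet dictIds = new BitSet();
  }

  // Stand-in for HyperLogLog: records offered ids exactly, for the demo only.
  static final class Hll {
    final BitSet offered = new BitSet();
    void offer(int id) { offered.set(id); }
    int cardinality() { return offered.cardinality(); }
  }

  // The shared aggregation result, reused across blocks of the same scan.
  static Object result;

  // Defensive helper: convert an existing DictIdsWrapper instead of casting.
  static Hll getHyperLogLog() {
    if (result instanceof DictIdsWrapper) {
      Hll hll = new Hll();
      BitSet ids = ((DictIdsWrapper) result).dictIds;
      for (int i = ids.nextSetBit(0); i >= 0; i = ids.nextSetBit(i + 1)) {
        hll.offer(i);
      }
      result = hll; // holder type is now stable for all later blocks
    }
    return (Hll) result;
  }

  public static void main(String[] args) {
    DictIdsWrapper wrapper = new DictIdsWrapper();
    wrapper.dictIds.set(1);
    wrapper.dictIds.set(5);
    result = wrapper;             // first block used the bitmap path
    Hll hll = getHyperLogLog();   // later block crosses the threshold
    if (hll.cardinality() != 2) {
      throw new AssertionError("expected 2 distinct ids after conversion");
    }
    System.out.println("converted without ClassCastException");
  }
}
```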
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18144 +/- ##
============================================
+ Coverage 63.48% 63.50% +0.01%
Complexity 1627 1627
============================================
Files 3244 3244
Lines 197365 197386 +21
Branches 30540 30544 +4
============================================
+ Hits 125306 125358 +52
+ Misses 62019 61992 -27
+ Partials 10040 10036 -4
Jackie-Jiang
left a comment
If the bottleneck is the poor performance of RoaringBitmap insertion, should we switch to a BitSet solution? Even with 1M cardinality, BitSet memory overhead is negligible (128 KB). I believe the performance should be much better than directly aggregating into the HLL.
@Jackie-Jiang The root cause of the RoaringBitmap slowdown for high-cardinality dicts isn't just the insertion count: it's the container-type transition that happens at ~4096 entries (ArrayContainer to BitmapContainer), which involves an O(n) copy of all existing entries. This makes random high-cardinality insertions particularly painful. BitSet sidesteps this entirely: O(1) set() with no container management, and memory is a flat dictSize / 8 bytes (128 KB for a 1M-entry dict as you noted, which is negligible).

I also realized this lets us remove the dictSizeThreshold complexity from the previous iteration entirely. The threshold was meant to avoid bitmap overhead by falling back to direct-HLL for large dicts, but since HLL uses max-register semantics, offering the same hash twice has no effect. BitSet deduplication and direct-HLL insertion produce identical accuracy, so the optimization is always worthwhile with no accuracy trade-off. Updated in the latest commit.
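The "duplicate offers have no effect" claim can be seen in a toy version of the HLL register update rule. This is a minimal sketch, not the real HyperLogLog implementation: a real HLL hashes the input and derives the rank from the hash bits, but the max-register update below is the property that makes duplicate offers no-ops.

```java
import java.util.Arrays;

public class MaxRegisterDemo {
  // Toy HLL registers: 2^LOG2M buckets, each keeping a max "rank".
  static final int LOG2M = 4;
  static final byte[] REGISTERS = new byte[1 << LOG2M];

  // Simplified update rule: low bits pick the bucket, remaining bits
  // give a rank; the register only ever keeps the maximum rank seen.
  static void offerHash(long hash) {
    int bucket = (int) (hash & ((1 << LOG2M) - 1));
    byte rank = (byte) (Long.numberOfTrailingZeros(hash >>> LOG2M) + 1);
    if (rank > REGISTERS[bucket]) {
      REGISTERS[bucket] = rank; // max-register semantics
    }
  }

  public static void main(String[] args) {
    long[] hashes = {0x1234L, 0xBEEFL, 0xCAFEL};
    for (long h : hashes) offerHash(h);
    byte[] snapshot = REGISTERS.clone();
    for (long h : hashes) offerHash(h); // offer the same hashes again
    if (!Arrays.equals(snapshot, REGISTERS)) {
      throw new AssertionError("duplicate offers changed the registers");
    }
    System.out.println("duplicate offers are no-ops");
  }
}
```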
The Pinot Binary Compatibility Check failure is pre-existing on master and currently affects all open PRs (e.g. #18189). It's related to ColumnStatistics/MapColumnStatistics changes in pinot-segment-spi, which this PR doesn't touch.
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values), DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap before converting to HLL at finalization. This mirrors the performance issue originally reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000). When the dictionary size exceeds the threshold, dictionary values are offered directly to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL already produces an approximate result, bitmap deduplication is not needed for correctness in high-cardinality scenarios; HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
DISTINCTCOUNTHLL(col)            -- default threshold (100K)
DISTINCTCOUNTHLL(col, 12)        -- custom log2m, default threshold
DISTINCTCOUNTHLL(col, 12, 50000) -- custom log2m and threshold
DISTINCTCOUNTHLL(col, 12, 0)     -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n DISTINCTCOUNTHLL

In consuming (realtime) segments, the dictionary grows during ingestion. If a prior block used the bitmap path (dict size below threshold) and a later block sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper result to HyperLogLog and throw ClassCastException.

Fix: check instanceof DictIdsWrapper in both getHyperLogLog() helpers and convert to HyperLogLog before returning, so the holder type always stays consistent regardless of when the threshold is crossed. Also adds a regression test that simulates this exact scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch DictIdsWrapper from RoaringBitmap to java.util.BitSet for deduplicating dictionary IDs before offering to HyperLogLog. BitSet gives O(1) set operations with no container-type transition overhead (RoaringBitmap transitions ArrayContainer→BitmapContainer at ~4096 entries, causing O(n) copies for random high-cardinality inserts). Memory overhead is dictSize/8 bytes (e.g. 128 KB for a 1M-entry dict), which is negligible compared to the segment data already in memory.

This also removes the _dictSizeThreshold complexity introduced in the previous iteration. The threshold was added to avoid the RoaringBitmap overhead for large dicts by falling back to direct-HLL, but since HLL is idempotent (max-register semantics), deduplication via BitSet and direct-HLL insertion produce identical accuracy. With BitSet, the optimization is always beneficial and needs no threshold knob.

Changes:
- DictIdsWrapper now holds a BitSet instead of a RoaringBitmap
- Removed _dictSizeThreshold, getDictSizeThreshold(), and the 3-arg constructor; function signature reverts to 1 or 2 arguments
- Removed all threshold-branch logic from 6 aggregation methods
- getDictIdBitmap() helper renamed to getDictIdBitSet()
- convertToHyperLogLog() uses BitSet.nextSetBit() iteration
- Updated tests: removed threshold/bypass tests, added BitSet deduplication correctness, large-dict, and multi-batch accumulation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
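The dedup-then-convert pattern in this commit can be sketched with plain java.util.BitSet. The dict-id values below are made up for illustration; in Pinot the loop body of the second pass would look up each dictionary value and offer it to the HLL, but here we only count the distinct ids.

```java
import java.util.BitSet;

public class BitSetDedupDemo {
  public static void main(String[] args) {
    // Hypothetical dict-id stream with duplicates, as scanned from a block.
    int[] dictIds = {3, 7, 3, 42, 7, 7};

    // One flat BitSet across the dictionary: set() is O(1) with no
    // container-type transitions, unlike RoaringBitmap's switch from
    // ArrayContainer to BitmapContainer at ~4096 entries.
    BitSet seen = new BitSet();
    for (int id : dictIds) {
      seen.set(id);
    }

    // convertToHyperLogLog-style pass: nextSetBit() visits each distinct
    // dict id exactly once, so each value would be offered once.
    int distinct = 0;
    for (int i = seen.nextSetBit(0); i >= 0; i = seen.nextSetBit(i + 1)) {
      distinct++;
    }
    if (distinct != 3) {
      throw new AssertionError("expected 3 distinct dict ids");
    }
    System.out.println("distinct dict ids: " + distinct);
  }
}
```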
…of per-group BitSet

BitSet deduplication in aggregateSVGroupBySV/aggregateMVGroupBySV/aggregateSVGroupByMV/aggregateMVGroupByMV was allocating one BitSet (dictSize/8 bytes) per group. With 100K groups × 3M-entry dict → 37.5 GB → OOM.

HyperLogLog uses max-register semantics, so duplicate offer() calls are no-ops; deduplication before HLL is unnecessary for accuracy. For group-by, offer values directly to HLL. BitSet deduplication is kept only for the non-group-by aggregation path (one BitSet per segment, bounded and cheap).

Also removes the now-unused getDictIdBitSet(GroupByResultHolder, int, Dictionary) helper and simplifies extractGroupByResult to remove the DictIdsWrapper branch that can no longer occur.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
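The 37.5 GB figure follows directly from the per-group BitSet sizing; a quick arithmetic check (using the commit's own numbers):

```java
public class PerGroupMemoryDemo {
  public static void main(String[] args) {
    long dictSize = 3_000_000L;               // 3M-entry dictionary
    long numGroups = 100_000L;                // 100K groups
    long bytesPerGroupBitSet = dictSize / 8;  // one bit per dict id -> 375 KB
    long totalBytes = bytesPerGroupBitSet * numGroups;
    if (totalBytes != 37_500_000_000L) {
      throw new AssertionError("expected 37.5 GB");
    }
    System.out.printf("per-group BitSet total: %.1f GB%n", totalBytes / 1e9);
  }
}
```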
Force-pushed 271b095 to 86fd8d7
…erLogLog per group

HyperLogLog(log2m=14) pre-allocates ~16KB of registers upfront per group, regardless of how many distinct values that group actually sees. With many groups (e.g. numGroupsLimit=MAX_INT) and a high-cardinality group-by key, this causes OOM before the CPU kill threshold is reached.

Restore the group-by dict-encoded path to use a RoaringBitmap (via new GroupByDictIdsWrapper), which is sparse: memory is proportional to the number of distinct dict IDs seen per group (~2 bytes each), not to 2^log2m. At extraction time, convert to HyperLogLog by iterating the bitmap once.

The BitSet optimization (DictIdsWrapper) is preserved for the non-group-by aggregation path, where a single BitSet per segment is bounded and cheap (dictSize/8 bytes regardless of numGroups).

Summary of design:
- Non-group-by aggregation: ONE BitSet across full dict (fast, bounded)
- Group-by: ONE RoaringBitmap per group (sparse, scales with cardinality)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
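A back-of-the-envelope comparison of the two per-group representations, using the commit's own estimates (~16KB of registers at log2m=14, assuming one byte per register, and ~2 bytes per distinct dict id in a sparse bitmap; the group size of 100 is a made-up example):

```java
public class GroupByTradeoffDemo {
  public static void main(String[] args) {
    int log2m = 14;
    int hllRegisterBytes = 1 << log2m;   // 2^14 registers pre-allocated
    if (hllRegisterBytes != 16_384) {    // ~16 KB per group, even if empty
      throw new AssertionError();
    }

    // Sparse bitmap estimate: ~2 bytes per distinct dict id actually seen.
    int distinctIdsInGroup = 100;                      // hypothetical group
    int bitmapBytesEstimate = distinctIdsInGroup * 2;  // ~200 bytes

    // A small group pays only for what it sees, not 2^log2m upfront.
    int ratio = hllRegisterBytes / bitmapBytesEstimate;
    if (ratio < 80) {
      throw new AssertionError("expected ~80x savings for small groups");
    }
    System.out.println("HLL vs sparse bitmap per-group ratio: ~" + ratio + "x");
  }
}
```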
Summary

Fixes the performance bottleneck in DISTINCTCOUNTHLL for dictionary-encoded columns with high cardinality (reported in #17336). DISTINCTCOUNTHLL currently always accumulates dictionary IDs into a RoaringBitmap before converting to HLL at finalization. For high-cardinality columns (14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.

This adds an optional third argument dictSizeThreshold (default: 100,000). When dictionary.length() > dictSizeThreshold, dictionary values are offered directly to the HyperLogLog, bypassing the bitmap entirely. Since DISTINCTCOUNTHLL already returns an approximate result, exact bitmap deduplication is unnecessary; HLL handles duplicate offers gracefully.

The same root cause was fixed for DISTINCT_COUNT_SMART_HLL in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR applies the equivalent optimization to the more commonly used DISTINCTCOUNTHLL function.

Usage