Feat/hll dict size optimization clean #18144
deeppatel710 wants to merge 8 commits into apache:master from
Conversation
@Jackie-Jiang can you take a look at this optimization PR and leave feedback please? Thanks
xiangfu0
left a comment
Found one high-signal correctness issue; see inline comment.
if (dictionary != null) {
  int[] dictIds = blockValSet.getDictionaryIdsSV();
  getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, length);
  if (dictionary.length() > _dictSizeThreshold) {
This dictionary-size branch is not safe because the same aggregation result holder is reused across all scanned segments. If one segment stores a DictIdsWrapper and a later segment crosses the threshold, the next getHyperLogLog() call will cast the existing holder state to the wrong type and fail with ClassCastException. DISTINCTCOUNTHLL needs a stable holder representation here, or an explicit conversion when switching between bitmap-backed and direct-HLL aggregation.
Thanks for the review, @xiangfu0! Great catch on the type-safety concern.
You're right that mixing the bitmap and HLL paths in the same result holder is unsafe. To be precise about when this can bite: within a single consuming (realtime) segment, the dictionary grows as new data is ingested. If one doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a subsequent batch on the same scan sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper to HyperLogLog and throw a ClassCastException.
The fix centralizes the defense in both getHyperLogLog() helpers — they now check instanceof DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 aggregation paths since they all funnel through these two methods. Added a regression test that simulates the dictionary-growth scenario directly.
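To illustrate the shape of that defensive check, here is a minimal, self-contained sketch. The `DictIdsWrapper` and `Hll` classes below are simplified stand-ins, not Pinot's actual types, and `getHyperLogLog()` is a toy version of the helper described above: it converts a stale wrapper in place instead of blindly casting.

```java
import java.util.BitSet;

public class HolderConversionDemo {
  // Stand-in for Pinot's DictIdsWrapper: accumulates dictionary ids.
  static final class DictIdsWrapper {
    final BitSet dictIds = new BitSet();
  }

  // Stand-in for HyperLogLog: records offered ids exactly, for the demo only.
  static final class Hll {
    final BitSet offered = new BitSet();
    void offer(int id) { offered.set(id); }
    int cardinality() { return offered.cardinality(); }
  }

  // The shared aggregation result, reused across blocks of the same scan.
  static Object result;

  // Defensive helper: convert an existing DictIdsWrapper instead of casting.
  static Hll getHyperLogLog() {
    if (result instanceof DictIdsWrapper) {
      Hll hll = new Hll();
      BitSet ids = ((DictIdsWrapper) result).dictIds;
      for (int i = ids.nextSetBit(0); i >= 0; i = ids.nextSetBit(i + 1)) {
        hll.offer(i);
      }
      result = hll; // holder type is now stable for all later blocks
    }
    return (Hll) result;
  }

  public static void main(String[] args) {
    DictIdsWrapper wrapper = new DictIdsWrapper();
    wrapper.dictIds.set(1);
    wrapper.dictIds.set(5);
    result = wrapper;             // first block used the bitmap path
    Hll hll = getHyperLogLog();   // later block crosses the threshold
    if (hll.cardinality() != 2) {
      throw new AssertionError("expected 2 distinct ids after conversion");
    }
    System.out.println("converted without ClassCastException");
  }
}
```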
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18144 +/- ##
============================================
+ Coverage 63.48% 63.50% +0.01%
Complexity 1627 1627
============================================
Files 3244 3244
Lines 197365 197386 +21
Branches 30540 30544 +4
============================================
+ Hits 125306 125358 +52
+ Misses 62019 61992 -27
+ Partials 10040 10036 -4
Jackie-Jiang
left a comment
If the bottleneck is the poor performance of RoaringBitmap insertion, should we switch to a BitSet solution? Even with 1M cardinality, BitSet memory overhead is negligible (128 KB). I believe the performance should be much better than directly aggregating into the HLL.
@Jackie-Jiang The root cause of the RoaringBitmap slowdown for high-cardinality dicts isn't just the insertion count: it's the container-type transition that happens at ~4096 entries (ArrayContainer to BitmapContainer), which involves an O(n) copy of all existing entries. This makes random high-cardinality insertions particularly painful. BitSet sidesteps this entirely: O(1) set() with no container management, and memory is a flat dictSize / 8 bytes (128 KB for a 1M-entry dict as you noted, which is negligible).

I also realized this lets us remove the dictSizeThreshold complexity from the previous iteration entirely. The threshold was meant to avoid bitmap overhead by falling back to direct-HLL for large dicts, but since HLL uses max-register semantics, offering the same hash twice has no effect. BitSet deduplication and direct-HLL insertion produce identical accuracy, so the optimization is always worthwhile with no accuracy trade-off. Updated in the latest commit.
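The "duplicate offers have no effect" claim can be seen in a toy version of the HLL register update rule. This is a minimal sketch, not the real HyperLogLog implementation: a real HLL hashes the input and derives the rank from the hash bits, but the max-register update below is the property that makes duplicate offers no-ops.

```java
import java.util.Arrays;

public class MaxRegisterDemo {
  // Toy HLL registers: 2^LOG2M buckets, each keeping a max "rank".
  static final int LOG2M = 4;
  static final byte[] REGISTERS = new byte[1 << LOG2M];

  // Simplified update rule: low bits pick the bucket, remaining bits
  // give a rank; the register only ever keeps the maximum rank seen.
  static void offerHash(long hash) {
    int bucket = (int) (hash & ((1 << LOG2M) - 1));
    byte rank = (byte) (Long.numberOfTrailingZeros(hash >>> LOG2M) + 1);
    if (rank > REGISTERS[bucket]) {
      REGISTERS[bucket] = rank; // max-register semantics
    }
  }

  public static void main(String[] args) {
    long[] hashes = {0x1234L, 0xBEEFL, 0xCAFEL};
    for (long h : hashes) offerHash(h);
    byte[] snapshot = REGISTERS.clone();
    for (long h : hashes) offerHash(h); // offer the same hashes again
    if (!Arrays.equals(snapshot, REGISTERS)) {
      throw new AssertionError("duplicate offers changed the registers");
    }
    System.out.println("duplicate offers are no-ops");
  }
}
```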
The Pinot Binary Compatibility Check failure is pre-existing on master and currently affects all open PRs (e.g. #18189). It's related to ColumnStatistics/MapColumnStatistics changes in pinot-segment-spi, which this PR doesn't touch.
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values), DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap before converting to HLL at finalization. This mirrors the performance issue originally reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000). When the dictionary size exceeds the threshold, dictionary values are offered directly to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL already produces an approximate result, bitmap deduplication is not needed for correctness in high-cardinality scenarios; HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
DISTINCTCOUNTHLL(col)            -- default threshold (100K)
DISTINCTCOUNTHLL(col, 12)        -- custom log2m, default threshold
DISTINCTCOUNTHLL(col, 12, 50000) -- custom log2m and threshold
DISTINCTCOUNTHLL(col, 12, 0)     -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n DISTINCTCOUNTHLL

In consuming (realtime) segments, the dictionary grows during ingestion. If a prior block used the bitmap path (dict size below threshold) and a later block sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper result to HyperLogLog and throw ClassCastException.

Fix: check instanceof DictIdsWrapper in both getHyperLogLog() helpers and convert to HyperLogLog before returning, so the holder type always stays consistent regardless of when the threshold is crossed. Also adds a regression test that simulates this exact scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch DictIdsWrapper from RoaringBitmap to java.util.BitSet for deduplicating dictionary IDs before offering to HyperLogLog. BitSet gives O(1) set operations with no container-type transition overhead (RoaringBitmap transitions ArrayContainer→BitmapContainer at ~4096 entries, causing O(n) copies for random high-cardinality inserts). Memory overhead is dictSize/8 bytes (e.g. 128 KB for a 1M-entry dict), which is negligible compared to the segment data already in memory.

This also removes the _dictSizeThreshold complexity introduced in the previous iteration. The threshold was added to avoid the RoaringBitmap overhead for large dicts by falling back to direct-HLL, but since HLL is idempotent (max-register semantics), deduplication via BitSet and direct-HLL insertion produce identical accuracy. With BitSet, the optimization is always beneficial and needs no threshold knob.

Changes:
- DictIdsWrapper now holds a BitSet instead of a RoaringBitmap
- Removed _dictSizeThreshold, getDictSizeThreshold(), and the 3-arg constructor; function signature reverts to 1 or 2 arguments
- Removed all threshold-branch logic from 6 aggregation methods
- getDictIdBitmap() helper renamed to getDictIdBitSet()
- convertToHyperLogLog() uses BitSet.nextSetBit() iteration
- Updated tests: removed threshold/bypass tests, added BitSet deduplication correctness, large-dict, and multi-batch accumulation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
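The dedup-then-convert pattern in this commit can be sketched with plain java.util.BitSet. The dict-id values below are made up for illustration; in Pinot the loop body of the second pass would look up each dictionary value and offer it to the HLL, but here we only count the distinct ids.

```java
import java.util.BitSet;

public class BitSetDedupDemo {
  public static void main(String[] args) {
    // Hypothetical dict-id stream with duplicates, as scanned from a block.
    int[] dictIds = {3, 7, 3, 42, 7, 7};

    // One flat BitSet across the dictionary: set() is O(1) with no
    // container-type transitions, unlike RoaringBitmap's switch from
    // ArrayContainer to BitmapContainer at ~4096 entries.
    BitSet seen = new BitSet();
    for (int id : dictIds) {
      seen.set(id);
    }

    // convertToHyperLogLog-style pass: nextSetBit() visits each distinct
    // dict id exactly once, so each value would be offered once.
    int distinct = 0;
    for (int i = seen.nextSetBit(0); i >= 0; i = seen.nextSetBit(i + 1)) {
      distinct++;
    }
    if (distinct != 3) {
      throw new AssertionError("expected 3 distinct dict ids");
    }
    System.out.println("distinct dict ids: " + distinct);
  }
}
```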
…of per-group BitSet

BitSet deduplication in aggregateSVGroupBySV/aggregateMVGroupBySV/aggregateSVGroupByMV/aggregateMVGroupByMV was allocating one BitSet (dictSize/8 bytes) per group. With 100K groups × 3M-entry dict → 37.5 GB → OOM.

HyperLogLog uses max-register semantics, so duplicate offer() calls are no-ops; deduplication before HLL is unnecessary for accuracy. For group-by, offer values directly to HLL. BitSet deduplication is kept only for the non-group-by aggregation path (one BitSet per segment, bounded and cheap).

Also removes the now-unused getDictIdBitSet(GroupByResultHolder, int, Dictionary) helper and simplifies extractGroupByResult to remove the DictIdsWrapper branch that can no longer occur.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
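The 37.5 GB figure follows directly from the per-group BitSet sizing; a quick arithmetic check (using the commit's own numbers):

```java
public class PerGroupMemoryDemo {
  public static void main(String[] args) {
    long dictSize = 3_000_000L;               // 3M-entry dictionary
    long numGroups = 100_000L;                // 100K groups
    long bytesPerGroupBitSet = dictSize / 8;  // one bit per dict id -> 375 KB
    long totalBytes = bytesPerGroupBitSet * numGroups;
    if (totalBytes != 37_500_000_000L) {
      throw new AssertionError("expected 37.5 GB");
    }
    System.out.printf("per-group BitSet total: %.1f GB%n", totalBytes / 1e9);
  }
}
```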
Force-pushed 271b095 to 86fd8d7
…erLogLog per group

HyperLogLog(log2m=14) pre-allocates ~16KB of registers upfront per group, regardless of how many distinct values that group actually sees. With many groups (e.g. numGroupsLimit=MAX_INT) and a high-cardinality group-by key, this causes OOM before the CPU kill threshold is reached.

Restore the group-by dict-encoded path to use a RoaringBitmap (via new GroupByDictIdsWrapper), which is sparse: memory is proportional to the number of distinct dict IDs seen per group (~2 bytes each), not to 2^log2m. At extraction time, convert to HyperLogLog by iterating the bitmap once.

The BitSet optimization (DictIdsWrapper) is preserved for the non-group-by aggregation path, where a single BitSet per segment is bounded and cheap (dictSize/8 bytes regardless of numGroups).

Summary of design:
- Non-group-by aggregation: ONE BitSet across full dict (fast, bounded)
- Group-by: ONE RoaringBitmap per group (sparse, scales with cardinality)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
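A back-of-the-envelope comparison of the two per-group representations, using the commit's own estimates (~16KB of registers at log2m=14, assuming one byte per register, and ~2 bytes per distinct dict id in a sparse bitmap; the group size of 100 is a made-up example):

```java
public class GroupByTradeoffDemo {
  public static void main(String[] args) {
    int log2m = 14;
    int hllRegisterBytes = 1 << log2m;   // 2^14 registers pre-allocated
    if (hllRegisterBytes != 16_384) {    // ~16 KB per group, even if empty
      throw new AssertionError();
    }

    // Sparse bitmap estimate: ~2 bytes per distinct dict id actually seen.
    int distinctIdsInGroup = 100;                      // hypothetical group
    int bitmapBytesEstimate = distinctIdsInGroup * 2;  // ~200 bytes

    // A small group pays only for what it sees, not 2^log2m upfront.
    int ratio = hllRegisterBytes / bitmapBytesEstimate;
    if (ratio < 80) {
      throw new AssertionError("expected ~80x savings for small groups");
    }
    System.out.println("HLL vs sparse bitmap per-group ratio: ~" + ratio + "x");
  }
}
```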
Summary

Fixes the performance bottleneck in DISTINCTCOUNTHLL for dictionary-encoded columns with high cardinality (reported in #17336). DISTINCTCOUNTHLL currently always accumulates dictionary IDs into a RoaringBitmap before converting to HLL at finalization. For high-cardinality columns (14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.

This adds an optional third argument dictSizeThreshold (default: 100,000). When dictionary.length() > dictSizeThreshold, dictionary values are offered directly to the HyperLogLog, bypassing the bitmap entirely. Since DISTINCTCOUNTHLL already returns an approximate result, exact bitmap deduplication is unnecessary; HLL handles duplicate offers gracefully.

The same root cause was fixed for DISTINCT_COUNT_SMART_HLL in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR applies the equivalent optimization to the more commonly used DISTINCTCOUNTHLL function.

Usage