
Feat/hll dict size optimization clean #18144

Open
deeppatel710 wants to merge 8 commits into apache:master from deeppatel710:feat/hll-dict-size-optimization-clean

Conversation

@deeppatel710
Contributor

Summary

Fixes the performance bottleneck in DISTINCTCOUNTHLL for dictionary-encoded columns with high cardinality (reported in #17336).

DISTINCTCOUNTHLL currently always accumulates dictionary IDs into a RoaringBitmap before converting to HLL at finalization. For high-cardinality columns
(14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.

This adds an optional third argument dictSizeThreshold (default: 100,000). When dictionary.length() > dictSizeThreshold, dictionary values are offered
directly to the HyperLogLog, bypassing the bitmap entirely. Since DISTINCTCOUNTHLL already returns an approximate result, exact bitmap deduplication is
unnecessary — HLL handles duplicate offers gracefully.

The same root cause was fixed for DISTINCT_COUNT_SMART_HLL in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR
applies the equivalent optimization to the more commonly used DISTINCTCOUNTHLL function.
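
The "HLL handles duplicate offers gracefully" claim follows from HLL's max-register update rule. A toy sketch of that rule (illustrative only, not Pinot's actual HyperLogLog implementation):

```java
import java.util.Arrays;

public class MaxRegisterDemo {
    // Toy HLL-style register update: bucket from the hash's low bits, rank from
    // leading zeros. The only state change is taking a max, so re-offering a
    // value can never move any register.
    static void offer(byte[] registers, long hash) {
        int bucket = (int) (hash & (registers.length - 1));
        int rank = Long.numberOfLeadingZeros(hash) + 1;
        registers[bucket] = (byte) Math.max(registers[bucket], rank);
    }

    static byte[] offerAll(int copies, int numValues) {
        byte[] registers = new byte[16];
        for (long v = 1; v <= numValues; v++) {
            for (int c = 0; c < copies; c++) {
                offer(registers, v * 0x9E3779B97F4A7C15L); // cheap mix as a stand-in hash
            }
        }
        return registers;
    }

    public static void main(String[] args) {
        // Offering each value once vs. three times yields identical registers.
        System.out.println(Arrays.equals(offerAll(1, 1000), offerAll(3, 1000))); // prints true
    }
}
```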

Usage

DISTINCTCOUNTHLL(col)             -- default threshold (100K)
DISTINCTCOUNTHLL(col, 12)         -- custom log2m, default threshold
DISTINCTCOUNTHLL(col, 12, 50000)  -- custom log2m and threshold
DISTINCTCOUNTHLL(col, 12, 0)      -- disable optimization (threshold = Integer.MAX_VALUE)

Changes

- DistinctCountHLLAggregationFunction: added DEFAULT_DICT_SIZE_THRESHOLD = 100_000, a _dictSizeThreshold field, an optional 3rd constructor arg, and a direct-HLL path in all 6 aggregation paths (non-group-by SV/MV, group-by SV/MV with SV/MV group keys)
- Tests: added 4 new test cases covering parameter defaults, custom threshold, disabled optimization, and correctness of the direct-HLL path vs the bitmap path

Test plan

- All existing HLL tests pass
- New tests verify parameter parsing and approximate correctness of the direct-HLL path
- Manual benchmark against a high-cardinality column (expected ~4-10x speedup, matching #17411 results)

@deeppatel710
Contributor Author

@Jackie-Jiang can you take a look at this optimization PR and leave feedback, please? Thanks

Contributor

@xiangfu0 xiangfu0 left a comment


Found one high-signal correctness issue; see inline comment.

if (dictionary != null) {
  int[] dictIds = blockValSet.getDictionaryIdsSV();
  getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, length);
  if (dictionary.length() > _dictSizeThreshold) {
Contributor


This dictionary-size branch is not safe because the same aggregation result holder is reused across all scanned segments. If one segment stores a DictIdsWrapper and a later segment crosses the threshold, the next getHyperLogLog() call will cast the existing holder state to the wrong type and fail with ClassCastException. DISTINCTCOUNTHLL needs a stable holder representation here, or an explicit conversion when switching between bitmap-backed and direct-HLL aggregation.

Contributor Author

Thanks for the review, @xiangfu0! Great catch on the type-safety concern.

You're right that mixing the bitmap and HLL paths in the same result holder is unsafe. To be precise about when this can bite: within a single consuming (realtime) segment, the dictionary grows as new data is ingested. If one doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a subsequent batch on the same scan sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper to HyperLogLog and throw a ClassCastException.

The fix centralizes the defense in both getHyperLogLog() helpers — they now check instanceof DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 aggregation paths since they all funnel through these two methods. Added a regression test that simulates the dictionary-growth scenario directly.
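
A rough sketch of that conversion guard, using plain java.util stand-ins (the names and shapes here are illustrative; Pinot's AggregationResultHolder, DictIdsWrapper, and HyperLogLog types differ):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Stand-ins for Pinot's holder types; a Set<Integer> plays the role of the HLL
// so the sketch stays self-contained.
class HolderDemo {
    @SuppressWarnings("unchecked")
    static Set<Integer> getHyperLogLog(Object[] holder, int[] dictValues) {
        Object result = holder[0];
        if (result instanceof BitSet) {
            // Earlier blocks stored dict IDs in a bitmap wrapper; convert it
            // here, before any caller can cast the holder to the HLL type.
            BitSet dictIds = (BitSet) result;
            Set<Integer> hll = new HashSet<>();
            for (int i = dictIds.nextSetBit(0); i >= 0; i = dictIds.nextSetBit(i + 1)) {
                hll.add(dictValues[i]); // offer each deduplicated value once
            }
            holder[0] = hll;
        } else if (result == null) {
            holder[0] = new HashSet<Integer>();
        }
        return (Set<Integer>) holder[0];
    }
}
```

With a guard like this, the holder type a later block observes is always the HLL type, regardless of which path earlier blocks took.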

@Jackie-Jiang Jackie-Jiang added the enhancement Improvement to existing functionality label Apr 11, 2026
@codecov-commenter

codecov-commenter commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 86.04651% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.50%. Comparing base (3eff094) to head (3eca00d).

Files with missing lines Patch % Lines
.../function/DistinctCountHLLAggregationFunction.java 86.04% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18144      +/-   ##
============================================
+ Coverage     63.48%   63.50%   +0.01%     
  Complexity     1627     1627              
============================================
  Files          3244     3244              
  Lines        197365   197386      +21     
  Branches      30540    30544       +4     
============================================
+ Hits         125306   125358      +52     
+ Misses        62019    61992      -27     
+ Partials      10040    10036       -4     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.46% <86.04%> (+<0.01%) ⬆️
java-21 63.48% <86.04%> (+0.01%) ⬆️
temurin 63.50% <86.04%> (+0.01%) ⬆️
unittests 63.50% <86.04%> (+0.01%) ⬆️
unittests1 55.49% <86.04%> (+0.03%) ⬆️
unittests2 34.97% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment


If the bottleneck is the poor performance of RoaringBitmap insertion, should we switch to a BitSet solution instead? Even at 1M cardinality, BitSet memory overhead is negligible (128 KB). I believe the performance should be much better than directly aggregating into the HLL.

@deeppatel710
Contributor Author

@Jackie-Jiang
Thanks for the suggestion to switch to BitSet — after thinking it through more carefully, I agree it's the right call.

The root cause of the RoaringBitmap slowdown for high-cardinality dicts isn't just insertion count — it's the container-type transition that happens at ~4096 entries (ArrayContainer → BitmapContainer), which involves an O(n) copy of all existing entries. This makes random high-cardinality insertions particularly painful.

BitSet sidesteps this entirely: O(1) set() with no container management, and memory is a flat dictSize / 8 bytes (128 KB for a 1M-entry dict as you noted — negligible).

Also, this allows us to remove the dictSizeThreshold complexity from the previous iteration entirely. The threshold was meant to avoid bitmap overhead by falling back to direct-HLL for large dicts, but since HLL uses max-register semantics, offering the same hash twice has no effect: BitSet deduplication and direct-HLL insertion produce identical accuracy. The optimization is therefore always worthwhile, with no threshold knob needed.

Updated in the latest commit:

  • DictIdsWrapper now uses java.util.BitSet instead of RoaringBitmap
  • Function signature reverts to 1–2 arguments (column + optional log2m)
  • All threshold branches and the 3-arg constructor removed
  • Tests updated: removed threshold/bypass tests, added BitSet deduplication correctness, large-dict accuracy, and multi-batch accumulation tests
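
A minimal sketch of that BitSet path (plain java.util types; the "HLL" here is just a set so the sketch stays self-contained, and the method names mirror but are not Pinot's actual helpers):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

class BitSetDedupDemo {
    // Deduplicate dictionary IDs for one block: O(1) per set() call, no
    // container-type transitions, flat dictSize/8 bytes of memory.
    static void aggregateBlock(BitSet dictIdBitSet, int[] dictIds, int length) {
        for (int i = 0; i < length; i++) {
            dictIdBitSet.set(dictIds[i]);
        }
    }

    // At finalization, walk the set bits once via nextSetBit and offer each
    // distinct dictionary value exactly once.
    static Set<String> convertToHyperLogLog(BitSet dictIdBitSet, String[] dictValues) {
        Set<String> hll = new HashSet<>();
        for (int id = dictIdBitSet.nextSetBit(0); id >= 0; id = dictIdBitSet.nextSetBit(id + 1)) {
            hll.add(dictValues[id]);
        }
        return hll;
    }
}
```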

The Pinot Binary Compatibility Check failure is pre-existing on master and affects all open PRs currently (e.g. #18189). It's related to ColumnStatistics/MapColumnStatistics changes in pinot-segment-spi, which this PR doesn't touch.

Deep Patel and others added 7 commits April 19, 2026 19:10
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values),
DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap
before converting to HLL at finalization. This mirrors the performance issue originally
reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000).
When the dictionary size exceeds the threshold, dictionary values are offered directly
to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL
already produces an approximate result, bitmap deduplication is not needed for correctness
in high-cardinality scenarios — HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
  DISTINCTCOUNTHLL(col)               -- default threshold (100K)
  DISTINCTCOUNTHLL(col, 12)           -- custom log2m, default threshold
  DISTINCTCOUNTHLL(col, 12, 50000)    -- custom log2m and threshold
  DISTINCTCOUNTHLL(col, 12, 0)        -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the
benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n DISTINCTCOUNTHLL

In consuming (realtime) segments, the dictionary grows during ingestion. If a
prior block used the bitmap path (dict size below threshold) and a later block
sees the grown dictionary above the threshold, getHyperLogLog() would cast the
existing DictIdsWrapper result to HyperLogLog and throw ClassCastException.

Fix: check instanceof DictIdsWrapper in both getHyperLogLog() helpers and
convert to HyperLogLog before returning, so the holder type always stays
consistent regardless of when the threshold is crossed.

Also adds a regression test that simulates this exact scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch DictIdsWrapper from RoaringBitmap to java.util.BitSet for
deduplicating dictionary IDs before offering to HyperLogLog.

BitSet gives O(1) set operations with no container-type transition
overhead (RoaringBitmap transitions ArrayContainer→BitmapContainer at
~4096 entries, causing O(n) copies for random high-cardinality inserts).
Memory overhead is dictSize/8 bytes (e.g. 128 KB for a 1M-entry dict),
which is negligible compared to the segment data already in memory.

This also removes the _dictSizeThreshold complexity introduced in the
previous iteration. The threshold was added to avoid the RoaringBitmap
overhead for large dicts by falling back to direct-HLL, but since HLL
is idempotent (max-register semantics), deduplication via BitSet and
direct-HLL insertion produce identical accuracy. With BitSet, the
optimization is always beneficial and needs no threshold knob.

Changes:
- DictIdsWrapper now holds a BitSet instead of a RoaringBitmap
- Removed _dictSizeThreshold, getDictSizeThreshold(), and the 3-arg
  constructor; function signature reverts to 1 or 2 arguments
- Removed all threshold-branch logic from 6 aggregation methods
- getDictIdBitmap() helper renamed to getDictIdBitSet()
- convertToHyperLogLog() uses BitSet.nextSetBit() iteration
- Updated tests: removed threshold/bypass tests, added BitSet
  deduplication correctness, large-dict, and multi-batch accumulation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…of per-group BitSet

BitSet deduplication in aggregateSVGroupBySV/aggregateMVGroupBySV/aggregateSVGroupByMV/aggregateMVGroupByMV
was allocating one BitSet(dictSize/8 bytes) per group. With 100K groups × 3M-entry dict → 37.5 GB → OOM.

HyperLogLog uses max-register semantics so duplicate offer() calls are no-ops — deduplication before HLL is
unnecessary for accuracy. For group-by, offers values directly to HLL. BitSet deduplication is kept only for
the non-group-by aggregation path (one BitSet per segment, bounded and cheap).

Also removes the now-unused getDictIdBitSet(GroupByResultHolder, int, Dictionary) helper and simplifies
extractGroupByResult to remove the DictIdsWrapper branch that can no longer occur.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@deeppatel710 deeppatel710 force-pushed the feat/hll-dict-size-optimization-clean branch from 271b095 to 86fd8d7 on April 20, 2026 02:12
…erLogLog per group

HyperLogLog(log2m=14) pre-allocates ~16KB of registers upfront per group, regardless of how many
distinct values that group actually sees. With many groups (e.g. numGroupsLimit=MAX_INT) and a
high-cardinality group-by key, this causes OOM before the CPU kill threshold is reached.

Restore the group-by dict-encoded path to use a RoaringBitmap (via new GroupByDictIdsWrapper),
which is sparse: memory is proportional to the number of distinct dict IDs seen per group (~2 bytes
each), not to 2^log2m. At extraction time, convert to HyperLogLog by iterating the bitmap once.

The BitSet optimization (DictIdsWrapper) is preserved for the non-group-by aggregation path, where
a single BitSet per segment is bounded and cheap (dictSize/8 bytes regardless of numGroups).

Summary of design:
- Non-group-by aggregation: ONE BitSet across full dict (fast, bounded)
- Group-by: ONE RoaringBitmap per group (sparse, scales with cardinality)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
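
The memory trade-off behind this split can be sanity-checked with quick arithmetic; the group and dictionary sizes below come from the commit messages above, while distinctPerGroup is an assumed workload figure:

```java
public class MemoryMathDemo {
    public static void main(String[] args) {
        long numGroups = 100_000L;
        long dictSize = 3_000_000L;

        // Dense BitSet per group: dictSize/8 bytes each, regardless of values seen.
        long bitSetPerGroupBytes = dictSize / 8;                 // 375 KB
        long bitSetTotalBytes = numGroups * bitSetPerGroupBytes; // 37.5 GB -> OOM

        // HyperLogLog(log2m=14) per group: 2^14 registers pre-allocated upfront.
        long hllPerGroupBytes = 1L << 14;                        // ~16 KB
        long hllTotalBytes = numGroups * hllPerGroupBytes;       // ~1.6 GB, still heavy

        // Sparse bitmap per group: ~2 bytes per distinct dict ID actually seen.
        long distinctPerGroup = 1_000L;                          // assumed workload
        long sparseTotalBytes = numGroups * distinctPerGroup * 2; // ~200 MB

        System.out.println(bitSetTotalBytes + " " + hllTotalBytes + " " + sparseTotalBytes);
    }
}
```

Only the sparse per-group representation keeps memory proportional to the data each group actually sees, which is why the group-by path keeps a bitmap while the non-group-by path keeps its single BitSet.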

Labels

enhancement Improvement to existing functionality
