SPANN SelectHead: opt-in parallel BKT build#450
Open
fastio wants to merge 2 commits into
Two-phase parallel BKTree build to speed up SelectHead on large data:
- Phase 1 (top): few large nodes use the existing intra-node data-parallel KmeansAssign (all threads).
- Phase 2 (bottom): many small subtrees (<= cutoff) built in parallel, one worker per subtree (per-worker KmeansArgs, _TH=1), sharing the single N-sized label buffer via disjoint [first,last) writes.

Concurrency hazards addressed:
- Utils::rand -> thread_local std::mt19937 (std::rand() was racy under threads).
- m_pTreeRoots structural writes guarded by m_treeLock; clustering is lock-free.
- m_pSampleCenterMap writes guarded by m_sampleMapLock.
- KmeansAssign gains a _TH==1 serial fast-path (no per-node thread spawn).

Gated by [SelectHead] ParallelBuildBKT (default false = original serial path). Not bit-reproducible vs serial (RNG/scheduling/FP order); verify by recall + TSan.
Make the non-TBB ConcurrentQueue/Set/Map fallbacks expose the TBB API surface used by ExtraFileController/ExtraDynamicSearcher (unsafe_size, empty, unsafe_begin/end, value_type, range iteration). Force the LoggerHolder pre-C++20 path so std::atomic<std::shared_ptr<Logger>> never instantiates under libc++. Drop an unused omp.h include and pass .c_str() to a SPTAGLIB_LOG variadic call.
Motivation
For billion-scale datasets, BKT construction in SelectHead dominates index
build time. The existing build is a serial node schedule where only
`KmeansAssign` inside a single node is data-parallel. That works well near the
root (few large nodes) but leaves most cores idle once the tree fans out into
many small subtrees.
Design
Two-phase parallel build, gated by a new `[SelectHead] ParallelBuildBKT` option
(default `false`, i.e. the existing serial path is untouched unless opted in):

- Phase 1 (top): the existing intra-node data-parallel `KmeansAssign` using all
  threads. The few large nodes near the root are exactly where data parallelism
  is effective.
- Phase 2 (bottom): once a subtree is no larger than the cutoff
  (`max(BKTLeafSize * 4, N / (threads * 16) + 1)`), it is deferred and later
  built whole by a single worker. Workers pull deferred subtrees from a shared
  queue (`fetch_add`), each with its own `KmeansArgs` and `_TH = 1`, so the
  many small nodes are processed without per-node thread spawning.

All workers share the single N-sized `label` buffer: sibling nodes own disjoint
`[first, last]` ranges of `localindices` (including the center slot at `last`),
so writes never overlap.
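The deferred-subtree scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DeferredSubtree`, `SubtreeCutoff`, and `BuildDeferred` are hypothetical names, and the inner loop is a placeholder for the real single-threaded clustering of one subtree.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a deferred subtree: a closed range [first, last]
// of positions in the shared label buffer.
struct DeferredSubtree {
    std::size_t first;
    std::size_t last;  // inclusive
};

// Cutoff from the description: max(BKTLeafSize * 4, N / (threads * 16) + 1).
std::size_t SubtreeCutoff(std::size_t leafSize, std::size_t n, std::size_t threads) {
    assert(threads > 0);
    return std::max(leafSize * 4, n / (threads * 16) + 1);
}

// Each worker claims the next subtree index with fetch_add and writes only
// inside that subtree's range, so the shared buffer needs no lock.
void BuildDeferred(const std::vector<DeferredSubtree>& queue,
                   std::vector<int>& labels, unsigned workers) {
    std::atomic<std::size_t> next{0};
    auto work = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= queue.size()) break;  // queue drained
            const DeferredSubtree& s = queue[i];
            // Placeholder for clustering subtree i with a per-worker context.
            for (std::size_t p = s.first; p <= s.last; ++p) {
                labels[p] = static_cast<int>(i);
            }
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(work);
    for (auto& t : pool) t.join();
}
```

With `BKTLeafSize = 8`, `N = 1000000`, and 16 threads, the cutoff evaluates to `max(32, 3907) = 3907`, so the `N / (threads * 16) + 1` term dominates on large inputs and the `BKTLeafSize * 4` floor only matters for small ones.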
Concurrency changes
- `Utils::rand` now uses a `thread_local std::mt19937`. `std::rand()` shares
  hidden global state and races under concurrent clustering. Note this also
  affects the serial path: previously `std::rand()` without `srand` produced
  the same sequence on every run; builds are now seeded from
  `std::random_device` and are no longer deterministic run-to-run. Quality is
  equivalent; verification should compare recall, not bits.
- Structural writes to `m_pTreeRoots` are guarded by a mutex; writes to
  `m_pSampleCenterMap` by another. The clustering itself is lock-free.
- `KmeansAssign` gains a `_TH == 1` serial fast path so Phase 2 workers do not
  spawn/join a thread per node.
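For reviewers unfamiliar with the pattern, a per-thread generator of the kind described for `Utils::rand` might look like the sketch below. `ThreadLocalRand` and its `(high, low)` parameter order are assumptions made for illustration; the actual signature in the PR may differ.

```cpp
#include <cassert>
#include <random>

// Hypothetical replacement for a std::rand()-based helper.
// Returns a uniformly distributed value in [low, high).
inline int ThreadLocalRand(int high, int low = 0) {
    assert(low < high);
    // One engine per thread, seeded once on first use in that thread,
    // so concurrent callers never touch shared generator state.
    thread_local std::mt19937 engine{std::random_device{}()};
    std::uniform_int_distribution<int> dist(low, high - 1);
    return dist(engine);
}
```

Because each thread seeds from `std::random_device`, two runs over the same data will generally draw different sequences, which is why comparing recall rather than bits is the appropriate check.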
Compatibility
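With `ParallelBuildBKT` enabled, the build is not bit-reproducible against the
serial path (RNG, scheduling, and floating-point summation order differ); tree
quality is equivalent.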
Also included
A second commit with portability fixes needed to consume SP
under a strict libc++ toolchain at C++23:
- The non-TBB `ConcurrentQueue`/`ConcurrentSet`/`ConcurrentMap` fallbacks now
  expose the TBB API surface that `ExtraFileController`/`ExtraDynamicSearcher`
  already use (`unsafe_size`, `empty`, `unsafe_begin`/`end`, `value_type`,
  range iteration), so the no-TBB configuration compiles.
- `LoggerHolder` always takes the `std::atomic_load`/`std::atomic_store` path:
  the C++20 `std::atomic<std::shared_ptr<Logger>>` specialization is not
  provided by libc++, and the pre-C++20 path is valid under both standards.
- Pass `.c_str()` to a `SPTAGLIB_LOG` varargs call in Appr (passing
  `std::string` through `...` is undefined behavior).
- Drop an unused `<omp.h>` include from `TxtReader.cpp`.
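The `.c_str()` fix can be illustrated with a small printf-style function. This is a sketch only: `FormatLog` is a hypothetical stand-in, not the `SPTAGLIB_LOG` macro itself, and the 256-byte buffer is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical printf-style formatter standing in for a variadic log call.
std::string FormatLog(const char* fmt, ...) {
    char buf[256];
    va_list args;
    va_start(args, fmt);
    // %s reads a const char* from the argument list; a std::string object
    // passed through "..." would not be read back as one.
    std::vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    return std::string(buf);
}
```

A call such as `FormatLog("open %s", path.c_str())` hands `%s` the `const char*` it expects, whereas `FormatLog("open %s", path)` with a `std::string path` is the pattern this commit removes.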