
Hublabel #4870

Open
electricEpilith wants to merge 75 commits into master from hublabel
Conversation

@electricEpilith

Changelog Entry

To be copied to the draft changelog by merger:

  • Add hub labeling to distance index, which allows efficient exact shortest distance queries even in "oversized" snarls
  • Bug fix for the minimizer index, with a significant speed improvement

Description

Adds hub labeling functionality to the snarl distance index.

electricEpilith and others added 30 commits November 12, 2025 14:32
adamnovak and others added 28 commits March 20, 2026 15:55
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
cache_payloads was single-threaded despite the -t flag; with 164M nodes
on an HPRC graph it hung for hours. Two fixes:

1. Pass `true` to for_each_handle to enable OpenMP parallelism; guard
   the non-thread-safe writes (oversized_zipcodes vector and
   node_id_to_payload map) with named omp critical sections.

2. Call distance_index->preload(true) immediately before cache_payloads
   in build_minimizer_index. find_frequent_kmers runs for ~3300 s before
   this point and evicts the mmap'd index pages, causing a page fault on
   every snarl-tree lookup in fill_in_zipcode_from_pos. Reloading here
   ensures the index is warm when the parallel loop starts.

Also add a depth guard (abort at >10000) in fill_in_zipcode_from_pos to
catch any future infinite loops in the snarl tree traversal.

Also use distance_index.get_snarl_child_count() (O(1) record read)
instead of for_each_child iteration in get_regular/irregular_snarl_code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track regularity through distance index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
passes all tests run by `make test`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
planned out by Claude Opus 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
initial plan by Claude Opus 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Member

@adamnovak adamnovak left a comment


Here's my review, including a bunch of stuff I now want to change about code I committed.

The changes to libbdsg also need to be reviewed.

Comment on lines +96 to +99
// re-preload right before cache_payloads. The double-preload is
// necessary: a single preload just before cache_payloads isn't enough
// to keep the index resident under the memory pressure of 32 parallel
// threads and the remaining in-memory data structures.
Member


I don't think that's how page caching works; if it's paged in it's paged in, right? It can't possibly get more paged in if you add a second, earlier copy of the step where it got paged in.

We shouldn't be doing magic; if somehow this genuinely does improve performance we need to be able to explain why, in terms of something like a page eviction algorithm we can link to that's trying to be clever and really does care how long something has been cached.

Author


This actually improves speed (at the cost of more memory usage). I don't know why yet.


// Step 2: Build pairing vector mapping each begin to its matching end
// and vice versa, using separate stacks for chains and snarls.
std::vector<size_t> pair_of(events.size());
Member


pair_of is a bad name for this; it ought to be something like other_end.

(I'm pretty sure I put it in though.)


using ET = DecompositionEventType;

// SnarlDecompositionFuzzer deterministic constructor
Member


Suggested change
// SnarlDecompositionFuzzer deterministic constructor

This is more confusing than enlightening. We have documentation for the constructors in the HPP that explains how we have one that gets pre-loaded and one that doesn't, using enough words.

Comment on lines +54 to +56
* For deterministic testing, chains_to_flip can be provided, which is a set
* of (begin_handle, end_handle) pairs identifying chains to flip. When
* provided, p_flip and the generator are ignored.
Member


Suggested change
* For deterministic testing, chains_to_flip can be provided, which is a set
* of (begin_handle, end_handle) pairs identifying chains to flip. When
* provided, p_flip and the generator are ignored.
* For non-randomized testing, the specific chains to flip can be
* pre-identified and provided on construction.

Even the randomized testing is meant to be deterministic. And we shouldn't start talking about parameter names as if they're real things here in the class doc comment.

const HandleGraphSnarlFinder* wrapped;

/// Function that decides whether to flip a chain, given either of its
/// bounding node IDs. May be nondeterministic.
Member


Suggested change
/// bounding node IDs. May be nondeterministic.
/// bounding node IDs. May produce different results when called
/// multiple times with the same input.

It still really ought to be a deterministic sequence of results for a sequence of calls and a random seed somewhere.

Comment on lines +541 to +544
// page cache. We also preload eagerly right after loading the index (in
// minimizer_main.cpp) so the kernel treats those pages as recently-used;
// together the two preloads prevent cache_payloads from page-faulting on
// every node under the memory pressure of 32 parallel threads.
Member


This doesn't really explain how two passes of preloading could possibly help, either.

* - Normal snarl: all rows
* - Oversized snarl: boundaries and tips
* - size_limit == 0: no distances in index, so no rows
* - Top-level chain distances only: ???
Member


It would be good to figure this out and fill this in.

Comment on lines +1178 to +1204
#ifdef debug_hub_label_build
// Dump CHOverlay graph to stderr for debugging
std::cerr << "=== CHOverlay Graph Dump ===" << std::endl;
std::cerr << "Vertices: " << num_vertices(ov) << ", Edges: " << num_edges(ov) << std::endl;
std::cerr << "--- Nodes ---" << std::endl;
for (auto v : boost::make_iterator_range(vertices(ov))) {
const NodeProp& np = ov[v];
std::cerr << "Node " << v << ": seqlen=" << np.seqlen
<< " max_out=" << np.max_out
<< " contracted_neighbors=" << np.contracted_neighbors
<< " level=" << np.level
<< " arc_cover=" << np.arc_cover
<< " contracted=" << (np.contracted ? "true" : "false")
// Skip new_id since it is not initialized until make_contraction_hierarchy is run.
<< std::endl;
}
std::cerr << "--- Edges ---" << std::endl;
for (auto e : boost::make_iterator_range(edges(ov))) {
const EdgeProp& ep = ov[e];
std::cerr << "Edge " << source(e, ov) << " -> " << target(e, ov)
<< ": contracted=" << (ep.contracted ? "true" : "false")
<< " weight=" << ep.weight
<< " arc_cover=" << ep.arc_cover
<< " ori=" << (ep.ori ? "true" : "false") << std::endl;
}
std::cerr << "=== End CHOverlay Dump ===" << std::endl;
#endif
Member


I think I put it in here, but this could stand to become a CHOverlay method or debug function.

Comment on lines +1310 to +1315
if ( (temp_snarl_record.node_count > size_limit || size_limit == 0 || only_top_level_chain_distances) && (temp_snarl_record.is_root_snarl || start_normal_child)) {
//If we don't care about internal distances, and we also are not at a boundary or tip
//TODO: Why do we care about tips specifically?
continue;
}
//getting here means snarl is not oversized
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only even get into this function now if we're not oversized, or if size_limit is 0 and we're not including distances at all. This should be reworked to understand that.

Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>