
Hublabel #4870

Open
electricEpilith wants to merge 75 commits into master from hublabel
Conversation

@electricEpilith

Changelog Entry

To be copied to the draft changelog by merger:

  • Add hub labeling to distance index, which allows efficient exact shortest distance queries even in "oversized" snarls
  • Bug fix for the minimizer index, with a significant speed improvement

Description

Adds hub labeling functionality to the snarl distance index.

electricEpilith and others added 30 commits November 12, 2025 14:32
adamnovak and others added 28 commits March 20, 2026 15:55
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
cache_payloads was single-threaded despite the -t flag; with 164M nodes
on an HPRC graph it hung for hours. Two fixes:

1. Pass `true` to for_each_handle to enable OpenMP parallelism; guard
   the non-thread-safe writes (oversized_zipcodes vector and
   node_id_to_payload map) with named omp critical sections.

2. Call distance_index->preload(true) immediately before cache_payloads
   in build_minimizer_index. find_frequent_kmers runs for ~3300 s before
   this point and evicts the mmap'd index pages, causing a page fault on
   every snarl-tree lookup in fill_in_zipcode_from_pos. Reloading here
   ensures the index is warm when the parallel loop starts.

Also add a depth guard (abort at >10000) in fill_in_zipcode_from_pos to
catch any future infinite loops in the snarl tree traversal.

Also use distance_index.get_snarl_child_count() (O(1) record read)
instead of for_each_child iteration in get_regular/irregular_snarl_code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track regularity through distance index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
passes all tests run by `make test`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
planned out by Claude Opus 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
initial plan by Claude Opus 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Member

@adamnovak adamnovak left a comment


Here's my review, including a bunch of stuff I now want to change about code I committed.

The changes to libbdsg also need to be reviewed.

Comment on lines +96 to +99
// re-preload right before cache_payloads. The double-preload is
// necessary: a single preload just before cache_payloads isn't enough
// to keep the index resident under the memory pressure of 32 parallel
// threads and the remaining in-memory data structures.
Member


I don't think that's how page caching works; if it's paged in it's paged in, right? It can't possibly get more paged in if you add a second, earlier copy of the step where it got paged in.

We shouldn't be doing magic; if somehow this genuinely does improve performance we need to be able to explain why, in terms of something like a page eviction algorithm we can link to that's trying to be clever and really does care how long something has been cached.

Author


This actually improves speed (at the cost of more memory usage). I don't know why yet.


// Step 2: Build pairing vector mapping each begin to its matching end
// and vice versa, using separate stacks for chains and snarls.
std::vector<size_t> pair_of(events.size());
Member


pair_of is a bad name for this; it ought to be something like other_end.

(I'm pretty sure I put it in though.)


using ET = DecompositionEventType;

// SnarlDecompositionFuzzer deterministic constructor
Member


Suggested change
// SnarlDecompositionFuzzer deterministic constructor

This is more confusing than enlightening. We have documentation for the constructors in the HPP that explains how we have one that gets pre-loaded and one that doesn't, using enough words.

Comment on lines +54 to +56
* For deterministic testing, chains_to_flip can be provided, which is a set
* of (begin_handle, end_handle) pairs identifying chains to flip. When
* provided, p_flip and the generator are ignored.
Member


Suggested change
* For deterministic testing, chains_to_flip can be provided, which is a set
* of (begin_handle, end_handle) pairs identifying chains to flip. When
* provided, p_flip and the generator are ignored.
* For non-randomized testing, the specific chains to flip can be
* pre-identified and provided on construction.

Even the randomized testing is meant to be deterministic. And we shouldn't start talking about parameter names as if they're real things here in the class doc comment.

const HandleGraphSnarlFinder* wrapped;

/// Function that decides whether to flip a chain, given either of its
/// bounding node IDs. May be nondeterministic.
Member


Suggested change
/// bounding node IDs. May be nondeterministic.
/// bounding node IDs. May produce different results when called
/// multiple times with the same input.

It still really ought to be a deterministic sequence of results for a sequence of calls and a random seed somewhere.

Comment on lines +541 to +544
// page cache. We also preload eagerly right after loading the index (in
// minimizer_main.cpp) so the kernel treats those pages as recently-used;
// together the two preloads prevent cache_payloads from page-faulting on
// every node under the memory pressure of 32 parallel threads.
Member


This doesn't really explain how two passes of preloading could possibly help, either.

* - Normal snarl: all rows
* - Oversized snarl: boundaries and tips
* - size_limit == 0: no distances in index, so no rows
* - Top-level chain distances only: ???
Member


It would be good to figure this out and fill this in.

Comment on lines +1178 to +1204
#ifdef debug_hub_label_build
// Dump CHOverlay graph to stderr for debugging
std::cerr << "=== CHOverlay Graph Dump ===" << std::endl;
std::cerr << "Vertices: " << num_vertices(ov) << ", Edges: " << num_edges(ov) << std::endl;
std::cerr << "--- Nodes ---" << std::endl;
for (auto v : boost::make_iterator_range(vertices(ov))) {
const NodeProp& np = ov[v];
std::cerr << "Node " << v << ": seqlen=" << np.seqlen
<< " max_out=" << np.max_out
<< " contracted_neighbors=" << np.contracted_neighbors
<< " level=" << np.level
<< " arc_cover=" << np.arc_cover
<< " contracted=" << (np.contracted ? "true" : "false")
// Skip new_id since it is not initialized until make_contraction_hierarchy is run.
<< std::endl;
}
std::cerr << "--- Edges ---" << std::endl;
for (auto e : boost::make_iterator_range(edges(ov))) {
const EdgeProp& ep = ov[e];
std::cerr << "Edge " << source(e, ov) << " -> " << target(e, ov)
<< ": contracted=" << (ep.contracted ? "true" : "false")
<< " weight=" << ep.weight
<< " arc_cover=" << ep.arc_cover
<< " ori=" << (ep.ori ? "true" : "false") << std::endl;
}
std::cerr << "=== End CHOverlay Dump ===" << std::endl;
#endif
Member


I think I put it in here, but this could stand to become a CHOverlay method or debug function.

Comment on lines +1310 to +1315
if ( (temp_snarl_record.node_count > size_limit || size_limit == 0 || only_top_level_chain_distances) && (temp_snarl_record.is_root_snarl || start_normal_child)) {
//If we don't care about internal distances, and we also are not at a boundary or tip
//TODO: Why do we care about tips specifically?
continue;
}
//getting here means snarl is not oversized
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only even get into this function now if we're not oversized, or if size_limit is 0 and we're not including distances at all. This should be reworked to understand that.

Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>