Conversation
… get a wrong answer
…bbdsg that makes labels that can fit
This reverts commit 5436d73.
…indexing test cases
…he middle of chains
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
cache_payloads was single-threaded despite the -t flag; with 164M nodes on an HPRC graph it hung for hours. Two fixes: 1. Pass `true` to for_each_handle to enable OpenMP parallelism; guard the non-thread-safe writes (oversized_zipcodes vector and node_id_to_payload map) with named omp critical sections. 2. Call distance_index->preload(true) immediately before cache_payloads in build_minimizer_index. find_frequent_kmers runs for ~3300 s before this point and evicts the mmap'd index pages, causing a page fault on every snarl-tree lookup in fill_in_zipcode_from_pos. Reloading here ensures the index is warm when the parallel loop starts. Also add a depth guard (abort at >10000) in fill_in_zipcode_from_pos to catch any future infinite loops in the snarl tree traversal. Also use distance_index.get_snarl_child_count() (O(1) record read) instead of for_each_child iteration in get_regular/irregular_snarl_code. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parallel payload caching
Track regualrity through distance index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
passes all tests run by `make test` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o support -std=c++20
planned out by Claude Opus 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
Co-Authored-By: GitHub Copilot <noreply@github.com>
This reverts commit 7920c76.
This reverts commit b81e331.
initial plan by Claude Opus 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: GitHub Copilot <noreply@github.com>
This reverts commit 6cf266c.
adamnovak
left a comment
There was a problem hiding this comment.
Here's my review, including a bunch of stuff I now want to change about code I committed.
The changes to libbdsg also need to be reviewed.
| // re-preload right before cache_payloads. The double-preload is | ||
| // necessary: a single preload just before cache_payloads isn't enough | ||
| // to keep the index resident under the memory pressure of 32 parallel | ||
| // threads and the remaining in-memory data structures. |
There was a problem hiding this comment.
I don't think that's how page caching works; if it's paged in it's paged in, right? It can't possibly get more paged in if you add a second, earlier copy of the step where it got paged in.
We shouldn't be doing magic; if somehow this genuinely does improve performance we need to be able to explain why, in terms of something like a page eviction algorithm we can link to that's trying to be clever and really does care how long something has been cached.
There was a problem hiding this comment.
This actually improves speed (at the cost of more memory usage). idk why yet
|
|
||
| // Step 2: Build pairing vector mapping each begin to its matching end | ||
| // and vice versa, using separate stacks for chains and snarls. | ||
| std::vector<size_t> pair_of(events.size()); |
There was a problem hiding this comment.
pair_of is a bad name for this, it ought to be something like other_end.
(I'm pretty sure I put it in though.)
|
|
||
| using ET = DecompositionEventType; | ||
|
|
||
| // SnarlDecompositionFuzzer deterministic constructor |
There was a problem hiding this comment.
| // SnarlDecompositionFuzzer deterministic constructor |
This is more confusing than enlightening. We have documentation for the constructors in the HPP that explains how we have one that gets pre-loaded and one that doesn't, using enough words.
| * For deterministic testing, chains_to_flip can be provided, which is a set | ||
| * of (begin_handle, end_handle) pairs identifying chains to flip. When | ||
| * provided, p_flip and the generator are ignored. |
There was a problem hiding this comment.
| * For deterministic testing, chains_to_flip can be provided, which is a set | |
| * of (begin_handle, end_handle) pairs identifying chains to flip. When | |
| * provided, p_flip and the generator are ignored. | |
| * For non-randomized testing, the specific chains to flip can be | |
| * pre-identified and provided on construction. |
Even the randomized testing is meant to be deterministic. And we shouldn't start talking about parameter names as if they're real things here in the class doc comment.
| const HandleGraphSnarlFinder* wrapped; | ||
|
|
||
| /// Function that decides whether to flip a chain, given either of its | ||
| /// bounding node IDs. May be nondeterministic. |
There was a problem hiding this comment.
| /// bounding node IDs. May be nondeterministic. | |
| /// bounding node IDs. May produce different results when called | |
| /// multiple times with the same input. |
It still really ought to be a deterministic sequence of results for a sequence of calls and a random seed somewhere.
| // page cache. We also preload eagerly right after loading the index (in | ||
| // minimizer_main.cpp) so the kernel treats those pages as recently-used; | ||
| // together the two preloads prevent cache_payloads from page-faulting on | ||
| // every node under the memory pressure of 32 parallel threads. |
There was a problem hiding this comment.
This doesn't really explain how two passes of preloading could possibly help, either.
| * - Normal snarl: all rows | ||
| * - Oversized snarl: boundaries and tips | ||
| * - size_limit == 0: no distances in index, so no rows | ||
| * - Top-level chain distances only: ??? |
There was a problem hiding this comment.
It would be good to figure this out and fill this in.
| #ifdef debug_hub_label_build | ||
| // Dump CHOverlay graph to stderr for debugging | ||
| std::cerr << "=== CHOverlay Graph Dump ===" << std::endl; | ||
| std::cerr << "Vertices: " << num_vertices(ov) << ", Edges: " << num_edges(ov) << std::endl; | ||
| std::cerr << "--- Nodes ---" << std::endl; | ||
| for (auto v : boost::make_iterator_range(vertices(ov))) { | ||
| const NodeProp& np = ov[v]; | ||
| std::cerr << "Node " << v << ": seqlen=" << np.seqlen | ||
| << " max_out=" << np.max_out | ||
| << " contracted_neighbors=" << np.contracted_neighbors | ||
| << " level=" << np.level | ||
| << " arc_cover=" << np.arc_cover | ||
| << " contracted=" << (np.contracted ? "true" : "false") | ||
| // Skip new_id since it is not initialized until make_contraction_hierarchy is run. | ||
| << std::endl; | ||
| } | ||
| std::cerr << "--- Edges ---" << std::endl; | ||
| for (auto e : boost::make_iterator_range(edges(ov))) { | ||
| const EdgeProp& ep = ov[e]; | ||
| std::cerr << "Edge " << source(e, ov) << " -> " << target(e, ov) | ||
| << ": contracted=" << (ep.contracted ? "true" : "false") | ||
| << " weight=" << ep.weight | ||
| << " arc_cover=" << ep.arc_cover | ||
| << " ori=" << (ep.ori ? "true" : "false") << std::endl; | ||
| } | ||
| std::cerr << "=== End CHOverlay Dump ===" << std::endl; | ||
| #endif |
There was a problem hiding this comment.
I think I put it it here, but this could stand to become a CHOverlay method or debug function.
| if ( (temp_snarl_record.node_count > size_limit || size_limit == 0 || only_top_level_chain_distances) && (temp_snarl_record.is_root_snarl || start_normal_child)) { | ||
| //If we don't care about internal distances, and we also are not at a boundary or tip | ||
| //TODO: Why do we care about tips specifically? | ||
| continue; | ||
| } | ||
| //getting here means snarl is not oversized |
There was a problem hiding this comment.
We only even get into this function now if we're not oversized, or if size_limit is 0 and we're not including distances at all. This should be reworked to understand that.
Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>
Changelog Entry
To be copied to the draft changelog by merger:
Description
Adds hub labeling functionality to the snarl distance index.