migrating multi-hop tests from diskann-providers to diskann by JordanMaples · Pull Request #928 · microsoft/DiskANN

JordanMaples · 2026-04-08T22:16:25Z

This is a first step in the effort to migrate unit tests from diskann_async to diskann (#927). If you think any of the tests brought over should not have been moved, you are probably right. Please point them out and I'll do my best to put them back where I found them.

I'll update this field as I continue to work on it:

I asked Copilot to isolate trivially migratable tests and move them. It landed on the multi-hop tests. Here are the notes it gave me when I asked about the major differences between the two implementations:

1. Provider & quantization — The originals used new_quant_index (inmem provider with a
trained PQ table). The new tests use test_provider::Provider::grid() — no quantization at
all. Since these tests are about multihop traversal behavior, not quantization accuracy, this
shouldn't matter, but it does mean the new tests exercise a simpler code path through the
accessor.

2. Start point filtering — This is the biggest behavioral difference. The inmem provider's
post-processor includes FilterStartPoints, which strips the start point from results. The
test provider does not filter start points. This forced a change in
reject_all_returns_zero_results: the original asserted result_count == 0, the new version
asserts zero non-start-point results. The other three tests weren't affected.

3. Async → sync — Originals were #[tokio::test] async fn. The new tests are #[test] fn using
current_thread_runtime().block_on(), matching the existing pattern in
diskann/src/graph/test/cases/.

4. Grid setup — Originals manually built vectors, adjacency lists, trained PQ, and called
populate_data/populate_graph. The new tests use Provider::grid() which does everything in one
call — less surface area, but also means the graph topology is generated differently (by
synthetic::Grid rather than the utils::genererate_3d_grid_adj_list helper).

5. Filter types — Moved as-is, no logic changes.

The start point difference (#2) is the one most worth flagging to your reviewer — it's a
genuine behavioral gap, not just a mechanical port.
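
To make the start-point gap in item 2 concrete, here is a minimal sketch of the kind of check the ported test needs. It is not code from the PR: `START_POINT` and `non_start_results` are hypothetical names, and the only detail taken from the thread is that the test provider's start point has the id `u32::MAX`.

```rust
/// Assumed id of the start point in the test provider (`u32::MAX` per the review thread).
const START_POINT: u32 = u32::MAX;

/// Returns the first `result_count` ids with the start point removed.
/// Hypothetical helper: it mimics what a start-point-stripping post-processor
/// would leave behind, so both providers can be held to the same assertion.
fn non_start_results(ids: &[u32], result_count: usize) -> Vec<u32> {
    ids.iter()
        .take(result_count)
        .copied()
        .filter(|&id| id != START_POINT)
        .collect()
}
```

With a reject-all filter, the original test could assert `result_count == 0`, while a port against a provider that keeps the start point would instead assert that `non_start_results(&ids, result_count)` is empty.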

codecov-commenter · 2026-04-09T21:26:11Z

Codecov Report

❌ Patch coverage is 95.37445% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.32%. Comparing base (5b44ed3) to head (9c85d26).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
diskann/src/graph/test/cases/multihop.rs	95.11%	21 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
- Coverage   89.32%   89.32%   -0.01%     
==========================================
  Files         447      449       +2     
  Lines       83605    83491     -114     
==========================================
- Hits        74683    74579     -104     
+ Misses       8922     8912      -10

Flag	Coverage Δ
miri	`89.32% <95.37%> (-0.01%)`	⬇️
unittests	`89.16% <95.37%> (-0.01%)`	⬇️

Files with missing lines	Coverage Δ
diskann-providers/src/index/diskann_async.rs	`96.74% <ø> (+0.34%)`	⬆️
diskann-providers/src/test_utils/search_utils.rs	`87.69% <ø> (ø)`
diskann/src/graph/search/multihop_search.rs	`99.44% <100.00%> (+1.36%)`	⬆️
diskann/src/graph/test/cases/multihop.rs	`95.11% <95.11%> (ø)`

... and 28 files with indirect coverage changes

- Move groundtruth, is_match, assert_top_k_exactly_match, and assert_range_results_exactly_match from diskann-providers/test_utils to diskann/graph/test/search_utils for cross-crate reuse
- Migrate test_even_filtering_multihop to diskann as even_filtering_multihop, using test_provider::Provider::grid()
- Remove test_multihop_filtering and test_even_filtering_multihop from diskann-providers
- Update all consumers in diskann-providers to use shared search_utils

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add callback_enforces_filtering test to multihop.rs
- Expand CallbackMetrics to track total_visits, rejected_count, adjusted_count (matching original)
- Remove test_multihop_callback_enforces_filtering, CallbackFilter, and CallbackMetrics from diskann-providers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Gate is_match, assert_top_k_exactly_match, assert_range_results_exactly_match with #[cfg(test)] in diskann/graph/test/search_utils
- Restore search_utils in diskann-providers/test_utils for cross-crate use, re-exporting groundtruth from diskann
- Update diskann-providers and diskann-disk imports accordingly
- Remove unused imports (Mutex, QueryVisitDecision, Knn) and dead code (test_multihop_search) from diskann-providers
- Fix needless_range_loop in multihop.rs
- Remove stale duplicate diskann dep in diskann-disk/Cargo.toml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add run_multihop_search() to eliminate repeated runtime/buffer/search boilerplate across 5 tests
- Add l2_groundtruth() to deduplicate brute-force groundtruth computation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Verifies that multihop search can discover matching nodes that are only reachable through non-matching nodes, exercising the core two-hop expansion behavior of the multihop algorithm. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
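
The behavior this commit tests can be illustrated with a toy traversal. The sketch below is an assumption-laden illustration, not DiskANN's `multihop_search` implementation: it ignores distances and beam width entirely, and `toy_multihop` is a hypothetical name. It only shows the idea that a non-matching neighbor is not returned, yet its own neighbors are still inspected so that matching nodes behind it stay reachable.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Toy traversal over an adjacency list.
/// Matching neighbors are collected and expanded as usual. A non-matching
/// neighbor is skipped, but its neighbors are checked (the second hop).
/// The start node itself is never added to the results.
fn toy_multihop(
    adj: &[Vec<usize>],
    start: usize,
    is_match: impl Fn(usize) -> bool,
) -> BTreeSet<usize> {
    let mut seen = vec![false; adj.len()];
    let mut results = BTreeSet::new();
    let mut queue = VecDeque::new();
    seen[start] = true;
    queue.push_back(start);

    while let Some(node) = queue.pop_front() {
        for &n in &adj[node] {
            if seen[n] {
                continue;
            }
            seen[n] = true;
            if is_match(n) {
                results.insert(n);
                queue.push_back(n);
            } else {
                // Second hop: look through the rejected node.
                for &m in &adj[n] {
                    if !seen[m] && is_match(m) {
                        seen[m] = true;
                        results.insert(m);
                        queue.push_back(m);
                    }
                }
            }
        }
    }
    results
}
```

On a path graph 0-1-2-3-4 with an even-only filter and start node 0, nodes 2 and 4 are only reachable through the odd nodes 1 and 3, which is the situation the test describes.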

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Migrates a first set of multi-hop search traversal unit tests from diskann_async into the diskann crate’s graph test suite, and introduces shared search ground-truth utilities to support those tests.

Changes:

- Added multihop test cases under diskann/src/graph/test/cases/ and wired them into the test module.
- Introduced diskann::graph::test::search_utils with ground-truth + assertion helpers for search verification.
- Removed the migrated multi-hop test helpers/cases from diskann-providers/src/index/diskann_async.rs and adjusted diskann-providers test utils module visibility/docs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

File	Description
diskann/src/graph/test/search_utils.rs	Adds ground-truth computation and assertion helpers for graph search tests.
diskann/src/graph/test/mod.rs	Exposes the new `search_utils` module in the graph test module.
diskann/src/graph/test/cases/multihop.rs	Adds migrated multi-hop traversal/filtering/termination/callback tests.
diskann/src/graph/test/cases/mod.rs	Registers the new `multihop` test module.
diskann-providers/src/test_utils/search_utils.rs	Updates/clarifies docs around duplicated ground-truth helpers for provider-side tests.
diskann-providers/src/test_utils/mod.rs	Makes `search_utils` publicly accessible from `diskann-providers::test_utils`.
diskann-providers/src/index/diskann_async.rs	Removes migrated multi-hop-related tests and supporting helpers.

Copilot · 2026-04-14T16:30:30Z

+/// Asserts that the range results exactly match the ground truth.
+///
+/// For each of the range results, this function verifies that both the distance and ID
+/// match exactly with what's expected in the ground truth.
+#[cfg(test)]
+pub fn assert_range_results_exactly_match(
+    query_id: usize,
+    gt: &[Neighbor<u32>],
+    ids: &[u32],
+    radius: f32,
+    inner_radius: Option<f32>,
+) {
+    let gt_ids = if let Some(inner_radius) = inner_radius {
+        gt.iter()
+            .filter(|nbh| nbh.distance >= inner_radius && nbh.distance <= radius)
+            .map(|nbh| nbh.id)
+            .collect::<Vec<_>>()
+    } else {
+        gt.iter()
+            .filter(|nbh| nbh.distance <= radius)
+            .map(|nbh| nbh.id)
+            .collect::<Vec<_>>()
+    };
+    if ids.iter().any(|id| !gt_ids.contains(id)) {
+        panic!(
+            "query {}: found ids {:?} in range search with radius {}, inner radius {}, but expected {:?}",
+            query_id,
+            ids,
+            radius,
+            inner_radius.unwrap_or(f32::MIN),
+            gt_ids
+        );
+    }
+}


The docs for assert_range_results_exactly_match claim it “exactly” verifies distance+ID matches the ground truth, but the implementation only checks that all returned ids are contained in the expected ID list (no distance checks, no check for missing expected IDs, no multiplicity/duplicate handling, and no “exact match” set equality). Please either (mandatory) update the docs/name to reflect the actual subset-membership behavior, or (optional) change the implementation to assert true set (or multiset) equality between expected-in-range and returned-in-range results (and include distance validation if that’s required).
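
For reference, a true set-equality check along the lines of the optional suggestion could look like the sketch below. It is a sketch under stated assumptions, not the PR's code: `Neighbor` here is a simplified stand-in for the `Neighbor<u32>` type in the diff, `range_mismatch` is a hypothetical name, and the function returns the mismatches instead of panicking so the caller can decide how to report them. Duplicates in `ids` collapse into the set, so multiplicity is still not checked.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for the `Neighbor<u32>` type used in the PR.
#[derive(Clone, Copy, Debug)]
struct Neighbor {
    id: u32,
    distance: f32,
}

/// Compares returned ids against the ground-truth ids inside the range.
/// Returns `(unexpected, missing)`; both are empty on an exact set match.
fn range_mismatch(
    gt: &[Neighbor],
    ids: &[u32],
    radius: f32,
    inner_radius: Option<f32>,
) -> (Vec<u32>, Vec<u32>) {
    // No inner radius means no lower bound on the distance.
    let lower = inner_radius.unwrap_or(f32::NEG_INFINITY);
    let expected: BTreeSet<u32> = gt
        .iter()
        .filter(|n| n.distance >= lower && n.distance <= radius)
        .map(|n| n.id)
        .collect();
    let returned: BTreeSet<u32> = ids.iter().copied().collect();

    // Ids that were returned but are not in range, and in-range ids that were not returned.
    let unexpected = returned.difference(&expected).copied().collect();
    let missing = expected.difference(&returned).copied().collect();
    (unexpected, missing)
}
```

An assertion helper built on this would fail when either list is non-empty, which covers the "missing expected IDs" case that the subset check lets through.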

Copilot · 2026-04-14T16:30:31Z

+            .map(|nbh| nbh.id)
+            .collect::<Vec<_>>()
+    };
+    if ids.iter().any(|id| !gt_ids.contains(id)) {


gt_ids.contains(id) is an O(n) scan inside an any(...) over ids, making this check O(|ids| * |gt_ids|). Even for tests this can add up as sizes grow. Consider materializing gt_ids as a HashSet<u32> (or sorting and using binary search) so membership checks are O(1) (or O(log n)).
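
A minimal sketch of the `HashSet` variant follows. The function name `ids_not_in_groundtruth` is hypothetical, and the sketch keeps the original subset-membership semantics while changing only the lookup cost.

```rust
use std::collections::HashSet;

/// Returns the ids in `ids` that are absent from `gt_ids`, in input order.
/// Building the set once makes each lookup O(1) on average, so the whole
/// check is O(|ids| + |gt_ids|) instead of O(|ids| * |gt_ids|).
fn ids_not_in_groundtruth(ids: &[u32], gt_ids: &[u32]) -> Vec<u32> {
    let allowed: HashSet<u32> = gt_ids.iter().copied().collect();
    ids.iter()
        .copied()
        .filter(|id| !allowed.contains(id))
        .collect()
}
```

Returning the offending ids, not just a boolean, also gives the panic message something more specific to print.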

Copilot · 2026-04-14T16:30:31Z

+    let mut gt = search_utils::groundtruth(data.as_view(), query, |a, b| SquaredL2::evaluate(a, b));
+    gt.sort_unstable_by(|a, b| a.cmp(b).reverse());
+    gt


search_utils::groundtruth(...) already sorts the results (and documents that nearest neighbors end up at the end). The extra sort_unstable_by(...) here is redundant and makes it harder to reason about the intended ordering. Removing the second sort (or removing sorting from groundtruth and doing it explicitly at call sites) would keep the ordering contract single-sourced.

Suggested change

-    let mut gt = search_utils::groundtruth(data.as_view(), query, |a, b| SquaredL2::evaluate(a, b));
-    gt.sort_unstable_by(|a, b| a.cmp(b).reverse());
-    gt
+    search_utils::groundtruth(data.as_view(), query, |a, b| SquaredL2::evaluate(a, b))
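
The "single-sourced ordering contract" can be sketched as follows. This is an illustrative stand-in, not the `search_utils::groundtruth` from the PR: the signature, the `(id, distance)` tuple representation, and the tie-break on id are all assumptions. The point is only that the helper sorts once, documents the order, and callers rely on that order without sorting again.

```rust
/// Squared Euclidean distance between two equal-length slices.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force ground truth as `(id, distance)` pairs, sorted farthest-first
/// so that the nearest neighbor is the last element.
fn groundtruth(data: &[Vec<f32>], query: &[f32]) -> Vec<(u32, f32)> {
    let mut gt: Vec<(u32, f32)> = data
        .iter()
        .enumerate()
        .map(|(i, v)| (i as u32, squared_l2(v, query)))
        .collect();
    // The only sort: descending distance, ties broken by descending id.
    gt.sort_unstable_by(|a, b| b.1.total_cmp(&a.1).then(b.0.cmp(&a.0)));
    gt
}
```

A caller that wants the k nearest neighbors can then take the last k elements directly.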

Copilot · 2026-04-14T16:30:31Z

+    if filter.hits().contains(&boosted_point) {
+        assert!(
+            boosted_in_adjusted.is_some(),
+            "boosted point should appear in adjusted results when visited"
+        );
+        if let (Some(baseline_pos), Some(adjusted_pos)) = (boosted_in_baseline, boosted_in_adjusted)
+        {
+            assert!(
+                adjusted_pos <= baseline_pos,
+                "boosted point should rank equal or better after distance reduction"
+            );
+        }
+    }


distance_adjustment_affects_ranking can pass without asserting anything about ranking changes if boosted_point is never visited (the whole assertion block is skipped). That creates a false-positive test that may silently stop validating the intended behavior. To make the test deterministic, consider asserting that the boosted point was visited/adjusted (e.g., via filter.metrics().adjusted_count >= 1) or adjusting the test setup/search params so the boosted point is guaranteed to be encountered.
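
One way to remove the silent skip is to treat "never visited" as a failure in its own right. The sketch below assumes the test can reduce its observations to a `visited` flag and two optional rank positions; `check_boost` is a hypothetical helper, not part of the PR.

```rust
/// Checks the ranking invariant for a boosted point without any silent skip.
/// `visited` is whether the filter saw the point; the positions are its ranks
/// in the baseline and adjusted result lists, if it appeared at all.
fn check_boost(
    visited: bool,
    baseline_pos: Option<usize>,
    adjusted_pos: Option<usize>,
) -> Result<(), String> {
    if !visited {
        // An unvisited boosted point means the setup no longer exercises the adjustment.
        return Err("boosted point was never visited".into());
    }
    let adjusted = adjusted_pos.ok_or("boosted point missing from adjusted results")?;
    if let Some(baseline) = baseline_pos {
        if adjusted > baseline {
            return Err(format!("rank got worse: {baseline} -> {adjusted}"));
        }
    }
    Ok(())
}
```

A test would call this unconditionally and unwrap the result, so a change that stops the search from reaching the boosted point fails loudly.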

harsha-simhadri · 2026-04-15T08:20:28Z

+//! The canonical `groundtruth` implementation lives in `diskann::graph::test::search_utils`,
+//! but that module is gated behind `cfg(test)` / `feature = "testing"` and is not available
+//! in non-test builds. This module duplicates the functions needed by `diskann-providers`
+//! and `diskann-disk` so they compile unconditionally.


should we address this duplication?

JordanMaples · 2026-04-15

I tried to have Copilot address the duplication, but it kept running into problems. I'll give it another go today and see if Mark has any ideas on keeping a single source of truth that both modules can pull in cleanly.

hildebrandmw

Thanks Jordan! This is moving the tests in the right direction, but we need to be careful about just moving the test infrastructure from diskann_async.rs as-is.
The utilities in search_utils.rs are extremely awkward for actually running tests, are very sensitive, and don't provide much useful information when they fire.
They've been used in diskann_async.rs because it was kind of the best thing we had back then.

My hope is that the new diskann can take a higher signal approach using baselines and VerboseEq.
Not only does this provide a really good way of viewing the expected results as a whole, it's also great for storing additional metrics.
For example, the stats, ids, and distances from multi-hop search can all be checked in as part of the baseline and get protected for free.

My ask is to not migrate the search_utils.rs as is - especially if it means including test methods that aren't actually used by diskann.
Also, use the baseline capturing mechanism to capture everything about both test setups and results.
We cannot rely solely on the baseline to protect against regression (someone could check-in a broken baseline in the future), but a baseline in combination with some invariant checks (returned items should be filtered/adjusted) will go a long way toward good tests.
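
The combination of a checked-in baseline and independent invariant checks can be sketched roughly as below. This is not the `VerboseEq` mechanism from the diskann crate, whose API is not shown in this thread. `Baseline`, `note`, `diff`, and `all_even` are hypothetical names chosen for illustration, and the fields are only a subset of what a real baseline would capture.

```rust
use std::fmt::Debug;

/// Setup and results of one multihop run (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Baseline {
    grid_size: usize,
    k: usize,
    l: usize,
    ids: Vec<u32>,
    distances: Vec<f32>,
}

/// Records a message when one field differs between expected and actual.
fn note<T: PartialEq + Debug>(out: &mut Vec<String>, name: &str, e: &T, a: &T) {
    if e != a {
        out.push(format!("{name}: expected {e:?}, got {a:?}"));
    }
}

/// Lists every differing field, so a failure shows the whole picture at once.
fn diff(expected: &Baseline, actual: &Baseline) -> Vec<String> {
    let mut out = Vec::new();
    note(&mut out, "grid_size", &expected.grid_size, &actual.grid_size);
    note(&mut out, "k", &expected.k, &actual.k);
    note(&mut out, "l", &expected.l, &actual.l);
    note(&mut out, "ids", &expected.ids, &actual.ids);
    note(&mut out, "distances", &expected.distances, &actual.distances);
    out
}

/// Invariant that holds no matter what the baseline says:
/// with an even-only filter, every returned id must be even.
fn all_even(ids: &[u32]) -> bool {
    ids.iter().all(|id| id % 2 == 0)
}
```

The baseline comparison catches any drift in results, while the invariant check still fails if someone checks in a baseline that contains an odd id.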

hildebrandmw · 2026-04-15T19:13:41Z

+    // The start point (u32::MAX) is seeded directly into the candidate set and bypasses
+    // both is_match and on_visit. It may appear in results. All non-start-point results
+    // should be zero since on_visit rejects everything.
+    let non_start_results = (0..stats.result_count as usize)


We probably shouldn't have a "maybe" in here. The start point is either included or it isn't. We've changed the test provider's default to exclude the start point, so is this check still needed?

hildebrandmw · 2026-04-15T19:16:06Z

+    let target = (num_points / 2) as u32;
+    let filter = TerminatingFilter::new(target);
+
+    let (_index, stats, _ids, _distances) = run_multihop_search(grid_size, &query, 10, 40, &filter);


We can include all of grid_size, query, target, k and l in a checked-in baseline, along with stats, ids, and distances to make this a much better test.

hildebrandmw · 2026-04-15T19:17:33Z

+        .take(adjusted_stats.result_count as usize)
+        .position(|&id| id == boosted_point);
+
+    if filter.hits().contains(&boosted_point) {


This is a little surprising to have in a test. If the filter does not contain the boosted point, we don't check anything. I feel like we should know whether or not the boosted point was expected.

hildebrandmw · 2026-04-15T19:18:50Z

+    let (_index, _stats, _ids, _distances) =
+        run_multihop_search(grid_size, &query, 10, 100, &filter);
+
+    // Allow some slack for beam expansion.


This comment is surprisingly cryptic. Again, a baseline capturing everything would be much better.

hildebrandmw · 2026-04-15T19:20:24Z

+    // All returned results should be even (matching the filter).
+    for &id in &result_ids {
+        if id == u32::MAX {
+            continue; // start point may appear


Do we still need this check?

JordanMaples · 2026-04-16T14:38:02Z

converting back to draft as it needs some more human refinement

Test eval() and eval_mut() behavior: visited-set exclusion, label matching, and insert-on-match semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace migrated integration tests with focused unit tests that call multihop_search_internal directly on small hand-constructed graphs:
- accept_all_finds_all_nodes: one-hop expansion with AcceptAll filter
- reject_triggers_two_hop_expansion: EvenFilter rejection triggers two-hop
- reject_all_yields_only_start: RejectAll leaves only start in best set
- terminate_stops_search_on_target: TerminateOnTarget stops search early
- block_and_adjust_modifies_results: blocked node excluded, distance adjusted

Add integration tests with VerboseEq baselines:
- two_hop_reaches_through_non_matching: end-to-end with invariants
- even_filtering_grid: 3D grid with even-only filter
- callback_filtering_grid: block+adjust with full metrics baseline

Remove search_utils.rs (only used by old multihop tests). Make multihop_search module pub(crate) for direct internal testing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

JordanMaples force-pushed the jordanmaples/migrate_unittests branch 2 times, most recently from 5a59f22 to 5655eea Compare April 9, 2026 21:12

JordanMaples and others added 6 commits April 13, 2026 08:10

copilot's handy work

a3c5986

clippy

385b875

formatter

b7d238b

JordanMaples force-pushed the jordanmaples/migrate_unittests branch from 8105440 to b7d238b Compare April 13, 2026 15:12

JordanMaples and others added 5 commits April 13, 2026 08:33

had ai fix the clippy errors

1558165

Merge branch 'main' into jordanmaples/migrate_unittests

aa46d06

Merge branch 'main' into jordanmaples/migrate_unittests

299c609

JordanMaples marked this pull request as ready for review April 14, 2026 16:22

JordanMaples requested review from a team and Copilot April 14, 2026 16:22

JordanMaples changed the title ~~[partial][draft][in-progress] migrating unit tests~~ migrating multi-hop tests from diskann-provider to diskann Apr 14, 2026

JordanMaples changed the title ~~migrating multi-hop tests from diskann-provider to diskann~~ migrating multi-hop tests from diskann-async to diskann Apr 14, 2026

JordanMaples changed the title ~~migrating multi-hop tests from diskann-async to diskann~~ migrating multi-hop tests from diskann-providers to diskann Apr 14, 2026

Copilot AI reviewed Apr 14, 2026

Copilot started reviewing on behalf of JordanMaples April 14, 2026 17:32

harsha-simhadri reviewed Apr 15, 2026

Merge branch 'main' into jordanmaples/migrate_unittests

1dbbda8

hildebrandmw reviewed Apr 15, 2026

JordanMaples marked this pull request as draft April 16, 2026 14:37

JordanMaples and others added 2 commits April 17, 2026 14:11

Add unit tests for NotInMutWithLabelCheck predicate

6c7dd85

Test eval() and eval_mut() behavior: visited-set exclusion, label matching, and insert-on-match semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

5 participants
