- Introduce a simplified catalog-driven dataset loader and fully switch benchmarking to use it.
- Disable/remove legacy dataset loaders and legacy scrubbing behavior; dataset metadata now uses NO_SCRUB.
- Add transport-routed remote loading with S3 support, shared client reuse, and parallel base/query/gt downloads.
- Support local catalog auto-discovery, base_url indirection, and YAML-driven dataset configuration.
- Cache included remote catalogs locally so previously downloaded remote datasets remain usable offline, while preserving local catalog override precedence.
- Improve loader robustness with better error handling, safer logging redaction, and more reliable dataset metadata path resolution across working directories.
- Expand test coverage for remote catalog loading, local-vs-remote precedence, offline cached-catalog behavior, and Windows env-var handling.
- Refactor dataset catalog/layout conventions, including public/protected cache subdirectories under dataset_cache.
- Refresh loader and dataset documentation, including local/remote behavior, benchmarking paths, and catalog examples.
- Clean up supporting project configuration, including .gitignore, RAT excludes, GitHub Actions dataset secret handling, and related benchmark YAML organization.
---------
Co-authored-by: Ted Willke <ted.willke@gmail.com>
**README.md** (+1 −1)

```diff
@@ -74,7 +74,7 @@ You may also use method-level filtering and patterns, e.g.,
 (The `failIfNoSpecifiedTests` option works around a quirk of surefire: it is happy to run `test` with submodules with empty test sets,
 but as soon as you supply a filter, it wants at least one match in every submodule.)
 
-You can run `SiftSmall` and `Bench` directly to get an idea of what all is going on here. `Bench` will automatically download required datasets to the `fvec` and `hdf5` directories.
+You can run `SiftSmall` and `Bench` directly to get an idea of what all is going on here. `Bench` will automatically download required datasets to the `dataset_cache` directory.
 The files used by `SiftSmall` can be found in the [siftsmall directory](./siftsmall) in the project root.
 
 To run either class, you can use the Maven exec-plugin via the following incantations:
```
**docs/benchmarking.md** (+42 −55)

````diff
@@ -4,21 +4,19 @@ JVector comes with a built-in benchmarking system in `jvector-examples/.../Bench`
 To run a benchmark
 - Decide which dataset(s) you want to benchmark. A dataset consists of
-  - The vectors to be indexed, usually called the "base" or "target" vectors.
-  - The query vectors.
-  - The "ground truth" results which are used to compute accuracy metrics.
-  - The similarity metric which should have been used to compute the ground truth (dot product, cosine similarity or L2 distance).
+  - The vectors to be indexed, usually called the "base" or "target" vectors
+  - The query vectors
+  - The "ground truth" results that are used to compute accuracy metrics
+  - The similarity metric used to compute the ground truth (dot product, cosine similarity or L2 distance)
-- Configure the parameters combinations for which you want to run the benchmark. This includes graph index parameters, quantization parameters and search parameters.
+- Configure the parameter combinations for which you want to run the benchmark. This includes index construction parameters, quantization parameters and search parameters.
 
-JVector supports two types of datasets:
-- **Fvec/Ivec**: The dataset consists of three files, for example `base.fvec`, `queries.fvec` and `neighbors.ivec` containing the base vectors, query vectors, and ground truth. (`fvec` and `ivec` file formats are described [here](http://corpus-texmex.irisa.fr/))
-- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.
+JVector supports datasets in the fvecs/ivecs format. These consist of three files, for example `base.fvecs`, `queries.fvecs` and `neighbors.ivecs` containing the base vectors, query vectors, and ground truth. (`fvecs` and `ivecs` file formats are described [here](http://corpus-texmex.irisa.fr/))
 
 The general procedure for running benchmarks is mentioned below. The following sections describe the process in more detail.
 - [Specify the dataset](#specifying-datasets) names to benchmark in `datasets.yml`.
 - Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and made available (refer to the section on [Custom datasets](#custom-datasets)).
-- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<your-dataset-name>.yml` in the same folder.
-- Decide on the kind of measurements and logging you want and configure them in `run.yml`.
+- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<your-dataset-name>.yml` in the `index-parameters` subfolder.
+- Decide on the kind of measurements and logging you want and configure them in `run-config.yml`.
 
 You can run the configured benchmark with maven:
 ```sh
````
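The fvecs/ivecs record layout referenced in this hunk (each record is an int32 element count followed by that many 4-byte values, conventionally little-endian per the TEXMEX format) can be sketched as a small reader. This is an illustrative sketch only, not JVector's actual dataset loader; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative fvecs reader (NOT JVector's loader): each record is an int32
// dimension d followed by d float32 values, conventionally little-endian.
public class FvecsSketch {
    static List<float[]> readFvecs(Path path) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(path)).order(ByteOrder.LITTLE_ENDIAN);
        List<float[]> vectors = new ArrayList<>();
        while (buf.hasRemaining()) {
            int d = buf.getInt();            // per-record dimension prefix
            float[] v = new float[d];
            for (int i = 0; i < d; i++) v[i] = buf.getFloat();
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a tiny two-vector file to show the layout.
        ByteBuffer buf = ByteBuffer.allocate(2 * (4 + 2 * 4)).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(2).putFloat(1.0f).putFloat(2.0f);
        buf.putInt(2).putFloat(3.0f).putFloat(4.0f);
        Path tmp = Files.createTempFile("sketch", ".fvecs");
        Files.write(tmp, buf.array());
        List<float[]> vectors = readFvecs(tmp);
        System.out.println(vectors.size());    // 2
        System.out.println(vectors.get(1)[0]); // 3.0
        Files.deleteIfExists(tmp);
    }
}
```

An ivecs file has the same structure with int32 elements in place of float32.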
````diff
@@ -31,31 +29,28 @@ The datasets you want to benchmark should be specified in `jvector-examples/yaml
 To benchmark a single dataset, comment out the entries corresponding to all other datasets. (Or provide command line arguments as described in [Running `bench` from the command line](#running-bench-from-the-command-line))
 
-Datasets are assumed to be Fvec/Ivec based unless the entry in the `datasets.yml` ends with `.hdf5`. In this case, `.hdf5` is not considered part of the "dataset name" referenced in other sections.
+Datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.
 
-You'll notice that datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.
-
-For HDF5 files, the substrings `-angular`, `-euclidean` and `-dot` correspond to cosine similarity, L2 distance, and dot product similarity functions (these substrings ARE considered to be part of the "dataset name"). Currently, Fvec/Ivec datasets are implicitly assumed to use cosine similarity (changing this requires editing `DataSetLoaderMFD.java`).
+Dataset similarity functions are configured in `jvector-examples/yaml-configs/dataset-metadata.yml`.
-  - some-dataset-euclidean # fvec/ivec dataset, cosine similarity (NOT L2 unless you change the code!)
+  - another-dataset-a
+  - another-dataset-b
 ```
 
 ## Setting benchmark parameters
 
 ### default.yml / \<dataset-name\>.yml
 
-`jvector-examples/yaml-configs/default.yml` specifies the default index construction and search parameters to be used by `bench` for all datasets.
+`jvector-examples/yaml-configs/index-parameters/default.yml` specifies the default index construction and search parameters to be used by `bench` for all datasets.
 
-You can specify a custom set of parameters for any given dataset by creating a file called `<dataset-name>.yml`, with `<dataset-name>` replaced by the actual name of the dataset. This is the same as the identifier used in `datasets.yml`, but without the `.hdf5` suffix for hdf5 datasets. The format of this file is exactly the same as `default.yml`.
+You can specify a custom set of parameters for any given dataset by creating a file called `<dataset-name>.yml`, with `<dataset-name>` replaced by the actual name of the dataset. This is the same as the identifier used in `datasets.yml`. The format of this file is exactly the same as `default.yml`.
 
 Refer to `default.yml` for a list of all options.
````
````diff
@@ -67,15 +62,15 @@ construction:
 ```
 will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is particularly useful when running a grid search to identify the best performing parameters.
 
-### run.yml
+### run-config.yml
 
 This file contains configurations for
 - Specifying the measurements you want to report, like QPS, latency and recall
 - Specifying where to output these measurements, i.e. to the console, or to a file, or both.
 
 The configurations in this file are "run-level", meaning that they are shared across all the datasets being benchmarked.
 
-See `run.yml` for a full list of all options.
+See `run-config.yml` for a full list of all options.
````
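The grid expansion described in this hunk (list-valued construction parameters multiplying out into one graph build per combination) can be sketched as a Cartesian product. The `Config` record and `expand` helper below are hypothetical names for illustration, not Bench's actual configuration classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how list-valued YAML parameters expand into a benchmark grid;
// the names here are illustrative, not the real Bench configuration types.
public class GridSketch {
    record Config(int m, int ef) {}

    static List<Config> expand(int[] ms, int[] efs) {
        List<Config> grid = new ArrayList<>();
        for (int m : ms)
            for (int ef : efs)
                grid.add(new Config(m, ef)); // one graph is built per combination
        return grid;
    }

    public static void main(String[] args) {
        // M in {32, 64} and ef in {100, 200} yield four configurations.
        List<Config> grid = expand(new int[]{32, 64}, new int[]{100, 200});
        System.out.println(grid.size()); // 4
    }
}
```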
````diff
-Using fvec/ivec datasets requires them to be configured in `DataSetLoaderMFD.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark.
-
-To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
-- Ensure that you have three files:
-  - `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
-  - `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
-  - `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
-  The files can be named however you like.
-- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
-- Edit `DataSetLoaderMFD.java` to configure a new dataset and its associated files:
-```java
-put("cust-ds", newMultiFileDatasource("cust-ds",
-    "cust-ds/base.fvec",
-    "cust-ds/query.fvec",
-    "cust-ds/neighbors.ivec"));
-```
-The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
-In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you do not want to benchmark.
+Datasets are configured via YAML catalog files under `jvector-examples/yaml-configs/dataset-catalogs/`. The loader recursively discovers all `.yaml`/`.yml` files in that directory tree. See `jvector-examples/yaml-configs/dataset-catalogs/local-catalog.yaml` for the full format reference.
+
+To add a custom fvecs/ivecs dataset:
+
+1. Add a `.yaml` file to the YAML catalog directory, mapping your dataset name to its files:
+```yaml
+_defaults:
+  cache_dir: ${DATASET_CACHE_DIR:-dataset_cache}
+
+my-dataset:
+  base: my_base_vectors.fvecs
+  query: my_query_vectors.fvecs
+  gt: my_ground_truth.ivecs
+```
+2. Place your fvecs/ivecs files at the paths you specified in the YAML (or specify a `cache_dir` / `base_url` to fetch them from a remote source).
+3. Add the dataset's similarity function to `jvector-examples/yaml-configs/dataset-metadata.yml`:
+```yaml
+my-dataset:
+  similarity_function: COSINE
+  load_behavior: NO_SCRUB
+```
+4. Add the dataset name to `jvector-examples/yaml-configs/datasets.yml` so BenchYAML can find it:
 ```yaml
 custom:
--  - cust-ds
+  - my-dataset
 ```
 
-## Custom HDF5 datasets
-
-HDF5 datasets consist of a single file. The Hdf5Loader looks for three HDF5 datasets within the file, `train`, `test` and `neighbors`. These correspond to the base, query and neighbors vectors described above for fvec/ivec files.
-
-To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
-```yaml
-category:
-  - <dataset-name>.hdf5
-```
-
-BenchYAML looks for hdf5 datasets with the name `<dataset-name>.hdf5` in the `hdf5` folder in the root of this repo. If the file doesn't exist, BenchYAML will attempt to automatically download the dataset from ann-benchmarks.com. If your dataset is not from ann-benchmarks.com, simply ensure that the dataset is available in the `hdf5` folder and edit `datasets.yml` accordingly.
+For remote datasets, use `base_url` to specify where files should be downloaded from. The `${VAR}` and `${VAR:-default}` syntax is supported for environment variable expansion. See the example config for details.
````
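The `${VAR}` / `${VAR:-default}` expansion that the new catalog docs describe can be sketched with a small regex-based substitution. This is a hypothetical illustration of the behavior, not JVector's implementation; the `EnvExpansion` class and `expand` method are invented names.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of ${VAR} / ${VAR:-default} expansion as described in
// the catalog docs; names are illustrative, not JVector API.
public class EnvExpansion {
    private static final Pattern VAR =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\\}");

    static String expand(String input, Function<String, String> env) {
        Matcher m = VAR.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = env.apply(m.group(1));
            if (value == null) {
                // Fall back to the ":-" default, or empty if none was given.
                value = m.group(2) != null ? m.group(2) : "";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> env = Map.of("DATASET_CACHE_DIR", "/data/cache");
        System.out.println(expand("${DATASET_CACHE_DIR:-dataset_cache}", env::get)); // /data/cache
        System.out.println(expand("${MISSING_DIR:-dataset_cache}", k -> null));      // dataset_cache
    }
}
```

Taking the environment as a `Function<String, String>` rather than calling `System.getenv` directly keeps the sketch testable and mirrors the PR's note about platform-specific (e.g. Windows) env-var handling being covered by tests.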
**jvector-examples/README.md** (+2 −2)

```diff
@@ -11,8 +11,8 @@ A simple benchmark for the sift dataset located in the [siftsmall](./siftsmall)
 Performs grid search across the `GraphIndexBuilder` parameter space to find
 the best tradeoffs between recall and throughput.
 
-This benchmark requires datasets from [https://github.com/erikbern/ann-benchmarks](https://github.com/erikbern/ann-benchmarks/blob/main/README.md#data-sets) to be downloaded to
-directories `hdf5` or `fvec` under the project root depending on the dataset format.
+This benchmark requires `fvecs` versions of datasets from [https://github.com/erikbern/ann-benchmarks](https://github.com/erikbern/ann-benchmarks/blob/main/README.md#data-sets) to be downloaded to the `dataset_cache`
+directory under the project root.
 
 You can use [`plot_output.py`](./plot_output.py) to graph the [pareto-optimal points](https://en.wikipedia.org/wiki/Pareto_efficiency) found by `Bench`.
```
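The recall metric mentioned above compares the ids an index returns against the `neighbors` ground truth. A minimal sketch of recall@k, assuming the usual definition (fraction of the true k nearest neighbors that were returned, averaged over queries); this is illustrative, not Bench's measurement code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative recall@k (NOT Bench's implementation): for each query, the
// fraction of its true top-k neighbors that appear in the returned top-k,
// averaged over all queries.
public class RecallSketch {
    static double recallAtK(int[][] groundTruth, int[][] results, int k) {
        double total = 0;
        for (int q = 0; q < groundTruth.length; q++) {
            Set<Integer> truth = new HashSet<>();
            for (int i = 0; i < k; i++) truth.add(groundTruth[q][i]);
            long hits = Arrays.stream(results[q]).limit(k).filter(truth::contains).count();
            total += (double) hits / k;
        }
        return total / groundTruth.length;
    }

    public static void main(String[] args) {
        int[][] gt = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        int[][] found = {{1, 2, 9, 4}, {5, 6, 7, 8}};
        // Query 0 recovers 3 of 4 true neighbors, query 1 recovers all 4.
        System.out.println(recallAtK(gt, found, 4)); // 0.875
    }
}
```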