
Commit 8c75f1b

jshook and tlwillke authored
Catalog-driven dataset loader (#654)
- Introduce a simplified catalog-driven dataset loader and fully switch benchmarking to use it.
- Disable/remove legacy dataset loaders and legacy scrubbing behavior; dataset metadata now uses NO_SCRUB.
- Add transport-routed remote loading with S3 support, shared client reuse, and parallel base/query/gt downloads.
- Support local catalog auto-discovery, base_url indirection, and YAML-driven dataset configuration.
- Cache included remote catalogs locally so previously downloaded remote datasets remain usable offline, while preserving local catalog override precedence.
- Improve loader robustness with better error handling, safer logging redaction, and more reliable dataset metadata path resolution across working directories.
- Expand test coverage for remote catalog loading, local-vs-remote precedence, offline cached-catalog behavior, and Windows env-var handling.
- Refactor dataset catalog/layout conventions, including public/protected cache subdirectories under dataset_cache.
- Refresh loader and dataset documentation, including local/remote behavior, benchmarking paths, and catalog examples.
- Clean up supporting project configuration, including .gitignore, RAT excludes, GitHub Actions dataset secret handling, and related benchmark YAML organization.

---------

Co-authored-by: Ted Willke <ted.willke@gmail.com>
1 parent 6fa6278 commit 8c75f1b

41 files changed

Lines changed: 2925 additions & 682 deletions


.github/workflows/run-bench.yml

Lines changed: 14 additions & 2 deletions
```diff
@@ -126,6 +126,20 @@ jobs:
           ref: ${{ matrix.branch }}
           fetch-depth: 0

+      # ==========================================
+      # Decode and write the protected dataset catalog
+      #
+      # TO UPDATE THIS SECRET:
+      # 1. On your local machine, run:
+      #    base64 -i jvector-examples/yaml-configs/dataset-catalogs/protected-catalog.yaml
+      # 2. Go to GitHub Repo -> Settings -> Secrets and variables -> Actions
+      # 3. Update the PROTECTED_CATALOG_YAML secret with the new Base64 string.
+      # ==========================================
+      - name: Inject Protected Catalog
+        run: |
+          mkdir -p jvector-examples/yaml-configs/dataset-catalogs
+          echo "${{ secrets.PROTECTED_CATALOG_YAML }}" | base64 -d > jvector-examples/yaml-configs/dataset-catalogs/protected-catalog.yaml
+
       # Create a directory to store benchmark results
       - name: Create results directory
         run: mkdir -p benchmark_results
@@ -137,8 +151,6 @@ jobs:
       # Run the benchmark if jvector-examples exists
       - name: Run benchmark
         id: run-benchmark
-        env:
-          DATASET_HASH: ${{ secrets.DATASETS_KEYPATH }}
         run: |
           # Check if jvector-examples directory and AutoBenchYAML class exist
           if [ ! -d "jvector-examples" ]; then
```
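The workflow comment above documents a manual secret-update procedure. Here is a quick local sketch of the encode/decode round trip, using a stand-in catalog file; reading stdin with `base64` is portable across GNU and BSD, while the `-i` flag in the comment is the macOS input-file form:

```shell
# Stand-in catalog file for the round-trip check (contents are illustrative)
printf 'my-dataset:\n  base: my_base_vectors.fvecs\n' > protected-catalog.yaml

# Encode: this Base64 text is what goes into the PROTECTED_CATALOG_YAML secret...
base64 < protected-catalog.yaml > catalog.b64

# ...and decode, which is what the "Inject Protected Catalog" step does in CI
base64 -d < catalog.b64 > roundtrip.yaml
cmp protected-catalog.yaml roundtrip.yaml && echo "round-trip OK"
```

If you use the GitHub CLI, `gh secret set PROTECTED_CATALOG_YAML < catalog.b64` updates the secret without visiting the Settings page.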

.gitignore

Lines changed: 11 additions & 0 deletions
```diff
@@ -3,10 +3,19 @@ local/
 .mvn/wrapper/maven-wrapper.jar
 .java-version
 .bob/
+dataset_
+**/local_datasets/**

 ### Bench caches
 pq_cache/
 index_cache/
+dataset_cache/
+
+### Data catalogs
+jvector-examples/yaml-configs/dataset-catalogs/*.yaml
+jvector-examples/yaml-configs/dataset-catalogs/*.yml
+!jvector-examples/yaml-configs/dataset-catalogs/public-catalog.yaml
+jvector-examples/yaml-configs/dataset-catalogs/.catalog-cache/

 ### Logging (or whatever you use)
 logging/
@@ -49,3 +58,5 @@ hdf5/
 # JMH generated files
 dependency-reduced-pom.xml
 results.csv
+**/datasets/custom/**
+**/dataset_cache/**
```
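The catalog ignore rules above rely on gitignore ordering: the `*.yaml`/`*.yml` wildcards exclude all catalogs, and the later `!...public-catalog.yaml` negation re-includes only the public one. A throwaway repo makes it easy to verify a pattern stack like this (paths simplified for the sketch):

```shell
# Throwaway repo to check gitignore negation ordering
repo=$(mktemp -d); cd "$repo"; git init -q
mkdir -p catalogs
printf 'catalogs/*.yaml\n!catalogs/public-catalog.yaml\n' > .gitignore
touch catalogs/public-catalog.yaml catalogs/protected-catalog.yaml

# check-ignore prints only the paths that are ignored:
# the protected catalog is listed, the public one is not
git check-ignore catalogs/public-catalog.yaml catalogs/protected-catalog.yaml
```

Note that negation only works if no parent directory of the file is itself ignored, which is why the rules above exclude `*.yaml` files rather than the whole `dataset-catalogs/` directory.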

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -74,7 +74,7 @@ You may also use method-level filtering and patterns, e.g.,
 (The `failIfNoSpecifiedTests` option works around a quirk of surefire: it is happy to run `test` with submodules with empty test sets,
 but as soon as you supply a filter, it wants at least one match in every submodule.)

-You can run `SiftSmall` and `Bench` directly to get an idea of what all is going on here. `Bench` will automatically download required datasets to the `fvec` and `hdf5` directories.
+You can run `SiftSmall` and `Bench` directly to get an idea of what all is going on here. `Bench` will automatically download required datasets to the `dataset_cache` directory.
 The files used by `SiftSmall` can be found in the [siftsmall directory](./siftsmall) in the project root.

 To run either class, you can use the Maven exec-plugin via the following incantations:
```

docs/benchmarking.md

Lines changed: 42 additions & 55 deletions
````diff
@@ -4,21 +4,19 @@ JVector comes with a built-in benchmarking system in `jvector-examples/.../Bench

 To run a benchmark
 - Decide which dataset(s) you want to benchmark. A dataset consists of
-  - The vectors to be indexed, usually called the "base" or "target" vectors.
-  - The query vectors.
-  - The "ground truth" results which are used to compute accuracy metrics.
-  - The similarity metric which should have been used to compute the ground truth (dot product, cosine similarity or L2 distance).
-- Configure the parameter combinations for which you want to run the benchmark. This includes graph index parameters, quantization parameters and search parameters.
+  - The vectors to be indexed, usually called the "base" or "target" vectors
+  - The query vectors
+  - The "ground truth" results that are used to compute accuracy metrics
+  - The similarity metric used to compute the ground truth (dot product, cosine similarity or L2 distance)
+- Configure the parameter combinations for which you want to run the benchmark. This includes index construction parameters, quantization parameters and search parameters.

-JVector supports two types of datasets:
-- **Fvec/Ivec**: The dataset consists of three files, for example `base.fvec`, `queries.fvec` and `neighbors.ivec` containing the base vectors, query vectors, and ground truth. (`fvec` and `ivec` file formats are described [here](http://corpus-texmex.irisa.fr/))
-- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.
+JVector supports datasets in the fvecs/ivecs format. These consist of three files, for example `base.fvecs`, `queries.fvecs` and `neighbors.ivecs` containing the base vectors, query vectors, and ground truth. (`fvecs` and `ivecs` file formats are described [here](http://corpus-texmex.irisa.fr/))

 The general procedure for running benchmarks is outlined below. The following sections describe the process in more detail.
 - [Specify the dataset](#specifying-datasets) names to benchmark in `datasets.yml`.
   - Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and made available (refer to the section on [Custom datasets](#custom-datasets)).
-- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<your-dataset-name>.yml` in the same folder.
-- Decide on the kind of measurements and logging you want and configure them in `run.yml`.
+- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<your-dataset-name>.yml` in the `index-parameters` subfolder.
+- Decide on the kind of measurements and logging you want and configure them in `run-config.yml`.

 You can run the configured benchmark with maven:
 ```sh
````
````diff
@@ -31,31 +29,28 @@ The datasets you want to benchmark should be specified in `jvector-examples/yaml

 To benchmark a single dataset, comment out the entries corresponding to all other datasets. (Or provide command line arguments as described in [Running `bench` from the command line](#running-bench-from-the-command-line))

-Datasets are assumed to be Fvec/Ivec based unless the entry in the `datasets.yml` ends with `.hdf5`. In this case, `.hdf5` is not considered part of the "dataset name" referenced in other sections.
+Datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.

-You'll notice that datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.
-
-For HDF5 files, the substrings `-angular`, `-euclidean` and `-dot` correspond to cosine similarity, L2 distance, and dot product similarity functions (these substrings ARE considered to be part of the "dataset name"). Currently, Fvec/Ivec datasets are implicitly assumed to use cosine similarity (changing this requires editing `DataSetLoaderMFD.java`).
+Dataset similarity functions are configured in `jvector-examples/yaml-configs/dataset-metadata.yml`.

 Example `datasets.yml`:

 ```yaml
 category0:
-  - my-fvec-dataset # fvec/ivec dataset, cosine similarity
-  - my-hdf5-dataset-angular.hdf5 # hdf5 dataset, cosine similarity
+  - my-dataset-a
+  - my-dataset-b
 some-other-category:
-  - a-huge-dataset-1024d-euclidean.hdf5 # hdf5 dataset, L2 similarity
-  - my-simple-dataset-dot.hdf5 # hdf5 dataset, dot product similarity
-  - some-dataset-euclidean # fvec/ivec dataset, cosine similarity (NOT L2 unless you change the code!)
+  - another-dataset-a
+  - another-dataset-b
 ```

 ## Setting benchmark parameters

 ### default.yml / \<dataset-name\>.yml

-`jvector-examples/yaml-configs/default.yml` specifies the default index construction and search parameters to be used by `bench` for all datasets.
+`jvector-examples/yaml-configs/index-parameters/default.yml` specifies the default index construction and search parameters to be used by `bench` for all datasets.

-You can specify a custom set of parameters for any given dataset by creating a file called `<dataset-name>.yml`, with `<dataset-name>` replaced by the actual name of the dataset. This is the same as the identifier used in `datasets.yml`, but without the `.hdf5` suffix for hdf5 datasets. The format of this file is exactly the same as `default.yml`.
+You can specify a custom set of parameters for any given dataset by creating a file called `<dataset-name>.yml`, with `<dataset-name>` replaced by the actual name of the dataset. This is the same as the identifier used in `datasets.yml`. The format of this file is exactly the same as `default.yml`.

 Refer to `default.yml` for a list of all options.

````
````diff
@@ -67,15 +62,15 @@ construction:
 ```
 will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is particularly useful when running a grid search to identify the best performing parameters.

-### run.yml
+### run-config.yml

 This file contains configurations for
 - Specifying the measurements you want to report, like QPS, latency and recall
 - Specifying where to output these measurements, i.e. to the console, or to a file, or both.

 The configurations in this file are "run-level", meaning that they are shared across all the datasets being benchmarked.

-See `run.yml` for a full list of all options.
+See `run-config.yml` for a full list of all options.

 ## Running `bench` from the command line

````
````diff
@@ -86,45 +81,37 @@ mvn compile exec:exec@bench -pl jvector-examples -am

 To benchmark a subset of the datasets in `datasets.yml`, you can provide a space-separated list of regexes as arguments.
 ```sh
-# matches `glove-25-angular.hdf5`, `glove-50-angular.hdf5`, `nytimes-256-angular.hdf5` etc
+# matches `glove-25-angular`, `glove-50-angular`, `nytimes-256-angular` etc
 mvn compile exec:exec@bench -pl jvector-examples -am -DbenchArgs="glove nytimes"
 ```

 ## Custom Datasets

-### Custom Fvec/Ivec datasets
-
-Using fvec/ivec datasets requires them to be configured in `DataSetLoaderMFD.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark.
-
-To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
-- Ensure that you have three files:
-  - `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
-  - `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
-  - `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
-  The files can be named however you like.
-- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
-- Edit `DataSetLoaderMFD.java` to configure a new dataset and its associated files:
-  ```java
-  put("cust-ds", new MultiFileDatasource("cust-ds",
-      "cust-ds/base.fvec",
-      "cust-ds/query.fvec",
-      "cust-ds/neighbors.ivec"));
+Datasets are configured via YAML catalog files under `jvector-examples/yaml-configs/dataset-catalogs/`. The loader recursively discovers all `.yaml`/`.yml` files in that directory tree. See `jvector-examples/yaml-configs/dataset-catalogs/local-catalog.yaml` for the full format reference.
+
+To add a custom fvecs/ivecs dataset:
+
+1. Add a `.yaml` file to the YAML catalog directory, mapping your dataset name to its files:
+   ```yaml
+   _defaults:
+     cache_dir: ${DATASET_CACHE_DIR:-dataset_cache}
+
+   my-dataset:
+     base: my_base_vectors.fvecs
+     query: my_query_vectors.fvecs
+     gt: my_ground_truth.ivecs
+   ```
+2. Place your fvecs/ivecs files at the paths you specified in the YAML (or specify a `cache_dir` / `base_url` to fetch them from a remote source).
+3. Add the dataset's similarity function to `jvector-examples/yaml-configs/dataset-metadata.yml`:
+   ```yaml
+   my-dataset:
+     similarity_function: COSINE
+     load_behavior: NO_SCRUB
 ```
-The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
-- In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you do not want to benchmark.
+4. Add the dataset name to `jvector-examples/yaml-configs/datasets.yml` so BenchYAML can find it:
 ```yaml
 custom:
-  - cust-ds
+  - my-dataset
 ```

-## Custom HDF5 datasets
-
-HDF5 datasets consist of a single file. The Hdf5Loader looks for three HDF5 datasets within the file, `train`, `test` and `neighbors`. These correspond to the base, query and neighbors vectors described above for fvec/ivec files.
-
-To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
-```yaml
-category:
-  - <dataset-name>.hdf5
-```
-
-BenchYAML looks for hdf5 datasets with the name `<dataset-name>.hdf5` in the `hdf5` folder in the root of this repo. If the file doesn't exist, BenchYAML will attempt to automatically download the dataset from ann-benchmarks.com. If your dataset is not from ann-benchmarks.com, simply ensure that the dataset is available in the `hdf5` folder and edit `datasets.yml` accordingly.
+For remote datasets, use `base_url` to specify where files should be downloaded from. The `${VAR}` and `${VAR:-default}` syntax is supported for environment variable expansion. See the example config for details.
````
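The `${VAR}` / `${VAR:-default}` forms in the catalog mirror shell parameter expansion, which makes it easy to sanity-check the value a catalog entry will resolve to from a terminal; the snippet below demonstrates plain shell semantics with the variable name from the catalog example, not the JVector loader itself:

```shell
# Variable unset: the fallback after ':-' is used
unset DATASET_CACHE_DIR
echo "cache_dir = ${DATASET_CACHE_DIR:-dataset_cache}"
# prints: cache_dir = dataset_cache

# Variable set: the environment value wins over the default
DATASET_CACHE_DIR=/mnt/bench/cache
echo "cache_dir = ${DATASET_CACHE_DIR:-dataset_cache}"
# prints: cache_dir = /mnt/bench/cache
```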

jvector-examples/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -11,8 +11,8 @@ A simple benchmark for the sift dataset located in the [siftsmall](./siftsmall)
 Performs grid search across the `GraphIndexBuilder` parameter space to find
 the best tradeoffs between recall and throughput.

-This benchmark requires datasets from [https://github.com/erikbern/ann-benchmarks](https://github.com/erikbern/ann-benchmarks/blob/main/README.md#data-sets) to be downloaded to hdf5 and fvec
-directories `hdf5` or `fvec` under the project root depending on the dataset format.
+This benchmark requires `fvecs` versions of datasets from [https://github.com/erikbern/ann-benchmarks](https://github.com/erikbern/ann-benchmarks/blob/main/README.md#data-sets) to be downloaded to the `dataset_cache`
+directory under the project root.

 You can use [`plot_output.py`](./plot_output.py) to graph the [pareto-optimal points](https://en.wikipedia.org/wiki/Pareto_efficiency) found by `Bench`.
```

jvector-examples/src/main/java/io/github/jbellis/jvector/example/BenchYAML.java

Lines changed: 2 additions & 2 deletions
```diff
@@ -94,11 +94,11 @@ public static void main(String[] args) throws IOException {
             RunConfig runCfg = RunConfig.loadDefault();
             artifacts = RunArtifacts.open(runCfg, allConfigs);
         } catch (java.io.FileNotFoundException e) {
-            // Legacy yamlSchemaVersion "0" behavior: no run.yml
+            // Legacy yamlSchemaVersion "0" behavior: no run-config.yml
             // - logging disabled
             // - console shows compute selection
             // - compute selection comes from legacy search.benchmarks if present, else default
-            System.err.println("WARNING: run.yml not found. Falling back to deprecated legacy behavior: "
+            System.err.println("WARNING: run-config.yml not found. Falling back to deprecated legacy behavior: "
                     + "no logging, console mirrors computed benchmarks.");

             Map<String, List<String>> legacyBenchmarks = null;
```

jvector-examples/src/main/java/io/github/jbellis/jvector/example/HelloVectorWorld.java

Lines changed: 3 additions & 4 deletions
```diff
@@ -16,7 +16,7 @@

 package io.github.jbellis.jvector.example;

-import io.github.jbellis.jvector.example.benchmarks.datasets.DataSetLoaderMFD;
+import io.github.jbellis.jvector.example.benchmarks.datasets.DataSets;
 import io.github.jbellis.jvector.example.reporting.RunArtifacts;
 import io.github.jbellis.jvector.example.yaml.MultiConfig;
 import io.github.jbellis.jvector.example.yaml.RunConfig;
@@ -36,9 +36,8 @@ public static void main(String[] args) throws IOException {
         // Run-level policy config (benchmarks/console/logging + run metadata)
         RunConfig runCfg = RunConfig.loadDefault();

-        // Load dataset
-        var ds = new DataSetLoaderMFD().loadDataSet(datasetName)
-                .orElseThrow(() -> new RuntimeException("dataset " + datasetName + " not found"))
+        var ds = DataSets.loadDataSet(datasetName).orElseThrow(
+                () -> new RuntimeException("dataset " + datasetName + " not found"))
                 .getDataSet();

         // Run artifacts + selections (sys_info/dataset_info/experiments.csv)
```
