DataMetaMap

Datasets in a shared vector space


DataMetaMap is a Python library for representing datasets in a shared vector space, so you can compare datasets (and tasks) using standard distances and similarity metrics.

It includes multiple dataset embedding algorithms implemented on top of PyTorch:

  • Dataset2Vec (tabular datasets)
  • Task2Vec (supervised tasks via Fisher information)
  • Wasserstein Task Embedding (Optimal Transport based)
  • MMD (maximum mean discrepancy; used as a baseline in some workflows)
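
For example, once two datasets are embedded in the shared space, comparing them reduces to ordinary vector arithmetic. A minimal, self-contained sketch in plain NumPy (not the DataMetaMap API; the stand-in vectors are hypothetical):

import numpy as np

# Stand-ins for two dataset embeddings produced by any of the algorithms above.
z1 = np.random.randn(128)
z2 = np.random.randn(128)

euclidean = np.linalg.norm(z1 - z2)                           # distance
cosine = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))  # similarity
print(f"euclidean={euclidean:.3f}, cosine={cosine:.3f}")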

📬 Assets

  1. Technical Meeting 1 - Presentation
  2. Blog Post
  3. Technical Report

💡 Motivation

If you can measure similarity between datasets, you can:

  • retrieve the most similar dataset(s) to a target dataset
  • choose better pretraining sources
  • cluster tasks and datasets, and visualize the dataset landscape
  • track dataset drift over time

🗃 Algorithms

The implemented algorithms are the ones listed in the overview above (Dataset2Vec, Task2Vec, Wasserstein Task Embedding, and MMD); see the Quickstart below and the demo notebooks for usage.

🛠️ Install

Requires Python 3.10+.

Virtual Environment (venv)

Recommended: install into an isolated virtual environment.

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

Windows (PowerShell):

py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install -U pip

Install from source

git clone https://github.com/intsystems/DataMetaMap.git
cd DataMetaMap
python -m pip install .

Development install (editable, with the dev and viz extras)

python -m pip install -e ".[dev,viz]"

🚀 Quickstart

Dataset2Vec (tabular)

Dataset2VecEmbedder trains on a collection of tabular datasets, then embeds a single dataset as a vector.

import numpy as np
import torch

from data_meta_map.models import get_model
from data_meta_map.dataset2vec_embedder import Dataset2VecEmbedder

# Model for tabular embedding
model = get_model("dataset2vec")
embedder = Dataset2VecEmbedder(model, max_epochs=1, batch_size=8, n_batches=5)

# Each training dataset: last column is the target
train_ds1 = np.random.randn(64, 6).astype(np.float32)
train_ds2 = np.random.randn(64, 6).astype(np.float32)
embedder.fit([train_ds1, train_ds2])

X = torch.randn(32, 5)
y = torch.randint(0, 2, (32,)).float()
z = embedder.embed(X, y)
print(z.shape)  # (output_size,)
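
The returned embedding can be compared with any other dataset embedding. A minimal follow-up, reusing embedder, z, and torch from the snippet above (the second dataset here is synthetic):

# Embed a second (synthetic) dataset and compare the two embeddings.
X2 = torch.randn(32, 5)
y2 = torch.randint(0, 2, (32,)).float()
z2 = embedder.embed(X2, y2)
print(torch.dist(z, z2))  # smaller distance = more similar datasets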

Wasserstein Task Embedding (PyTorch Dataset / DataLoader)

WassersteinEmbedder can compute class statistics from a dataset and build embeddings via a distance matrix. See demo/wasserstein/simple_example1 (1).ipynb for an end-to-end notebook.
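
The notebook shows the library's actual API. As a rough illustration of the underlying idea only (not WassersteinEmbedder itself), the sketch below builds a pairwise distance matrix from an averaged per-feature 1-D Wasserstein distance, then turns it into coordinates with classical MDS; all names and the toy data are hypothetical:

import numpy as np
from scipy.stats import wasserstein_distance

# Average the 1-D Wasserstein distance over features (a crude sliced-OT-style proxy).
def dataset_distance(a, b):
    return float(np.mean([wasserstein_distance(a[:, j], b[:, j])
                          for j in range(a.shape[1])]))

rng = np.random.default_rng(0)
datasets = [rng.normal(loc=i, size=(64, 5)) for i in range(4)]  # toy datasets

# Pairwise distance matrix between datasets.
n = len(datasets)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dataset_distance(datasets[i], datasets[j])

# Classical MDS: double-center the squared distances and eigendecompose.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, v = np.linalg.eigh(B)
top = np.argsort(w)[::-1][:2]
coords = v[:, top] * np.sqrt(np.maximum(w[top], 0))
print(coords.shape)  # (4, 2): one 2-D point per dataset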

Task2Vec (supervised tasks)

Task2Vec computes a task embedding based on the Fisher information of a probe network. See demo/task2vec/simple_example.ipynb for an example workflow.
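
Again, the notebook demonstrates the real API. To convey the idea only, here is a sketch of an empirical diagonal Fisher embedding for a toy probe network; note that the published Task2Vec method uses a fixed pretrained probe and the model's predictive distribution, which this simplified sketch does not:

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(32, 5)              # toy task: 32 samples, 5 features
y = torch.randint(0, 2, (32,))      # binary labels

probe = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

# Empirical diagonal Fisher: average the squared per-sample loss gradients.
fisher = [torch.zeros_like(p) for p in probe.parameters()]
for i in range(len(X)):
    probe.zero_grad()
    loss_fn(probe(X[i:i + 1]), y[i:i + 1]).backward()
    for f, p in zip(fisher, probe.parameters()):
        f += p.grad.detach() ** 2

task_embedding = torch.cat([(f / len(X)).flatten() for f in fisher])
print(task_embedding.shape)  # one long vector; compare tasks by vector distance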

🎮 Demo

Notebooks are in demo/ (for example, the demo/wasserstein/ and demo/task2vec/ notebooks referenced above).

📈 Benchmarks

Benchmark notebooks and scripts live in benchmarks/. In particular, see benchmarks/pretrain_benchmark/ for experiments comparing transfer performance between pretraining sources and target tasks.

👥 Contributors

  • Vladislav Minashkin (Project planning, Benchmarking, Algorithms)
  • Papay Ivan (Documentation writing, Code writing, Algorithms)
  • Meshkov Vlad (Blog post, Demo, Algorithms)
  • Stepanov Ilya (Tech. report, Code writing, Algorithms)

You are welcome to contribute to our project!

🔗 Useful links

🧪 Development

Run tests:

pytest -q

Run tests with a coverage report:

pytest -q --cov=src/data_meta_map --cov-report=term-missing