DataMetaMap

Datasets in a shared vector space


DataMetaMap is a Python library for representing datasets in a shared vector space, so you can compare datasets (and tasks) using standard distances and similarity metrics.

It includes multiple dataset embedding algorithms implemented on top of PyTorch:

  • Dataset2Vec (tabular datasets)
  • Task2Vec (supervised tasks via Fisher information)
  • Wasserstein Task Embedding (Optimal Transport based)
  • MMD (maximum mean discrepancy; used as a baseline in some workflows)
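
For example, once two datasets are embedded in the shared space, comparing them reduces to ordinary vector arithmetic. A minimal, self-contained sketch in plain NumPy (not the DataMetaMap API; the stand-in vectors are hypothetical):

import numpy as np

# Stand-ins for two dataset embeddings produced by any of the algorithms above.
z1 = np.random.randn(128)
z2 = np.random.randn(128)

euclidean = np.linalg.norm(z1 - z2)                           # distance
cosine = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))  # similarity
print(f"euclidean={euclidean:.3f}, cosine={cosine:.3f}")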

📬 Assets

  1. Technical Meeting 1 - Presentation
  2. Blog Post
  3. Technical Report

💡 Motivation

If you can measure similarity between datasets, you can:

  • retrieve the most similar dataset(s) to a target dataset
  • choose better pretraining sources
  • cluster tasks and datasets, and visualize the dataset landscape
  • track dataset drift over time

🗃 Algorithms

The implemented algorithms are the ones listed in the overview above (Dataset2Vec, Task2Vec, Wasserstein Task Embedding, and MMD); see the Quickstart below and the demo notebooks for usage.

🛠️ Install

Requires Python 3.10+.

Virtual Environment (venv)

Recommended: install into an isolated virtual environment.

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

Windows (PowerShell):

py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install -U pip

Install from source

git clone https://github.com/intsystems/DataMetaMap.git
cd DataMetaMap
python -m pip install .

Development install (editable, with the dev and viz extras)

python -m pip install -e ".[dev,viz]"

🚀 Quickstart

Dataset2Vec (tabular)

Dataset2VecEmbedder trains on a collection of tabular datasets, then embeds a single dataset as a vector.

import numpy as np
import torch

from data_meta_map.models import get_model
from data_meta_map.dataset2vec_embedder import Dataset2VecEmbedder

# Model for tabular embedding
model = get_model("dataset2vec")
embedder = Dataset2VecEmbedder(model, max_epochs=1, batch_size=8, n_batches=5)

# Each training dataset: last column is the target
train_ds1 = np.random.randn(64, 6).astype(np.float32)
train_ds2 = np.random.randn(64, 6).astype(np.float32)
embedder.fit([train_ds1, train_ds2])

X = torch.randn(32, 5)
y = torch.randint(0, 2, (32,)).float()
z = embedder.embed(X, y)
print(z.shape)  # (output_size,)
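
The returned embedding can be compared with any other dataset embedding. A minimal follow-up, reusing embedder, z, and torch from the snippet above (the second dataset here is synthetic):

# Embed a second (synthetic) dataset and compare the two embeddings.
X2 = torch.randn(32, 5)
y2 = torch.randint(0, 2, (32,)).float()
z2 = embedder.embed(X2, y2)
print(torch.dist(z, z2))  # smaller distance = more similar datasets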

Wasserstein Task Embedding (PyTorch Dataset / DataLoader)

WassersteinEmbedder can compute class statistics from a dataset and build embeddings via a distance matrix. See demo/wasserstein/simple_example1 (1).ipynb for an end-to-end notebook.
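
The notebook shows the library's actual API. As a rough illustration of the underlying idea only (not WassersteinEmbedder itself), the sketch below builds a pairwise distance matrix from an averaged per-feature 1-D Wasserstein distance, then turns it into coordinates with classical MDS; all names and the toy data are hypothetical:

import numpy as np
from scipy.stats import wasserstein_distance

# Average the 1-D Wasserstein distance over features (a crude sliced-OT-style proxy).
def dataset_distance(a, b):
    return float(np.mean([wasserstein_distance(a[:, j], b[:, j])
                          for j in range(a.shape[1])]))

rng = np.random.default_rng(0)
datasets = [rng.normal(loc=i, size=(64, 5)) for i in range(4)]  # toy datasets

# Pairwise distance matrix between datasets.
n = len(datasets)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dataset_distance(datasets[i], datasets[j])

# Classical MDS: double-center the squared distances and eigendecompose.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, v = np.linalg.eigh(B)
top = np.argsort(w)[::-1][:2]
coords = v[:, top] * np.sqrt(np.maximum(w[top], 0))
print(coords.shape)  # (4, 2): one 2-D point per dataset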

Task2Vec (supervised tasks)

Task2Vec computes a task embedding based on the Fisher information of a probe network. See demo/task2vec/simple_example.ipynb for an example workflow.
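
Again, the notebook demonstrates the real API. To convey the idea only, here is a sketch of an empirical diagonal Fisher embedding for a toy probe network; note that the published Task2Vec method uses a fixed pretrained probe and the model's predictive distribution, which this simplified sketch does not:

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(32, 5)              # toy task: 32 samples, 5 features
y = torch.randint(0, 2, (32,))      # binary labels

probe = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

# Empirical diagonal Fisher: average the squared per-sample loss gradients.
fisher = [torch.zeros_like(p) for p in probe.parameters()]
for i in range(len(X)):
    probe.zero_grad()
    loss_fn(probe(X[i:i + 1]), y[i:i + 1]).backward()
    for f, p in zip(fisher, probe.parameters()):
        f += p.grad.detach() ** 2

task_embedding = torch.cat([(f / len(X)).flatten() for f in fisher])
print(task_embedding.shape)  # one long vector; compare tasks by vector distance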

🎮 Demo

Notebooks are in demo/ (for example, the demo/wasserstein/ and demo/task2vec/ notebooks referenced above).

📈 Benchmarks

Benchmark notebooks and scripts live in benchmarks/. In particular, see benchmarks/pretrain_benchmark/ for experiments comparing transfer performance between pretraining sources and target tasks.

👥 Contributors

  • Vladislav Minashkin (Project planning, Benchmarking, Algorithms)
  • Papay Ivan (Documentation writing, Code writing, Algorithms)
  • Meshkov Vlad (Blog post, Demo, Algorithms)
  • Stepanov Ilya (Tech. report, Code writing, Algorithms)

You are welcome to contribute to our project!

🔗 Useful links

🧪 Development

Run tests:

pytest -q

Run tests with a coverage report:

pytest -q --cov=src/data_meta_map --cov-report=term-missing