This repository offers tools for embedding texts in multiple languages with an efficient workflow. It uses the
transformers library by Hugging Face and the make tool to manage large datasets. Make ensures robust and
incremental processing, allowing you to handle new data, resume tasks, and run processes across different machines, all
while avoiding redundant work.
The embedder supports:
- Text-level embeddings (full page / full article)
- Sentence-level embeddings
- Chunk-level embeddings
make ensures reliable, incremental, and resumable processing across machines and environments. All outputs can be
uploaded safely to S3 or kept locally.
- Efficient Storage Management: Minimal local storage is required, as the data for each newspaper year is downloaded on the fly and truncated after upload to free space.
- Parallel Processing: Processes run in parallel to optimize throughput.
- Selective Processing: Only the necessary processing steps are executed, ensuring efficiency by not reprocessing existing outputs on S3.
- S3 Integration: Integration with S3 for storing and resuming processing. The system ensures no overwriting of files or partial uploads due to interruptions. It is also possible to run everything locally without S3.
- Custom Embedding Options: Flexible configuration via environment variables or make variables, including the ability to pin model versions and filter text data.
- Batch processing of texts is not yet implemented; it will be added in a future release.
- A specialized xformers implementation for sparse-attention inference is not yet supported; it should be added in a future release for faster inference.
- Local build directory: temporary workspace used to mirror the S3 structure.
- S3 output bucket: permanent storage for processed embeddings (optional).

Processed embeddings are uploaded to:

```
s3://$OUT_S3_BUCKET_PROCESSED_DATA/textembeddings-<MODEL>-<VERSION>/<PROVIDER>/<NEWSPAPER>/<NEWSPAPER-YEAR>.jsonl.bz2
```
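For example, with the sample configuration shown later in this README, one concrete output object would be:

```
s3://140-processed-data-sandbox/textembeddings-gte-multilingual-base-v1.0.1/SNL/EXP/EXP-1912.jsonl.bz2
```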
make uses local stamp files to track progress:

- `.stamp` files indicate S3 input availability.
- `.done` files indicate processed outputs.

A helper script `lib/sync_s3_filestamps.py` keeps stamps synced with S3.
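To make the idea concrete, stamp syncing can be pictured roughly as follows. This is a minimal, hypothetical sketch using boto3 with the credentials from `.env`; it is not the actual logic of `lib/sync_s3_filestamps.py`:

```python
import os
import boto3

# Hypothetical sketch of stamp syncing: for every input object on S3, create
# an empty local .stamp file so make can treat it as an existing prerequisite.
s3 = boto3.resource(
    "s3",
    aws_access_key_id=os.environ["SE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["SE_SECRET_KEY"],
    endpoint_url=os.environ["SE_HOST_URL"],
)

bucket = s3.Bucket("000-processing-test-samples")
for obj in bucket.objects.filter(Prefix="lingproc/lingproc-test-v1.0.0/"):
    stamp_path = os.path.join("build.d", bucket.name, obj.key + ".stamp")
    os.makedirs(os.path.dirname(stamp_path), exist_ok=True)
    # Touch an empty stamp file; make only needs its existence and mtime.
    open(stamp_path, "a").close()
```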
The processing follows a structured organization:
- S3 Buckets: Data is organized by processing steps and, in some cases, versions.
- Build Directory: A local mirror of the S3 storage, structured similarly for consistency.
```
# Example directory structure
BUILD_DIR/BUCKET/PROVIDER/NEWSPAPER/<NEWSPAPER-YEAR>.jsonl.bz2
BUILD_DIR/BUCKET/PROCESSING_TYPE-VERSION/PROVIDER/NEWSPAPER/<NEWSPAPER-YEAR>.jsonl.bz2
```
Example:

```
BUILD_DIR/
  # input bucket
  000-processing-test-samples/
    lingproc/lingproc-test-v1.0.0/
      SNL/
        EXP/
          EXP-1912.jsonl.bz2
  # output bucket
  140-processed-data-sandbox/
    textembeddings-gte-multilingual-base-v1.0.1/
      SNL/
        EXP/
          EXP-1912.jsonl.bz2
```
```
git clone git@github.com:impresso/impresso-text-embedder.git
cd impresso-text-embedder
cp dotenv.sample .env
```

Edit `.env`:

```
SE_ACCESS_KEY=<your-access-key>
SE_SECRET_KEY=<your-secret-key>
SE_HOST_URL=<host>
```

Install PyTorch (CUDA 12.4 wheels shown here) and the remaining requirements:

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip3 install -r requirements.txt
```

Copy the sample configuration:

```
cp local.config.sample.mk config.local.mk
```

You must now edit `config.local.mk` (full guide below).
```
make setup
make newspaper   # process according to the list in NEWSPAPER_LIST_FILE
make each        # process all newspapers in parallel
make help        # list all targets
```
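Before launching a full run, it can be useful to check that the embedding model loads and produces vectors. A minimal sketch, assuming the model from the sample config below; whether `trust_remote_code` is needed depends on the model (check its Hugging Face model card):

```python
# Optional sanity check: load the embedding model and embed one sentence.
# This is a hypothetical helper, not part of the repository.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Alibaba-NLP/gte-multilingual-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

batch = tokenizer(["Ein kurzer Testsatz."], return_tensors="pt",
                  truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**batch)
# CLS-style pooling; the exact pooling strategy depends on the model.
embedding = outputs.last_hidden_state[:, 0]
print(embedding.shape)
```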
Set the embedding mode in `config.local.mk`:

```
EMBEDDING_LEVEL_OPTION := sentence
# or: text, chunk
```

**Text level** embeds each full text object (page/article).
- Fastest mode.
- Best for document retrieval.
- No segmentation, but texts shorter than `EMBEDDING_MIN_CHAR_LENGTH` may be skipped.
**Sentence level** embeds each sentence individually (a minimal segmentation sketch follows below).
- Uses a multilingual sentence segmenter.
- Short sentences (< `EMBEDDING_MIN_CHAR_LENGTH`) may be skipped.
- Best for search, QA, and fine-grained retrieval.
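For illustration, sentence segmentation plus length filtering could look like the following. This is a hypothetical sketch using the pysbd library, not necessarily the segmenter used by this repository:

```python
# Hypothetical illustration of multilingual sentence segmentation with pysbd;
# the repository's actual segmenter may differ.
import pysbd

MIN_CHARS = 10  # mirrors EMBEDDING_MIN_CHAR_LENGTH from the sample config

segmenter = pysbd.Segmenter(language="fr", clean=False)
text = "Première phrase. Deuxième phrase, un peu plus longue. Ok."
sentences = [s.strip() for s in segmenter.segment(text)]
# Sentences below the minimum character length are skipped.
sentences = [s for s in sentences if len(s) >= MIN_CHARS]
print(sentences)
```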
**Chunk level** embeds long texts split into fixed-size chunks (see the sketch after this list).
- Prevents losing context when texts exceed the model's maximum input length.
- Produces stable embeddings for long documents.
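As an illustration of the idea, a fixed-size chunker over tokens might look like this. A minimal sketch, assuming token-based chunk boundaries; it is not the repository's actual implementation:

```python
# Minimal sketch of fixed-size chunking by tokens (hypothetical; the
# repository's actual chunking logic and parameters may differ).
from transformers import AutoTokenizer

def chunk_by_tokens(text: str, tokenizer, max_tokens: int = 512):
    """Split a long text into consecutive chunks of at most max_tokens tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-base")
# Each chunk is embedded independently, so no part of a long article is
# silently truncated away by the model's maximum input length.
chunks = chunk_by_tokens("A very long newspaper article ...", tokenizer)
```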
Below is an example config file you must customize:
```makefile
# config.local.mk (example)
$(info Make: Including config.local.mk: $(shell readlink -f config.local.mk))

BUILD_DIR := build.d

# Model
CREATOR_NAME := Alibaba-NLP
HF_MODEL_NAME := gte-multilingual-base
HF_MODEL_VERSION := f7d567e
HF_FULL_MODEL_NAME := $(CREATOR_NAME)/$(HF_MODEL_NAME)

# Embedding level: text | sentence | chunk
EMBEDDING_LEVEL_OPTION := chunk

# Storage (input)
IN_S3_PREFIXES := lingproc/lingproc-test-v1.0.0
IN_S3_BUCKET_REBUILT := 000-processing-test-samples

# Storage (output)
OUT_S3_BUCKET_PROCESSED_DATA := 140-processed-data-sandbox
OUT_S3_PROCESSED_INFIX := textembeddings-$(HF_MODEL_NAME)
OUT_S3_PROCESSED_VERSION := v1.0.1

# Text filtering
EMBEDDING_MIN_CHAR_LENGTH := 10

# Local model cache
HF_HOME := ./hf.d

# Parallel processing
MAKE_PARALLEL_OPTION := --jobs 2

# Optional: Filter newspapers
PROVIDER := SNL
NEWSPAPER := EXP

# Logging
LOGGING_LEVEL := WARNING
```
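Any of these make variables can also be overridden per invocation instead of editing the file, using standard make command-line assignments (variable names as in the sample config above), for example:

```
make newspaper EMBEDDING_LEVEL_OPTION=sentence LOGGING_LEVEL=INFO
```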
```mermaid
flowchart LR
    subgraph cluster_local ["Local Machine"]
        style cluster_local fill:#FFF3E0,stroke:#FF6F00,stroke-width:1px
        F{{"Text Embedding Processor"}}
        B[("Local Rebuilt Data")]
        E[("Text Embedding Model")]
        C[("Processed Output Data")]
    end
    subgraph cluster_s3 ["S3 Storage"]
        style cluster_s3 fill:#E0F7FA,stroke:#0097A7,stroke-width:1px
        A[/"Rebuilt Data"/]
        D[/"Processed Data"/]
    end
    A -->|Sync Input| B
    B -->|Data| F
    E -->|Model| F
    F -->|Output| C
    C -->|Upload Output| D
    D -->|Sync Output| C
```
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719; the second project (2023-2027) is funded by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2018-2024 The Impresso team.
Contributors to this program include: Simon Clematide
This program is provided as open source under the GNU Affero General Public License v3 or later.
