Annotate proteomes with ESMC (Biohub) protein-language-model embeddings and SAE interpretability features, then explore the proteome with UMAP.
Built for Octopus chierchiae (Baserow database 206 / table 1026,
Transcripts) but configuration-driven so additional proteomes are just another
YAML file.
- Fetch protein sequences + metadata (`gene_id`, `transcript_id`, `chromosome`, `start`, `end`, `strand`, `Ochierchiae_name`) from Baserow.
- Embed each sequence with ESMC (default `esmc-6b-2024-12`) via the Biohub API, storing one mean-pooled vector per protein.
- Persist the embedding back to Baserow (new columns) and a local parquet cache, so re-runs and analysis never re-spend Biohub tokens.
- SAE (opt-in): when `sae.models` is set, the top-K high-scoring SAE features per protein are computed in the same Biohub call as the embedding (no extra token cost) and written to the `sae` column. The published SAEs for `esmc-6b-2024-12` are `esmc-6b-2024-12-sae-layer60-k64-codebook16384` (16k features, used in the Atlas) and `...-codebook65536` (65k, finer). The standalone `sae` command remains for back-filling SAE onto proteins embedded before it was enabled.
- Explore the whole proteome with UMAP + interactive plots, colored by any metadata column.
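The two per-protein reductions named in the list above (mean pooling and top-K selection) can be sketched in plain Python. This is an illustration only: `mean_pool` and `top_k_features` are hypothetical names, not functions from this repository, and the sketch assumes the SAE activations have already been reduced to one value per feature for the protein (how the API aggregates over residues is not stated here).

```python
from typing import Dict, List, Sequence


def mean_pool(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length vector per protein."""
    if not per_residue:
        raise ValueError("cannot pool an empty sequence")
    dim = len(per_residue[0])
    totals = [0.0] * dim
    for vec in per_residue:
        if len(vec) != dim:
            raise ValueError("all residue vectors must share one dimension")
        for i, value in enumerate(vec):
            totals[i] += value
    return [t / len(per_residue) for t in totals]


def top_k_features(activations: Sequence[float], k: int) -> Dict[str, list]:
    """Keep the k largest activations, in an {indices, activations} shape."""
    ranked = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )[:k]
    return {"indices": ranked, "activations": [activations[i] for i in ranked]}


if __name__ == "__main__":
    # Three residues, two embedding dimensions (toy values).
    print(mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
    # Four SAE features, keep the two strongest.
    print(top_k_features([0.1, 0.9, 0.0, 0.5], k=2))
```

Mean pooling makes every protein the same width regardless of length, which is what lets the vectors be stored in one column and fed to UMAP directly.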
Biohub calls are the limiting resource, so the pipeline:
- only embeds proteins missing an embedding (checks both cache and Baserow);
- checkpoints to cache + Baserow every chunk (`run.batch_size`, default 32), so an interrupted or stalled run resumes for free and loses at most one chunk;
- runs requests concurrently but bounded (`run.max_workers`, default 16) with a tight per-request timeout (`esmc.request_timeout`), so a few slow/stuck requests can't freeze the run;
- offers `--dry-run` to report exactly how many proteins a real run would embed;
- offers `--limit N` to cap a first trial run, and `--no-sae` for a cheaper embed-only pass.
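A minimal sketch of the skip / chunk / checkpoint pattern described in the list above, assuming a caller-supplied `embed_one` that enforces its own request timeout and raises on failure. All names here (`count_missing`, `embed_missing`, `embed_one`, `checkpoint`) are hypothetical and do not reflect the actual `pipeline.py` code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional


def count_missing(sequences: Dict[str, str], done: Dict[str, List[float]]) -> int:
    """What a dry run would report: proteins with no stored embedding."""
    return sum(1 for pid in sequences if pid not in done)


def _chunks(items: List[str], size: int) -> Iterable[List[str]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_missing(
    sequences: Dict[str, str],
    done: Dict[str, List[float]],
    embed_one: Callable[[str], List[float]],  # assumed to time out and raise itself
    checkpoint: Callable[[Dict[str, List[float]]], None],
    batch_size: int = 32,
    max_workers: int = 16,
    limit: Optional[int] = None,
) -> int:
    """Embed only what is missing, saving after every chunk."""
    todo = [pid for pid in sequences if pid not in done][:limit]
    embedded = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in _chunks(todo, batch_size):
            futures = {pid: pool.submit(embed_one, sequences[pid]) for pid in chunk}
            results: Dict[str, List[float]] = {}
            for pid, future in futures.items():
                try:
                    results[pid] = future.result()
                except Exception:
                    # Failed or timed-out request: leave it for the next run.
                    continue
            done.update(results)
            checkpoint(results)  # at most one chunk of work is ever unsaved
            embedded += len(results)
    return embedded
```

Because finished work is recorded per chunk and the to-do list is recomputed from `done` on every call, re-running after an interruption costs nothing for proteins already embedded.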
mamba env create -f environment.yml # or: conda env create -f environment.yml
conda activate och_annotate

Provide credentials (never committed) via the environment or a local `.env` (copy from `.env.example`):

export BASEROW_TOKEN=... # read + create-field/edit-row rights on table 1026
export BIOHUB_API_TOKEN=... # ESMC API token (ESM_API_KEY also accepted)

Network note (Claude Code on the web): the default web environment's network policy only allows pypi/npm/github. To run the live pipeline from a web session, recreate the environment with a custom egress allowlist that adds `baserow.gofflab.org` and `biohub.ai`. Locally / on HPC there is no such restriction.
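For the credential variables described above, token lookup with a fallback name can be sketched as follows. `resolve_token` is a hypothetical helper, not the code in `config.py`, and loading a `.env` file is deliberately left out.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_token(
    names: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the first non-empty variable among `names`, in order."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("missing credential: set one of " + ", ".join(names))


if __name__ == "__main__":
    # Toy environment: only the fallback name is set.
    fake_env = {"ESM_API_KEY": "example-token"}
    print(resolve_token(("BIOHUB_API_TOKEN", "ESM_API_KEY"), fake_env))
```

Failing fast with the accepted variable names in the message is usually friendlier than an authentication error from the first API call.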
# Preview workload without spending any Biohub tokens
och-annotate embed -c config/octopus_chierchiae.yaml --dry-run
# Embed (resumable, writes to Baserow + local cache)
och-annotate embed -c config/octopus_chierchiae.yaml
# Try a handful first
och-annotate embed -c config/octopus_chierchiae.yaml --limit 10
# Embed-only variant (cheaper): skip SAE even if it's configured, then explore,
# then back-fill SAE later when the Biohub budget allows.
och-annotate embed -c config/octopus_chierchiae.yaml --no-sae
och-annotate umap -c config/octopus_chierchiae.yaml --color chromosome --out umap.html
och-annotate sae -c config/octopus_chierchiae.yaml # back-fill SAE onto embedded proteins
# Override SAE / concurrency settings per run (no need to edit the YAML)
och-annotate embed -c config/octopus_chierchiae.yaml --top-k 128
och-annotate embed -c config/octopus_chierchiae.yaml --max-workers 8
och-annotate embed -c config/octopus_chierchiae.yaml \
--sae-model esmc-6b-2024-12-sae-layer60-k64-codebook65536 --top-k 100
# Top-K SAE features (after setting sae.models in the config)
och-annotate sae -c config/octopus_chierchiae.yaml
# UMAP exploration -> interactive HTML
och-annotate umap -c config/octopus_chierchiae.yaml --color chromosome --out umap.html
# Color the UMAP by a single SAE feature's activation (0 where not in a protein's top-K)
och-annotate umap -c config/octopus_chierchiae.yaml --sae-feature 8523 --out feat8523.html
# Inspect Baserow columns
och-annotate fields -c config/octopus_chierchiae.yaml

Or use `notebooks/explore_proteome.ipynb` for interactive exploration, including Leiden clustering + SAE-feature enrichment (proteins as "cells", SAE features as "genes", following the scanpy marker-gene workflow):

pip install -e ".[cluster]" # scanpy, leidenalg, igraph

Cluster on the ESMC embedding kNN graph (`build_anndata` → `sc.pp.neighbors(use_rep="X_esmc")` → `sc.tl.leiden`), then `sae_enrichment(adata)` ranks the SAE features enriched in each cluster (Wilcoxon, FDR-corrected). Helpers live in `analysis.py` (`build_anndata`, `sae_feature_matrix`, `sae_enrichment`).
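To make the "SAE features as genes" framing concrete, the sketch below expands per-protein top-K records into a proteins × features matrix, with 0 wherever a feature is not in a protein's top-K. It is an assumption-laden illustration: `dense_feature_matrix` is a hypothetical name (not the repository's `sae_feature_matrix`), and a real run over a 16k-feature codebook would more plausibly use a sparse matrix.

```python
from typing import Dict, List, Tuple


def dense_feature_matrix(
    records: Dict[str, Dict[str, list]], n_features: int
) -> Tuple[List[str], List[List[float]]]:
    """Expand {protein_id: {indices, activations}} into rows of length n_features."""
    protein_ids = sorted(records)
    matrix: List[List[float]] = []
    for pid in protein_ids:
        row = [0.0] * n_features  # features outside the top-K stay at 0
        record = records[pid]
        for index, activation in zip(record["indices"], record["activations"]):
            row[index] = float(activation)
        matrix.append(row)
    return protein_ids, matrix


if __name__ == "__main__":
    toy = {
        "prot_b": {"indices": [0, 3], "activations": [0.8, 0.2]},
        "prot_a": {"indices": [2], "activations": [1.5]},
    }
    ids, rows = dense_feature_matrix(toy, n_features=4)
    for pid, row in zip(ids, rows):
        print(pid, row)
```

A matrix in this shape is what a per-cluster rank test would operate on: one row per protein, one column per SAE feature.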
config/ # one YAML per proteome (Baserow ids, model, SAE, run opts)
src/och_annotate/
  config.py # typed config + env-token resolution
  baserow.py # Baserow REST client (read rows, ensure columns, write back)
  esmc.py # ESMC SDK wrapper: mean-pool embeddings + top-K SAE features
  cache.py # parquet embedding cache (idempotency)
  pipeline.py # fetch -> embed -> Baserow + cache (resumable, frugal)
  sae.py # separate SAE feature-extraction step
  analysis.py # load embeddings -> UMAP -> interactive plot
  cli.py # `och-annotate` command-line interface
notebooks/explore_proteome.ipynb
tests/ # mocked unit tests (no live API / no torch needed)
Biohub bills `logits` at `seq_len` tokens (1 credit = 10,000 tokens); `encode` is
free. An embedding and a combined embedding+SAE call cost the same (one
`logits` pass), so always run them together; splitting them doubles the cost.
For the O. chierchiae proteome (~35.8k proteins, mean 463 aa) the whole embed+SAE job is ~1,657 credits, one-time. On a 100-credit/day account that's ~17 days of daily resumes (~2,100 proteins/day); with a higher daily limit it's hours.
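The cost figures above follow from simple arithmetic on the stated billing rule (one token per residue, 10,000 tokens per credit). The helper names below are hypothetical; only the rate and the proteome numbers come from this README.

```python
import math

TOKENS_PER_CREDIT = 10_000  # billing rate stated above


def estimate_credits(n_proteins: int, mean_length: float) -> float:
    """Total credits if every protein is billed at its sequence length."""
    return n_proteins * mean_length / TOKENS_PER_CREDIT


def daily_runs_needed(total_credits: float, credits_per_day: float) -> int:
    """Number of daily runs, rounding the last partial day up."""
    return math.ceil(total_credits / credits_per_day)


def proteins_per_day(credits_per_day: float, mean_length: float) -> int:
    """How many average-length proteins fit in one day's budget."""
    return int(credits_per_day * TOKENS_PER_CREDIT // mean_length)


if __name__ == "__main__":
    total = estimate_credits(35_800, 463)
    print(f"total credits: {total:.0f}")
    print(f"daily runs at 100 credits/day: {daily_runs_needed(total, 100)}")
    print(f"proteins per day: {proteins_per_day(100, 463)}")
```

Substituting another proteome's protein count and mean length gives a quick budget check before the first real run.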
The pipeline is built for this: each run resumes (skipping anything already done
in Baserow) and stops early the moment the daily credit limit is hit, so a
daily run is fast and safe. `.github/workflows/daily-embed.yml` runs it on a
daily cron; add `BASEROW_TOKEN` and `BIOHUB_API_TOKEN` as repository secrets,
and adjust the cron time to land just after your quota resets. Trigger it
manually (Actions → Daily ESMC embed → Run workflow) right after a limit bump
to drain the backlog at once.
Copy `config/octopus_chierchiae.yaml`, change `name`, the Baserow
`database_id`/`table_id`, the column names, and (optionally) the model. Then run
the same commands against the new config.
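As a rough picture of what a per-proteome config carries, here is a typed sketch built from the settings this README names. The nesting is an assumption: `sae.models`, `run.batch_size`, and `run.max_workers` appear in the text as dotted keys, but the `baserow` and `esmc.model` key names and the classes themselves are hypothetical and may not match `config.py` or the real YAML layout.

```python
from dataclasses import dataclass, field
from typing import Any, List, Mapping


@dataclass
class RunOptions:
    batch_size: int = 32   # default stated in this README
    max_workers: int = 16  # default stated in this README


@dataclass
class ProteomeConfig:
    name: str
    database_id: int
    table_id: int
    model: str = "esmc-6b-2024-12"
    sae_models: List[str] = field(default_factory=list)  # empty list = SAE off
    run: RunOptions = field(default_factory=RunOptions)


def config_from_mapping(raw: Mapping[str, Any]) -> ProteomeConfig:
    """Build a config from an already-parsed YAML mapping (layout assumed)."""
    baserow = raw["baserow"]  # assumed section name
    run = raw.get("run", {})
    return ProteomeConfig(
        name=raw["name"],
        database_id=int(baserow["database_id"]),
        table_id=int(baserow["table_id"]),
        model=raw.get("esmc", {}).get("model", "esmc-6b-2024-12"),
        sae_models=list(raw.get("sae", {}).get("models", [])),
        run=RunOptions(
            batch_size=int(run.get("batch_size", 32)),
            max_workers=int(run.get("max_workers", 16)),
        ),
    )


if __name__ == "__main__":
    cfg = config_from_mapping(
        {"name": "octopus_chierchiae",
         "baserow": {"database_id": 206, "table_id": 1026}}
    )
    print(cfg)
```

Required fields fail loudly with a `KeyError` when absent, while everything with a documented default can be omitted from a new proteome's file.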
| column | type | contents |
|---|---|---|
| `esmc_embedding` | long text | JSON `list[float]` mean-pooled vector |
| `esmc_model` | long text | model id that produced the embedding |
| `esmc_embedded_at` | long text | ISO-8601 UTC timestamp |
| `sae_top_features` | long text | JSON `{sae_model: {indices, activations}}` |
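Since all four columns are stored as long text, a consumer has to decode them before use. The sketch below shows one way to do that for a single row, assuming the row arrives as a plain mapping keyed by column name; `decode_row` is a hypothetical helper and not part of `baserow.py`.

```python
import json
from datetime import datetime
from typing import Any, Dict, Mapping


def decode_row(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Turn the text columns of one Baserow row into Python values."""
    raw_embedding = row.get("esmc_embedding")
    raw_sae = row.get("sae_top_features")
    raw_time = row.get("esmc_embedded_at")
    return {
        # None marks a protein that has not been embedded yet.
        "embedding": json.loads(raw_embedding) if raw_embedding else None,
        "model": row.get("esmc_model") or None,
        # Older Python versions reject a trailing "Z", so normalize it first.
        "embedded_at": (
            datetime.fromisoformat(raw_time.replace("Z", "+00:00"))
            if raw_time
            else None
        ),
        "sae": json.loads(raw_sae) if raw_sae else {},
    }


if __name__ == "__main__":
    example = {
        "esmc_embedding": "[0.5, -1.0]",
        "esmc_model": "esmc-6b-2024-12",
        "esmc_embedded_at": "2025-01-02T03:04:05Z",  # made-up timestamp
        "sae_top_features": '{"some-sae": {"indices": [3], "activations": [0.7]}}',
    }
    print(decode_row(example))
```

Treating an empty `esmc_embedding` as "not embedded" is also the natural check for deciding which rows still need work.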
pip install -e ".[dev]"
pytest # unit tests run fully offline (esm/torch mocked)