Shorkie is a semi-supervised sequence-to-expression model for yeast: a masked DNA language model pretrained on hundreds of closely related fungal genomes and fine-tuned on thousands of epigenomic and transcriptomic profiles—including a large set of transcriptional-regulator induction RNA-seq experiments generated for this study—to predict RNA-seq coverage and variant effects.
This repository contains shell scripts, notebooks, and command snippets used to reproduce the analyses in the Shorkie paper. These analyses invoke functionality from the baskerville-yeast and westminster repositories. Please visit those repositories for installation and environment setup instructions.
Contact drk (at) calicolabs.com, jlinder (at) calicolabs.com, or kuanhao.chao (at) gmail.com with questions.
The model weights can be downloaded as .h5 files from the URLs below. We are releasing both Shorkie LM, the DNA language model, and Shorkie, which is fine-tuned on thousands of epigenomic and transcriptomic profiles.
Shorkie LM was pretrained on the 165_Saccharomycetales corpus.
To support reproducibility and the Shorkie LM variants introduced in the paper, we also release three companion corpora—R64, 80_strains, and 1341_Fungus—each with raw genomes and matched TFRecords. These corpora span different phylogenetic distances and were used to train additional DNA language model variants.
- R64: [genomes] gs://shorkie-paper/data/unsupervised/genome/R64/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/R64/
- 80_strains: [genomes] gs://shorkie-paper/data/unsupervised/genome/80_strains/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/80_strains/
- 165_Saccharomycetales: [genomes] gs://shorkie-paper/data/unsupervised/genome/165_Saccharomycetales/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/165_Saccharomycetales/
- 1341_Fungus: [genomes] gs://shorkie-paper/data/unsupervised/genome/1341_Fungus/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/1341_Fungus/
The training script is at model/shorkie_lm on GitHub.
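
The corpus locations listed above follow one pattern, so they can be generated rather than copied by hand. The sketch below is illustrative only: `corpus_paths` and `download_command` are hypothetical helper names, and it assumes `gsutil` is installed and that the bucket is readable from your account, neither of which is confirmed here.

```python
# Assumption: the bucket layout matches the paths listed in this README.
BUCKET = "gs://shorkie-paper/data/unsupervised"
CORPORA = ["R64", "80_strains", "165_Saccharomycetales", "1341_Fungus"]


def corpus_paths(name):
    """Return the genome and TFRecord prefixes for one corpus."""
    if name not in CORPORA:
        raise ValueError(f"unknown corpus: {name}")
    return {
        "genomes": f"{BUCKET}/genome/{name}/",
        "tfrecord": f"{BUCKET}/processed/{name}/",
    }


def download_command(name, kind, dest):
    """Build (but do not run) a recursive, parallel gsutil copy command."""
    src = corpus_paths(name)[kind]
    return ["gsutil", "-m", "cp", "-r", src, dest]


if __name__ == "__main__":
    # Print one command per corpus; pass a list to subprocess.run to execute it.
    for corpus in CORPORA:
        print(" ".join(download_command(corpus, "tfrecord", "./data/")))
```

The commands are returned as argument lists so they can be inspected before anything is downloaded; the larger corpora may be sizeable.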
Shorkie was fine-tuned from the Shorkie LM using large-scale transcriptomic and epigenomic datasets from S. cerevisiae.
- Induction Dynamics Gene Expression Atlas (IDEA): RNA-seq induction time-point samples. New datasets generated by Calico Life Sciences LLC (related to IDEA 1.0; Hackett, S.R. et al., Mol Syst Biol, 2020).
  - [Coverage tracks (BigWig)] gs://shorkie-paper/data/supervised/bigwigs/
  - [Processed TFRecords] gs://shorkie-paper/data/supervised/processed/
- Yeast strain RNA-seq: RNA-seq datasets across diverse S. cerevisiae strains (Caudal, É. et al., Nat Genet, 2024).
- ChIP-exo & ChIP-MNase: (Rossi, M.J. et al., Nature, 2021).
The training script is at model/shorkie on GitHub.
This section lists external benchmark datasets used to evaluate Shorkie, along with their sources and primary references.
- Dataset: Random Promoter DREAM Challenge MPRA (held-out set; 71,103 promoter sequences spanning eight categories: native promoters, random 80-bp oligos, high-expression, low-expression, “challenging” sequences, SNV perturbations, motif perturbations, and motif tiling).
- Primary reference: Rafi, A. M. et al. “A community effort to optimize sequence-based deep learning models of gene regulation.” Nat Biotechnol (2024).
- Notes: We evaluated Shorkie by placing MPRA constructs into genomic context upstream of TSSs (details in the paper).
We evaluated Shorkie and compared it to DREAM models on two independent yeast cis-eQTL resources:
- Caudal et al. pan-transcriptome
  - Data portal: 1002 Yeast Genomes project
  - Primary reference:
    - Caudal, É. et al. “Pan-transcriptome reveals a large accessory genome contribution to gene expression variation in yeast.” Nat Genet 56, 1278–1287 (2024).
    - Peter, J. et al. “Genome evolution across 1,011 Saccharomyces cerevisiae isolates.” Nature 556, 339–344 (2018).
  - Notes: We benchmarked 1,901 local cis-eQTLs from ~1,000 isolates; negative controls were noncoding SNPs matched by allele, TSS distance, and MAF.
- Kita et al. high-resolution eQTLs
  - Supplementary table
  - Primary reference: Kita, R. et al. “High-resolution mapping of cis-regulatory variation in budding yeast.” PNAS 114 (2017).
  - Notes: We benchmarked 683 variants, stratified into Promoter, UTR5, UTR3, and ORF categories.
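
The Caudal et al. notes mention negative controls matched by allele, TSS distance, and MAF. The exact procedure is described in the paper, not here, so the following is only a sketch of how such matching could work. The `Snp` record, the `pick_matched_control` function, and both tolerance values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snp:
    snp_id: str
    ref: str
    alt: str
    tss_distance: int  # bp to the nearest TSS
    maf: float         # minor allele frequency


def pick_matched_control(
    eqtl: Snp,
    candidates: List[Snp],
    max_dist_diff: int = 100,    # assumed tolerance, not from the paper
    max_maf_diff: float = 0.05,  # assumed tolerance, not from the paper
) -> Optional[Snp]:
    """Return the closest candidate sharing the eQTL's alleles, or None."""
    best, best_cost = None, None
    for cand in candidates:
        # Alleles must match exactly (same ref and alt bases).
        if (cand.ref, cand.alt) != (eqtl.ref, eqtl.alt):
            continue
        dist_diff = abs(cand.tss_distance - eqtl.tss_distance)
        maf_diff = abs(cand.maf - eqtl.maf)
        if dist_diff > max_dist_diff or maf_diff > max_maf_diff:
            continue
        # Combine both differences on a common 0..1 scale.
        cost = dist_diff / max_dist_diff + maf_diff / max_maf_diff
        if best_cost is None or cost < best_cost:
            best, best_cost = cand, cost
    return best
```

The candidate pool is assumed to contain only noncoding SNPs that are not themselves eQTLs; that filtering is left out of the sketch.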
The minimal_example/ directory contains a self-contained
script that demonstrates how to load Shorkie and compute a logSED (log₂ Sequence
Effect Difference) score for a single SNP — no fine-tuning required.
- Download model weights (8 folds):

      mkdir -p my_shorkie/train
      for i in 0 1 2 3 4 5 6 7; do
        mkdir -p my_shorkie/train/f${i}c0/train
        wget -O my_shorkie/train/f${i}c0/train/model_best.h5 \
          https://storage.googleapis.com/seqnn-share/shorkie/f${i}/model_best.h5
      done
- Provide a yeast genome FASTA + GTF (e.g. S. cerevisiae R64).
python minimal_example/run_shorkie_variant.py \
--model_dir my_shorkie \
--params_file minimal_example/params.json \
--targets_file minimal_example/sheet.txt \
--fasta_file /path/to/genome.fasta \
--gtf_file /path/to/genome.gtf \
--chrom chrI --pos 124373 --ref T --alt C --gene YAL016C-B

Example output:

==================================================
Variant : chrI:124373 T>C
Gene : YAL016C-B
logSED : +0.0557
==================================================
logSED > 0 → alt increases predicted expression
logSED < 0 → alt decreases predicted expression
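
To make the sign convention concrete, here is a small worked sketch of a log2 ratio score. It assumes the score compares summed predicted coverage over a gene between the alt and ref sequences, with a pseudocount of 1, and that per-fold scores are averaged. This README does not spell out the formula, so treat `log_sed` and `mean_over_folds` as hypothetical illustrations rather than the script's implementation.

```python
import math


def log_sed(ref_coverage, alt_coverage, pseudocount=1.0):
    """log2 ratio of summed alt vs. ref coverage (assumed definition)."""
    ref_sum = sum(ref_coverage)
    alt_sum = sum(alt_coverage)
    # The pseudocount keeps the ratio finite when a sum is zero.
    return math.log2(alt_sum + pseudocount) - math.log2(ref_sum + pseudocount)


def mean_over_folds(scores):
    """Average per-fold scores into a single value (assumed aggregation)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy coverage values: alt sums to 7, ref sums to 3.
    score = log_sed([1.0, 2.0], [2.0, 5.0])
    print(f"logSED : {score:+.4f}")
```

With these toy values the alt sum exceeds the ref sum, so the score is positive, matching the reading that a positive logSED means the alt allele increases predicted expression.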
See minimal_example/README.md for full documentation.
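
Before running the script, it can help to confirm that all eight fold checkpoints sit where the download step places them. The sketch below mirrors the directory layout and URL pattern from the download commands above. `fold_targets` and `missing_folds` are hypothetical helper names, and the check only tests that files exist, not that they are valid.

```python
from pathlib import Path

# URL pattern and fold count taken from the download step in this README.
BASE_URL = "https://storage.googleapis.com/seqnn-share/shorkie"
N_FOLDS = 8


def fold_targets(model_dir, n_folds=N_FOLDS):
    """Return (url, destination path) pairs, one per fold."""
    targets = []
    for i in range(n_folds):
        url = f"{BASE_URL}/f{i}/model_best.h5"
        dest = Path(model_dir) / "train" / f"f{i}c0" / "train" / "model_best.h5"
        targets.append((url, dest))
    return targets


def missing_folds(model_dir, n_folds=N_FOLDS):
    """List expected checkpoint paths that are not present on disk."""
    return [dest for _, dest in fold_targets(model_dir, n_folds) if not dest.is_file()]


if __name__ == "__main__":
    for path in missing_folds("my_shorkie"):
        print(f"missing: {path}")
```

An empty result from `missing_folds("my_shorkie")` suggests the directory passed as `--model_dir` has the expected layout.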
