calico/shorkie-paper

Shorkie - Predicting dynamic expression patterns in budding yeast with a fungal DNA language model

Shorkie is a semi-supervised sequence-to-expression model for yeast: a masked DNA language model pretrained on hundreds of closely related fungal genomes and fine-tuned on thousands of epigenomic and transcriptomic profiles—including a large set of transcriptional-regulator induction RNA-seq experiments generated for this study—to predict RNA-seq coverage and variant effects.

This repository contains the shell scripts, notebooks, and command snippets used to reproduce the analyses in the Shorkie paper. These analyses invoke functionality from the baskerville-yeast and westminster repositories. Please visit those repositories for installation and environment setup instructions.

Contact drk (at) calicolabs.com, jlinder (at) calicolabs.com, or kuanhao.chao (at) gmail.com with questions.


Model Availability

The model weights can be downloaded as .h5 files from the URLs below. We are releasing both Shorkie LM, the DNA language model, and Shorkie, the model fine-tuned on thousands of epigenomic and transcriptomic profiles.


Training Data Availability

Shorkie LM

Shorkie LM was pretrained on the 165_Saccharomycetales corpus.
To support reproducibility and the Shorkie LM variants introduced in the paper, we also release three companion corpora—R64, 80_strains, and 1341_Fungus—each with raw genomes and matched TFRecords. These corpora span different phylogenetic distances and were used to train additional DNA language model variants.

  • R64: [genomes] gs://shorkie-paper/data/unsupervised/genome/R64/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/R64/

  • 80_strains: [genomes] gs://shorkie-paper/data/unsupervised/genome/80_strains/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/80_strains/

  • 165_Saccharomycetales: [genomes] gs://shorkie-paper/data/unsupervised/genome/165_Saccharomycetales/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/165_Saccharomycetales/

  • 1341_Fungus: [genomes] gs://shorkie-paper/data/unsupervised/genome/1341_Fungus/ | [tfrecord] gs://shorkie-paper/data/unsupervised/processed/1341_Fungus/

  • The training script is at model/shorkie_lm on GitHub.
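
The four corpora above share one bucket layout, so their locations can be generated rather than copied by hand. The following is a minimal sketch, not part of the repository: the helper names `corpus_uris` and `download_command` are hypothetical, and it assumes `gsutil` is installed and that the bucket is readable from your account, which is not verified here.

```python
# Bucket layout copied from the list above; the helper names are illustrative only.
BUCKET = "gs://shorkie-paper/data/unsupervised"
CORPORA = ("R64", "80_strains", "165_Saccharomycetales", "1341_Fungus")


def corpus_uris(name: str) -> dict:
    """Return the raw-genome and TFRecord URIs for one released corpus."""
    if name not in CORPORA:
        raise ValueError(f"unknown corpus {name!r}; expected one of {CORPORA}")
    return {
        "genomes": f"{BUCKET}/genome/{name}/",
        "tfrecord": f"{BUCKET}/processed/{name}/",
    }


def download_command(name: str, kind: str, dest: str) -> list:
    """Build (but do not run) a gsutil command that copies one corpus to `dest`."""
    return ["gsutil", "-m", "cp", "-r", corpus_uris(name)[kind], dest]


if __name__ == "__main__":
    # Print one copy command per corpus so they can be reviewed before running.
    for corpus in CORPORA:
        print(" ".join(download_command(corpus, "tfrecord", f"data/{corpus}/")))
```

Printing the commands instead of executing them keeps the sketch side-effect free; pass the list to `subprocess.run` once the destination paths look right.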

Shorkie

Shorkie was fine-tuned from the Shorkie LM using large-scale transcriptomic and epigenomic datasets from S. cerevisiae.

  • Induction Dynamics Gene Expression Atlas (IDEA): RNA-seq induction time-point samples. These are new datasets generated by Calico Life Sciences LLC and are related to IDEA 1.0 (Hackett, S.R. et al., Mol Syst Biol, 2020).

    • [Coverage tracks (BigWig)] gs://shorkie-paper/data/supervised/bigwigs/
    • [Processed TFRecords] gs://shorkie-paper/data/supervised/processed/
  • Yeast strain RNA-seq: RNA-seq datasets across diverse S. cerevisiae strains (Caudal, É. et al., Nat Genet, 2024).

  • ChIP-exo & ChIP-MNase: (Rossi, M.J. et al., Nature, 2021).

  • The training script is at model/shorkie on GitHub.


Benchmark Data Availability

This section lists external benchmark datasets used to evaluate Shorkie, along with their sources and primary references.

MPRA (Random Promoter DREAM Challenge)

  • Dataset: Random Promoter DREAM Challenge MPRA (held-out set; 71,103 promoter sequences spanning eight categories: native promoters, random 80-bp oligos, high-expression, low-expression, “challenging” sequences, SNV perturbations, motif perturbations, and motif tiling).
  • Primary reference: Rafi, A. M. et al. “A community effort to optimize sequence-based deep learning models of gene regulation.” Nat Biotechnol (2024).
  • Notes: We evaluated Shorkie by inserting MPRA constructs into genomic context upstream of TSSs (details in the paper).
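
As an illustration of what placing a construct into genomic context can look like, here is a minimal sketch. It is not the procedure from the paper: the function name `place_construct`, the `gap` parameter, and the choice to overwrite native bases (rather than shift them) are all assumptions, and only the plus strand is handled.

```python
def place_construct(window: str, tss_index: int, construct: str, gap: int = 0) -> str:
    """Overwrite the bases that end `gap` bp upstream of the TSS with `construct`.

    `window` is a plus-strand genomic sequence and `tss_index` is the 0-based
    position of the TSS within it. The window length is preserved, so the
    result can be fed to a model that expects a fixed input size.
    """
    end = tss_index - gap           # first base that stays native, downstream side
    start = end - len(construct)    # first base that gets overwritten
    if start < 0 or end > len(window):
        raise ValueError("construct does not fit upstream of the TSS in this window")
    return window[:start] + construct + window[end:]


if __name__ == "__main__":
    native = "A" * 20
    edited = place_construct(native, tss_index=15, construct="CGCG", gap=1)
    print(edited)
```

Overwriting keeps every downstream coordinate fixed, which makes it easier to compare predictions at the same gene positions across constructs; an insertion that shifts the sequence would need the coordinates to be remapped.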

cis-eQTL benchmarks

We evaluate Shorkie and compare to DREAM models on two independent yeast cis-eQTL resources:

  1. Caudal et al. pan-transcriptome

    • Data portal: 1002 Yeast Genomes project
    • Primary reference:
      • Caudal, É. et al. “Pan-transcriptome reveals a large accessory genome contribution to gene expression variation in yeast.” Nat Genet 56, 1278–1287 (2024).
      • Peter, J. et al. “Genome evolution across 1,011 Saccharomyces cerevisiae isolates.” Nature 556, 339–344 (2018).
    • Notes: We benchmarked 1,901 local cis-eQTLs from ~1,000 isolates; negative controls were noncoding SNPs matched by allele, TSS distance, and MAF.
  2. Kita et al. high-resolution eQTLs

    • Supplementary table
    • Primary reference: Kita, R. et al. “High-resolution mapping of cis-regulatory variation in budding yeast.” PNAS 114 (2017).
    • Notes: We benchmarked 683 variants, stratified into Promoter, UTR5, UTR3, and ORF categories.
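
The matched negative controls described for the Caudal et al. benchmark can be sketched as a greedy nearest-match search. This is a minimal illustration under stated assumptions, not the selection code used in the paper: the `Snp` record, the function names, and the tolerance values (`max_dist_diff`, `max_maf_diff`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Snp:
    snp_id: str
    ref: str
    alt: str
    tss_dist: int  # signed distance to the nearest TSS, in bp
    maf: float     # minor allele frequency


def pick_control(eqtl: Snp, pool: Iterable[Snp], used: Set[str],
                 max_dist_diff: int = 100, max_maf_diff: float = 0.05) -> Optional[Snp]:
    """Return the closest unused SNP with the same alleles, or None if none qualifies."""
    best, best_score = None, None
    for cand in pool:
        if cand.snp_id in used or (cand.ref, cand.alt) != (eqtl.ref, eqtl.alt):
            continue
        dist_diff = abs(cand.tss_dist - eqtl.tss_dist)
        maf_diff = abs(cand.maf - eqtl.maf)
        if dist_diff > max_dist_diff or maf_diff > max_maf_diff:
            continue
        # Scale both differences by their tolerance so neither dominates.
        score = dist_diff / max_dist_diff + maf_diff / max_maf_diff
        if best_score is None or score < best_score:
            best, best_score = cand, score
    return best


def match_controls(eqtls: Iterable[Snp], pool: Iterable[Snp], **tolerances) -> Dict[str, str]:
    """Greedily assign each eQTL one control; each control is used at most once."""
    pool = list(pool)
    used: Set[str] = set()
    pairs: Dict[str, str] = {}
    for eqtl in eqtls:
        control = pick_control(eqtl, pool, used, **tolerances)
        if control is not None:
            used.add(control.snp_id)
            pairs[eqtl.snp_id] = control.snp_id
    return pairs
```

A greedy pass is order-dependent, so an eQTL processed late may find its best control already taken; an optimal assignment would need a bipartite matching step instead.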

Minimal Example: Variant Effect Prediction with Shorkie

The minimal_example/ directory contains a self-contained script that demonstrates how to load Shorkie and compute a logSED (log₂ Sequence Effect Difference) score for a single SNP — no fine-tuning required.

Setup

  1. Download model weights (8 folds):

    mkdir -p my_shorkie/train
    for i in 0 1 2 3 4 5 6 7; do
      mkdir -p my_shorkie/train/f${i}c0/train
      wget -O my_shorkie/train/f${i}c0/train/model_best.h5 \
        https://storage.googleapis.com/seqnn-share/shorkie/f${i}/model_best.h5
    done
  2. Provide a yeast genome FASTA + GTF (e.g. S. cerevisiae R64).
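
Before running the script, it can help to confirm that all eight fold files landed where the download loop above puts them. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the check only tests that each file exists and is non-empty, not that it is a valid .h5 file.

```python
from pathlib import Path

N_FOLDS = 8
# URL pattern copied from the wget loop above.
URL_TEMPLATE = "https://storage.googleapis.com/seqnn-share/shorkie/f{fold}/model_best.h5"


def fold_path(model_dir: str, fold: int) -> Path:
    """Path where the download loop stores the weights for one fold."""
    return Path(model_dir) / "train" / f"f{fold}c0" / "train" / "model_best.h5"


def missing_folds(model_dir: str) -> list:
    """Return (fold, url) for every fold whose weight file is absent or empty."""
    missing = []
    for fold in range(N_FOLDS):
        path = fold_path(model_dir, fold)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append((fold, URL_TEMPLATE.format(fold=fold)))
    return missing


if __name__ == "__main__":
    for fold, url in missing_folds("my_shorkie"):
        print(f"fold {fold} is missing; re-download from {url}")
```

An empty file usually means an interrupted or failed `wget`, which is why the size check is included alongside the existence check.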

Run

python minimal_example/run_shorkie_variant.py \
  --model_dir  my_shorkie \
  --params_file  minimal_example/params.json \
  --targets_file minimal_example/sheet.txt \
  --fasta_file   /path/to/genome.fasta \
  --gtf_file     /path/to/genome.gtf \
  --chrom chrI --pos 124373 --ref T --alt C --gene YAL016C-B

Output

==================================================
  Variant  : chrI:124373 T>C
  Gene     : YAL016C-B
  logSED   : +0.0557
==================================================
  logSED > 0 → alt increases predicted expression
  logSED < 0 → alt decreases predicted expression

See minimal_example/README.md for full documentation.
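
To make the sign convention above concrete, here is a minimal sketch of how a logSED-style score can be computed from predicted coverage. It assumes the score is the log₂ ratio of summed predicted coverage over the gene's bins, with a pseudocount of 1, and that per-fold scores are averaged; the exact aggregation in run_shorkie_variant.py may differ, and the function names are hypothetical.

```python
import math
from typing import Sequence


def log_sed(ref_cov: Sequence[float], alt_cov: Sequence[float],
            pseudocount: float = 1.0) -> float:
    """log2(sum(alt) + pseudocount) - log2(sum(ref) + pseudocount).

    `ref_cov` and `alt_cov` are predicted coverage values over the bins that
    overlap the gene, for the reference and alternate allele respectively.
    """
    ref_sum = sum(ref_cov)
    alt_sum = sum(alt_cov)
    return math.log2(alt_sum + pseudocount) - math.log2(ref_sum + pseudocount)


def ensemble_log_sed(ref_by_fold: Sequence[Sequence[float]],
                     alt_by_fold: Sequence[Sequence[float]]) -> float:
    """Average the per-fold scores (assumed aggregation, one entry per model fold)."""
    scores = [log_sed(ref, alt) for ref, alt in zip(ref_by_fold, alt_by_fold)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy coverage values: the alt allele doubles (sum + 1), so the score is +1.
    print(log_sed([1.0, 2.0, 4.0], [3.0, 4.0, 8.0]))
```

The pseudocount keeps the score finite for genes with near-zero predicted coverage, at the cost of shrinking scores toward zero for lowly expressed genes.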
