TCR-Epitope Binding Prediction Pipeline

A reproducible bioinformatics pipeline for predicting T-cell receptor (TCR) specificity for peptide epitopes, using multiple sequence embedding strategies and machine learning models.

Overview

This pipeline addresses a core challenge in computational immunology: predicting which TCRs bind to specific peptide epitopes. It integrates data from three major public TCR databases, applies multiple embedding methods, and benchmarks several ML approaches including a Bayesian Neural Network for uncertainty quantification.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    TCR-Epitope Predict Pipeline                  │
├─────────┬────────────┬──────────┬────────────┬─────────────────┤
│  Data   │  Stitchr   │Embedding │ Prediction │   Evaluation    │
│Ingestion│ Assembly   │          │            │                 │
├─────────┼────────────┼──────────┼────────────┼─────────────────┤
│ VDJdb   │ V/J gene → │ BLOSUM62 │ Logistic   │ ROC / PR curves │
│ McPAS   │ Full-length│ ESM-2    │ Regression │ Model heatmap   │
│ IEDB    │ TCR seqs   │ k-mer    │ Random     │ BNN uncertainty │
│(train/  │            │ TCRdist3 │ Forest     │ Cross-DB valid. │
│ valid)  │            │          │ BNN (Pyro) │                 │
│         │            │  + UMAP  │ NetTCR-2.2 │                 │
└─────────┴────────────┴──────────┴────────────┴─────────────────┘

Key Design Decisions

Decision	Choice	Rationale
Workflow engine	Nextflow DSL2	Industry standard in EU biotech/pharma; nf-core compatible
Training data	VDJdb + McPAS	Similar curation, good epitope coverage
Validation data	IEDB (held-out)	Independently curated source; lowers the risk of train/validation leakage
Primary embedding	BLOSUM62	Outperforms PLMs per Briefings in Bioinformatics 2025
Comparison embedding	ESM-2	State-of-the-art PLM for protein sequences
Uncertainty	BNN via Pyro	Calibrated confidence for clinical relevance
CV strategy	Epitope-grouped	Prevents inflation per Nature Methods 2025
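
The epitope-grouped CV strategy listed in the table can be illustrated with a short sketch. This is a standard-library illustration under stated assumptions, not the code in cross_validation.py: the function name `epitope_grouped_folds`, the greedy balancing rule, and the toy records are all hypothetical. What it shows is the defining property of the strategy: every epitope is assigned to exactly one test fold, so no epitope appears on both the training and the test side of a split.

```python
from collections import Counter


def epitope_grouped_folds(epitopes, n_folds=3):
    """Yield (train_idx, test_idx) pairs where no epitope spans both sides.

    Hypothetical helper: epitopes are assigned greedily, largest first,
    to whichever fold currently holds the fewest rows.
    """
    counts = Counter(epitopes)
    fold_sizes = [0] * n_folds
    fold_of = {}
    # Sort by (-count, name) so the assignment is deterministic.
    for epitope, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of[epitope] = target
        fold_sizes[target] += n
    for fold in range(n_folds):
        test_idx = [i for i, e in enumerate(epitopes) if fold_of[e] == fold]
        train_idx = [i for i, e in enumerate(epitopes) if fold_of[e] != fold]
        yield train_idx, test_idx


# Toy example: one epitope label per TCR-epitope record.
epitopes = ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV", "GLCTLVAML",
            "NLVPMVATV", "GILGFVFTL", "ELAGIGILTV"]
folds = list(epitope_grouped_folds(epitopes, n_folds=3))
for train_idx, test_idx in folds:
    held_out = sorted({epitopes[i] for i in test_idx})
    print(held_out, "held out;", len(train_idx), "training rows")
```

In practice, scikit-learn's `GroupKFold` gives the same guarantee when the epitope sequence is passed as the `groups` argument.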

Quick Start

Using Docker (recommended)

# Pull and run the full pipeline
docker-compose up pipeline

# With GPU support for ESM-2 embedding
docker-compose up pipeline-gpu

Using Python

# Install
pip install -e .

# Run individual steps
tcr-predict download -o data/raw
tcr-predict clean -i data/raw -o data/processed
tcr-predict stitchr -i data/processed/paired_tcr_epitope.tsv
tcr-predict embed -i data/processed/tcr_full_length.tsv -o data/results/embeddings
tcr-predict predict -i data/results/embeddings -o data/results/predictions
tcr-predict evaluate -i data/results/predictions -o results/evaluation

# Or run everything at once
tcr-predict run-all

Using Nextflow

# With Docker
nextflow run main.nf -profile docker

# With Conda
nextflow run main.nf -profile conda

# Dry run (-preview evaluates the workflow script without executing any process)
nextflow run main.nf -profile test --skip_download -preview

Project Structure

tcr-epitope-predict/
├── main.nf                      # Nextflow pipeline entry point
├── nextflow.config              # Nextflow configuration
├── Dockerfile                   # Container definition
├── docker-compose.yml           # Multi-service orchestration
├── pyproject.toml               # Python package configuration
├── environment.yml              # Conda environment
│
├── workflows/                   # Nextflow sub-workflows
│   ├── data_acquisition.nf
│   ├── data_integration.nf
│   ├── stitchr_assembly.nf
│   ├── embedding.nf
│   ├── prediction.nf
│   └── evaluation.nf
│
├── src/tcr_epitope/             # Python package
│   ├── cli.py                   # Command-line interface
│   ├── data/
│   │   ├── downloaders.py       # Database downloaders
│   │   ├── cleaners.py          # Format-specific cleaners
│   │   ├── integrator.py        # Multi-DB integration
│   │   └── statistics.py        # Data visualization
│   ├── assembly/
│   │   └── stitchr_runner.py    # Full-length TCR reconstruction
│   ├── embedding/
│   │   ├── blosum.py            # BLOSUM62 encoding
│   │   ├── esm2.py              # ESM-2 protein LM
│   │   ├── kmer.py              # k-mer bag-of-words
│   │   └── tcrdist_features.py  # TCRdist3 features
│   ├── visualization/
│   │   ├── umap_plot.py         # UMAP + Plotly
│   │   └── evaluation_plots.py  # ROC, PR, heatmaps
│   ├── models/
│   │   ├── classical.py         # LR, Random Forest
│   │   ├── bnn.py               # Bayesian NN (Pyro)
│   │   └── nettcr_wrapper.py    # NetTCR-2.2 integration
│   └── evaluation/
│       ├── cross_validation.py  # Epitope-grouped CV
│       └── benchmark.py         # Multi-model comparison
│
├── notebooks/                   # Analysis notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_embedding_analysis.ipynb
│   ├── 03_model_training.ipynb
│   └── 04_results_comparison.ipynb
│
├── configs/default.yaml         # Pipeline configuration
├── tests/                       # Unit tests
└── .github/workflows/ci.yml     # GitHub Actions CI

Embedding Methods

Epitope Embeddings

BLOSUM62: Mean of the per-residue BLOSUM62 substitution-score rows, giving a 20-dimensional vector. The 2025 benchmarking study in Briefings in Bioinformatics reports that this encoding significantly outperforms transformer-based embeddings.
ESM-2: Meta AI's protein language model (6-layer, 8M-parameter checkpoint, esm2_t6_8M_UR50D). 320-dimensional mean-pooled hidden states. Included for comparison with handcrafted features.
k-mer BoW: TF-IDF weighted 3-mer bag-of-words. Lightweight computational baseline.
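
The k-mer bag-of-words baseline above can be sketched with the standard library alone. The weighting below is an assumption rather than the contents of kmer.py: it uses raw 3-mer counts as term frequency, the smoothed IDF that scikit-learn applies by default, and L2 normalisation per sequence. The function names and the toy peptides (the second is an artificial one-residue variant of the first) are hypothetical.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf_kmer_vectors(sequences, k=3):
    """Embed sequences as TF-IDF weighted k-mer count vectors.

    Assumed weighting: idf = ln((1 + N) / (1 + df)) + 1, raw k-mer
    counts as term frequency, and L2 normalisation of each row.
    """
    bags = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted({kmer for bag in bags for kmer in bag})
    n_docs = len(sequences)
    # Iterating a Counter yields each distinct k-mer once per sequence.
    doc_freq = Counter(kmer for bag in bags for kmer in bag)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for bag in bags:
        row = [bag[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocab, vectors


# Toy peptides; the second differs from the first only at the last residue.
peptides = ["GILGFVFTL", "GILGFVFTM", "NLVPMVATV"]
vocab, vectors = tfidf_kmer_vectors(peptides, k=3)
print(len(vocab), "distinct 3-mers")
```

A k-mer that occurs in only one peptide receives a higher weight than one shared across peptides, which is the intended effect of the IDF term. With scikit-learn available, `TfidfVectorizer(analyzer="char", ngram_range=(3, 3))` is the usual way to build the same kind of features.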

TCR Embeddings

BLOSUM62: Applied to CDR3 alpha and beta chains, concatenated.
ESM-2: Applied to full-length TCR sequences (after Stitchr reconstruction).
TCRdist3: Prototype-based features using the TCR-specific distance metric.
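
The BLOSUM62 encoding described in this section (mean of per-residue substitution rows, applied to each CDR3 chain and concatenated) can be sketched as follows. To stay self-contained, the example passes the substitution matrix in as a parameter and uses a small illustrative matrix over a three-letter alphabet instead of the full 20x20 BLOSUM62 table. The function names, the toy matrix, and the toy CDR3 strings are assumptions for illustration and are not taken from blosum.py.

```python
def mean_substitution_embedding(sequence, matrix, alphabet):
    """Average the substitution-matrix rows of each residue in a sequence.

    matrix[a][b] is the substitution score between residues a and b;
    the output has one dimension per letter in `alphabet`.
    """
    rows = [[matrix[res][col] for col in alphabet] for res in sequence]
    if not rows:
        raise ValueError("cannot embed an empty sequence")
    return [sum(column) / len(rows) for column in zip(*rows)]


def paired_chain_embedding(cdr3_alpha, cdr3_beta, matrix, alphabet):
    """Concatenate the alpha-chain and beta-chain embeddings."""
    return (mean_substitution_embedding(cdr3_alpha, matrix, alphabet)
            + mean_substitution_embedding(cdr3_beta, matrix, alphabet))


# Illustrative scores over a 3-letter alphabet, standing in for the
# full 20x20 BLOSUM62 matrix so the example needs no dependencies.
TOY_ALPHABET = "ACS"
TOY_MATRIX = {
    "A": {"A": 4, "C": 0, "S": 1},
    "C": {"A": 0, "C": 9, "S": -1},
    "S": {"A": 1, "C": -1, "S": 4},
}

# Toy CDR3 fragments (not real receptor sequences).
vector = paired_chain_embedding("CAS", "CASS", TOY_MATRIX, TOY_ALPHABET)
print(len(vector), vector)
```

With the real matrix restricted to the 20 standard amino acids, each chain yields 20 dimensions and the concatenated TCR vector has 40. If Biopython is installed, the matrix can be loaded with `Bio.Align.substitution_matrices.load("BLOSUM62")`.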

ML Models

Logistic Regression: Interpretable baseline with L2 regularization.
Random Forest: Ensemble method robust to feature noise.
Bayesian Neural Network: PyTorch + Pyro implementation with uncertainty quantification via posterior sampling.
NetTCR-2.2: External published model for comparison; the successor to NetTCR-2.0 (Montemurro et al., 2021).
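
To make "uncertainty quantification via posterior sampling" concrete, the sketch below summarises binding probabilities drawn under several sampled sets of network weights. It assumes the samples already exist (in Pyro they could come from `pyro.infer.Predictive`, for example) and only shows the summary step. The function name, the choice of summary statistics, and the probabilities are hypothetical and do not describe bnn.py.

```python
import math
import statistics


def summarize_posterior_samples(sample_probs):
    """Summarise binding probabilities drawn from S posterior samples.

    sample_probs[s][i] is the predicted binding probability for pair i
    under the s-th sampled set of network weights.
    Returns per-pair (mean, standard deviation, predictive entropy in nats).
    """
    summaries = []
    for per_pair in zip(*sample_probs):
        mean = statistics.fmean(per_pair)
        spread = statistics.pstdev(per_pair)
        # Binary entropy of the mean prediction; clipped to avoid log(0).
        p = min(max(mean, 1e-12), 1 - 1e-12)
        entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))
        summaries.append((mean, spread, entropy))
    return summaries


# Hypothetical probabilities from 4 posterior samples for 2 TCR-epitope pairs.
samples = [
    [0.90, 0.20],
    [0.92, 0.70],
    [0.88, 0.40],
    [0.90, 0.70],
]
summaries = summarize_posterior_samples(samples)
for mean, spread, entropy in summaries:
    print(f"mean={mean:.2f} std={spread:.3f} entropy={entropy:.3f}")
```

The first pair has a tight spread across samples, while the second varies widely and its mean sits near 0.5, so it would be flagged as a low-confidence prediction.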

References

Nature Methods (2025). "Assessment of computational methods in predicting TCR-epitope binding recognition." Benchmarked 50 models across 762 epitopes.
Briefings in Bioinformatics (2025). "A comprehensive benchmarking for evaluating TCR embeddings." BLOSUM62 outperforms transformer approaches.
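Heather JM et al. (2022). "Stitchr: stitching coding TCR nucleotide sequences from V/J/CDR3 information." Nucleic Acids Research.
Montemurro A et al. (2021). "NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data." Communications Biology.
Mayer-Blackwell K et al. (2021). "TCR meta-clonotypes for biomarker discovery with tcrdist3 enabled identification of public, HLA-restricted clusters of SARS-CoV-2 TCRs." eLife.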
Genome Biology (2025). "Empowering bioinformatics with Nextflow and nf-core."

License

MIT License. See LICENSE.

Author

Kuiting Tao (kuiting.tao@gmail.com)
