A reproducible bioinformatics pipeline for predicting T-cell receptor (TCR) specificity to peptide epitopes, using multiple sequence embedding strategies and machine learning models.
This pipeline addresses a core challenge in computational immunology: predicting which TCRs bind to specific peptide epitopes. It integrates data from three major public TCR databases, applies multiple embedding methods, and benchmarks several ML approaches including a Bayesian Neural Network for uncertainty quantification.
┌────────────────────────────────────────────────────────────────┐
│                  TCR-Epitope Predict Pipeline                  │
├─────────┬────────────┬──────────┬────────────┬─────────────────┤
│ Data    │ StitchR    │Embedding │ Prediction │ Evaluation      │
│Ingestion│ Assembly   │          │            │                 │
├─────────┼────────────┼──────────┼────────────┼─────────────────┤
│ VDJdb   │ V/J gene → │ BLOSUM62 │ Logistic   │ ROC / PR curves │
│ McPAS   │ Full-length│ ESM-2    │ Regression │ Model heatmap   │
│ IEDB    │ TCR seqs   │ k-mer    │ Random     │ BNN uncertainty │
│(train/  │            │ TCRdist3 │ Forest     │ Cross-DB valid. │
│ valid)  │            │          │ BNN (Pyro) │                 │
│         │            │ + UMAP   │ NetTCR-2.2 │                 │
└─────────┴────────────┴──────────┴────────────┴─────────────────┘
| Decision | Choice | Rationale |
|---|---|---|
| Workflow engine | Nextflow DSL2 | Industry standard in EU biotech/pharma; nf-core compatible |
| Training data | VDJdb + McPAS | Similar curation, good epitope coverage |
| Validation data | IEDB (held-out) | Independent curation prevents data leakage |
| Primary embedding | BLOSUM62 | Outperforms PLMs per Briefings in Bioinformatics 2025 |
| Comparison embedding | ESM-2 | State-of-the-art PLM for protein sequences |
| Uncertainty | BNN via Pyro | Calibrated confidence for clinical relevance |
| CV strategy | Epitope-grouped | Prevents inflation per Nature Methods 2025 |
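
The epitope-grouped CV strategy in the table keeps every TCR paired with a given epitope in the same fold, so a model is never scored on an epitope it saw during training. The idea can be sketched with the standard library alone. The function name `epitope_grouped_folds`, the greedy balancing rule, and the toy labels are assumptions for illustration, not the code in `cross_validation.py`.

```python
from collections import defaultdict


def epitope_grouped_folds(epitopes, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which no epitope spans both sides.

    Hypothetical helper for illustration: whole epitope groups are assigned
    greedily (largest first) to whichever fold currently holds the fewest samples.
    """
    groups = defaultdict(list)
    for idx, epitope in enumerate(epitopes):
        groups[epitope].append(idx)
    if len(groups) < n_splits:
        raise ValueError("need at least as many distinct epitopes as folds")

    folds = [[] for _ in range(n_splits)]
    # Sort by group size (descending), then by name, so the result is deterministic.
    for epitope in sorted(groups, key=lambda e: (-len(groups[e]), e)):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(groups[epitope])

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, test_idx


# Example input: one epitope label per TCR-epitope pair.
epitopes = ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV", "GLCTLVAML",
            "NLVPMVATV", "GILGFVFTL", "ELAGIGILTV"]
splits = list(epitope_grouped_folds(epitopes, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A plain random split would scatter the three `GILGFVFTL` pairs across folds and let the model recognize the epitope rather than generalize to unseen ones, which is the score inflation the grouped split is meant to avoid.
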
# Pull and run the full pipeline
docker-compose up pipeline
# With GPU support for ESM-2 embedding
docker-compose up pipeline-gpu
# Install
pip install -e .
# Run individual steps
tcr-predict download -o data/raw
tcr-predict clean -i data/raw -o data/processed
tcr-predict stitchr -i data/processed/paired_tcr_epitope.tsv
tcr-predict embed -i data/processed/tcr_full_length.tsv -o data/results/embeddings
tcr-predict predict -i data/results/embeddings -o data/results/predictions
tcr-predict evaluate -i data/results/predictions -o results/evaluation
# Or run everything at once
tcr-predict run-all
# With Docker
nextflow run main.nf -profile docker
# With Conda
nextflow run main.nf -profile conda
# Dry run
nextflow run main.nf -profile test --skip_download -dry-run
tcr-epitope-predict/
├── main.nf # Nextflow pipeline entry point
├── nextflow.config # Nextflow configuration
├── Dockerfile # Container definition
├── docker-compose.yml # Multi-service orchestration
├── pyproject.toml # Python package configuration
├── environment.yml # Conda environment
│
├── workflows/ # Nextflow sub-workflows
│ ├── data_acquisition.nf
│ ├── data_integration.nf
│ ├── stitchr_assembly.nf
│ ├── embedding.nf
│ ├── prediction.nf
│ └── evaluation.nf
│
├── src/tcr_epitope/ # Python package
│ ├── cli.py # Command-line interface
│ ├── data/
│ │ ├── downloaders.py # Database downloaders
│ │ ├── cleaners.py # Format-specific cleaners
│ │ ├── integrator.py # Multi-DB integration
│ │ └── statistics.py # Data visualization
│ ├── assembly/
│ │ └── stitchr_runner.py # Full-length TCR reconstruction
│ ├── embedding/
│ │ ├── blosum.py # BLOSUM62 encoding
│ │ ├── esm2.py # ESM-2 protein LM
│ │ ├── kmer.py # k-mer bag-of-words
│ │ └── tcrdist_features.py # TCRdist3 features
│ ├── visualization/
│ │ ├── umap_plot.py # UMAP + Plotly
│ │ └── evaluation_plots.py # ROC, PR, heatmaps
│ ├── models/
│ │ ├── classical.py # LR, Random Forest
│ │ ├── bnn.py # Bayesian NN (Pyro)
│ │ └── nettcr_wrapper.py # NetTCR-2.2 integration
│ └── evaluation/
│ ├── cross_validation.py # Epitope-grouped CV
│ └── benchmark.py # Multi-model comparison
│
├── notebooks/ # Analysis notebooks
│ ├── 01_data_exploration.ipynb
│ ├── 02_embedding_analysis.ipynb
│ ├── 03_model_training.ipynb
│ └── 04_results_comparison.ipynb
│
├── configs/default.yaml # Pipeline configuration
├── tests/ # Unit tests
└── .github/workflows/ci.yml # GitHub Actions CI
- BLOSUM62: Mean of per-residue BLOSUM62 substitution score rows. 20-dimensional. The 2025 benchmarking study in Briefings in Bioinformatics found this significantly outperforms transformer-based embeddings.
- ESM-2: Facebook/Meta's protein language model (6-layer, 8M params). 320-dimensional mean-pooled hidden states. Included for comparison with handcrafted features.
- k-mer BoW: TF-IDF weighted 3-mer bag-of-words. Lightweight computational baseline.
- BLOSUM62: Applied to CDR3 alpha and beta chains, concatenated.
- ESM-2: Applied to full-length TCR sequences (after Stitchr reconstruction).
- TCRdist3: Prototype-based features using the TCR-specific distance metric.
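
To make the k-mer baseline listed above concrete, here is a minimal standard-library sketch of TF-IDF weighted 3-mer counting. The function names, the example CDR3 sequences, and the particular smoothed IDF formula are assumptions for illustration, not necessarily what `kmer.py` implements.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return the overlapping k-mers of an amino-acid sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_kmer_vectors(seqs, k=3):
    """Encode each sequence as a TF-IDF weighted k-mer count vector.

    Returns (vocabulary, vectors), where vectors[i][j] is the weight of
    vocabulary[j] in seqs[i]. Assumes a smoothed IDF of the form
    ln((1 + n) / (1 + df)) + 1.
    """
    counts = [Counter(kmers(s, k)) for s in seqs]
    vocabulary = sorted(set().union(*counts))
    n = len(seqs)
    # Document frequency: number of sequences that contain each k-mer.
    df = Counter(kmer for c in counts for kmer in c)
    idf = {kmer: math.log((1 + n) / (1 + df[kmer])) + 1 for kmer in vocabulary}
    vectors = [[c[kmer] * idf[kmer] for kmer in vocabulary] for c in counts]
    return vocabulary, vectors


# Example CDR3 beta sequences used purely as input for the sketch.
cdr3s = ["CASSIRSSYEQYF", "CASSLAPGATNEKLFF", "CASSIRSAYEQYF"]
vocab, vecs = tfidf_kmer_vectors(cdr3s)
print(len(vocab), "distinct 3-mers;", len(vecs), "vectors")
```

3-mers shared by every sequence, such as the conserved `CAS` prefix, receive the minimum weight, while 3-mers found in a single sequence are up-weighted.
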
- Logistic Regression: Interpretable baseline with L2 regularization.
- Random Forest: Ensemble method robust to feature noise.
- Bayesian Neural Network: PyTorch + Pyro implementation with uncertainty quantification via posterior sampling.
- NetTCR-2.2: External published model for comparison (successor to NetTCR-2.0; Montemurro et al., 2021).
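
Uncertainty quantification via posterior sampling, as listed for the Bayesian Neural Network, amounts to running several forward passes, each with weights drawn from the approximate posterior, and summarizing the spread of the resulting binding probabilities. The sketch below shows only that summarizing step, using the standard library. The function name, the 0.5 threshold, and the sample values are hypothetical; in the pipeline the samples would come from the Pyro model rather than a hard-coded list.

```python
from statistics import mean, pstdev


def summarize_posterior(prob_samples, threshold=0.5):
    """Collapse posterior predictive samples for one TCR-epitope pair.

    prob_samples: binding probabilities, one per posterior weight draw.
    Returns the mean probability, its standard deviation across draws
    (a simple uncertainty score), and the thresholded binary call.
    """
    p_mean = mean(prob_samples)
    return {
        "mean": p_mean,
        "std": pstdev(prob_samples),
        "binds": p_mean >= threshold,
    }


# Hypothetical samples from five posterior draws for two candidate pairs.
confident = summarize_posterior([0.91, 0.93, 0.90, 0.92, 0.94])
uncertain = summarize_posterior([0.15, 0.85, 0.40, 0.95, 0.25])
print(confident)
print(uncertain)
```

Both pairs are called binders, but the second has a much larger spread across draws, which is the signal a downstream user could use to flag the call as unreliable.
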
- Nature Methods (2025). "Assessment of computational methods in predicting TCR-epitope binding recognition." Benchmarked 50 models across 762 epitopes.
- Briefings in Bioinformatics (2025). "A comprehensive benchmarking for evaluating TCR embeddings." BLOSUM62 outperforms transformer approaches.
- Heather JM et al. (2022). "Stitchr: stitching coding TCR nucleotide sequences." NAR.
- Montemurro A et al. (2021). "NetTCR-2.0: paired TCRalpha and TCRbeta sequence data." Communications Biology.
- Mayer-Blackwell K et al. (2021). "TCR meta-clonotypes with tcrdist3." eLife.
- Genome Biology (2025). "Empowering bioinformatics with Nextflow and nf-core."
MIT License. See LICENSE.
Kuiting Tao (kuiting.tao@gmail.com)