Kweiting-Tao/tcr-epitope-predict

TCR-Epitope Binding Prediction Pipeline

A reproducible bioinformatics pipeline for predicting T-cell receptor (TCR) specificity for peptide epitopes, using multiple sequence embedding strategies and machine learning models.

Overview

This pipeline addresses a core challenge in computational immunology: predicting which TCRs bind to specific peptide epitopes. It integrates data from three major public TCR databases, applies multiple embedding methods, and benchmarks several ML approaches including a Bayesian Neural Network for uncertainty quantification.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    TCR-Epitope Predict Pipeline                  │
├─────────┬────────────┬──────────┬────────────┬─────────────────┤
│  Data   │  StitchR   │Embedding │ Prediction │   Evaluation    │
│Ingestion│ Assembly   │          │            │                 │
├─────────┼────────────┼──────────┼────────────┼─────────────────┤
│ VDJdb   │ V/J gene → │ BLOSUM62 │ Logistic   │ ROC / PR curves │
│ McPAS   │ Full-length│ ESM-2    │ Regression │ Model heatmap   │
│ IEDB    │ TCR seqs   │ k-mer    │ Random     │ BNN uncertainty │
│(train/  │            │ TCRdist3 │ Forest     │ Cross-DB valid. │
│ valid)  │            │          │ BNN (Pyro) │                 │
│         │            │  + UMAP  │ NetTCR-2.2 │                 │
└─────────┴────────────┴──────────┴────────────┴─────────────────┘

Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Workflow engine | Nextflow DSL2 | Industry standard in EU biotech/pharma; nf-core compatible |
| Training data | VDJdb + McPAS | Similar curation, good epitope coverage |
| Validation data | IEDB (held-out) | Independent curation reduces the risk of data leakage |
| Primary embedding | BLOSUM62 | Outperforms PLMs per Briefings in Bioinformatics 2025 |
| Comparison embedding | ESM-2 | State-of-the-art PLM for protein sequences |
| Uncertainty | BNN via Pyro | Calibrated confidence for clinical relevance |
| CV strategy | Epitope-grouped | Avoids inflated performance estimates per Nature Methods 2025 |
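
To illustrate the epitope-grouped CV strategy, here is a minimal sketch in plain Python. It is not the code in `cross_validation.py`; the function names and the greedy balancing rule are assumptions made for illustration. The idea is simply that every sample sharing an epitope lands in the same fold, so no epitope seen in training appears in the test split.

```python
from collections import defaultdict


def epitope_grouped_folds(epitopes, n_folds=5):
    """Assign sample indices to folds so that each epitope stays in one fold.

    Hypothetical helper: groups are placed largest first, each into the
    currently smallest fold, to keep fold sizes roughly balanced.
    """
    groups = defaultdict(list)
    for idx, epitope in enumerate(epitopes):
        groups[epitope].append(idx)
    if len(groups) < n_folds:
        raise ValueError("need at least as many distinct epitopes as folds")

    folds = [[] for _ in range(n_folds)]
    # Sort by group size (descending), then by name for deterministic output.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, indices in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(indices)
    return folds


def train_test_splits(epitopes, n_folds=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = epitope_grouped_folds(epitopes, n_folds)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train_idx), sorted(test_idx)


# Toy labels for illustration only, not taken from the pipeline's data.
epitopes = (["GILGFVFTL"] * 4 + ["NLVPMVATV"] * 3
            + ["GLCTLVAML"] * 2 + ["ELAGIGILTV"])
splits = list(train_test_splits(epitopes, n_folds=3))
for train_idx, test_idx in splits:
    train_eps = {epitopes[i] for i in train_idx}
    test_eps = {epitopes[i] for i in test_idx}
    print(len(train_idx), len(test_idx), train_eps.isdisjoint(test_eps))
```

With scikit-learn installed, `sklearn.model_selection.GroupKFold` with the epitope as the group label gives the same guarantee.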

Quick Start

Using Docker (recommended)

# Pull and run the full pipeline
docker-compose up pipeline

# With GPU support for ESM-2 embedding
docker-compose up pipeline-gpu

Using Python

# Install
pip install -e .

# Run individual steps
tcr-predict download -o data/raw
tcr-predict clean -i data/raw -o data/processed
tcr-predict stitchr -i data/processed/paired_tcr_epitope.tsv
tcr-predict embed -i data/processed/tcr_full_length.tsv -o data/results/embeddings
tcr-predict predict -i data/results/embeddings -o data/results/predictions
tcr-predict evaluate -i data/results/predictions -o results/evaluation

# Or run everything at once
tcr-predict run-all
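
The individual steps above can also be chained from a small Python driver, for example when one step needs to be swapped out or rerun. This is only a sketch: the command strings are copied from the listing above, while `run_steps` and its `dry_run` flag are hypothetical names, and whether `run-all` executes exactly this sequence is an assumption.

```python
import shlex
import subprocess

# Commands copied verbatim from the step-by-step listing in this README.
STEPS = [
    "tcr-predict download -o data/raw",
    "tcr-predict clean -i data/raw -o data/processed",
    "tcr-predict stitchr -i data/processed/paired_tcr_epitope.tsv",
    "tcr-predict embed -i data/processed/tcr_full_length.tsv -o data/results/embeddings",
    "tcr-predict predict -i data/results/embeddings -o data/results/predictions",
    "tcr-predict evaluate -i data/results/predictions -o results/evaluation",
]


def run_steps(steps, dry_run=True):
    """Parse each step into an argv list and, unless dry_run, execute in order.

    subprocess.run(..., check=True) raises CalledProcessError on a non-zero
    exit code, so a failing step stops the chain.
    """
    commands = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in commands:
            subprocess.run(argv, check=True)
    return commands


# Dry run: only show what would be executed.
commands = run_steps(STEPS, dry_run=True)
for argv in commands:
    print(" ".join(argv))
```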

Using Nextflow

# With Docker
nextflow run main.nf -profile docker

# With Conda
nextflow run main.nf -profile conda

# Preview the workflow without executing any process
nextflow run main.nf -profile test --skip_download -preview

Project Structure

tcr-epitope-predict/
├── main.nf                      # Nextflow pipeline entry point
├── nextflow.config              # Nextflow configuration
├── Dockerfile                   # Container definition
├── docker-compose.yml           # Multi-service orchestration
├── pyproject.toml               # Python package configuration
├── environment.yml              # Conda environment
│
├── workflows/                   # Nextflow sub-workflows
│   ├── data_acquisition.nf
│   ├── data_integration.nf
│   ├── stitchr_assembly.nf
│   ├── embedding.nf
│   ├── prediction.nf
│   └── evaluation.nf
│
├── src/tcr_epitope/             # Python package
│   ├── cli.py                   # Command-line interface
│   ├── data/
│   │   ├── downloaders.py       # Database downloaders
│   │   ├── cleaners.py          # Format-specific cleaners
│   │   ├── integrator.py        # Multi-DB integration
│   │   └── statistics.py        # Data visualization
│   ├── assembly/
│   │   └── stitchr_runner.py    # Full-length TCR reconstruction
│   ├── embedding/
│   │   ├── blosum.py            # BLOSUM62 encoding
│   │   ├── esm2.py              # ESM-2 protein LM
│   │   ├── kmer.py              # k-mer bag-of-words
│   │   └── tcrdist_features.py  # TCRdist3 features
│   ├── visualization/
│   │   ├── umap_plot.py         # UMAP + Plotly
│   │   └── evaluation_plots.py  # ROC, PR, heatmaps
│   ├── models/
│   │   ├── classical.py         # LR, Random Forest
│   │   ├── bnn.py               # Bayesian NN (Pyro)
│   │   └── nettcr_wrapper.py    # NetTCR-2.2 integration
│   └── evaluation/
│       ├── cross_validation.py  # Epitope-grouped CV
│       └── benchmark.py         # Multi-model comparison
│
├── notebooks/                   # Analysis notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_embedding_analysis.ipynb
│   ├── 03_model_training.ipynb
│   └── 04_results_comparison.ipynb
│
├── configs/default.yaml         # Pipeline configuration
├── tests/                       # Unit tests
└── .github/workflows/ci.yml     # GitHub Actions CI

Embedding Methods

Epitope Embeddings

  • BLOSUM62: Mean of the per-residue BLOSUM62 substitution-score rows, giving a 20-dimensional vector per sequence. The 2025 benchmarking study in Briefings in Bioinformatics (reference 2) reports that this encoding outperforms transformer-based embeddings.
  • ESM-2: Meta AI's protein language model (6-layer, 8M-parameter variant). Hidden states are mean-pooled into a 320-dimensional vector. Included for comparison with handcrafted features.
  • k-mer BoW: TF-IDF weighted 3-mer bag-of-words. Lightweight computational baseline.
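
As a rough illustration of the k-mer baseline, the sketch below builds TF-IDF weighted 3-mer vectors using only the standard library. It is not the code in `kmer.py`. The function names are hypothetical, and the weighting variant (term frequency normalised by sequence length, smoothed IDF `log((1 + N) / (1 + df)) + 1`) is an assumption; the pipeline may use a different one.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_kmer_vectors(sequences, k=3):
    """Return (vocabulary, vectors) with one TF-IDF vector per sequence."""
    counts = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted({km for c in counts for km in c})
    n_docs = len(sequences)
    # Document frequency: number of sequences containing each k-mer.
    df = Counter(km for c in counts for km in c)
    idf = {km: math.log((1 + n_docs) / (1 + df[km])) + 1.0 for km in vocab}

    vectors = []
    for c in counts:
        total = sum(c.values())
        # Sequences shorter than k have no k-mers and map to a zero vector.
        vectors.append(
            [(c[km] / total) * idf[km] if total else 0.0 for km in vocab]
        )
    return vocab, vectors


# Example epitope strings for illustration only.
epitopes = ["GILGFVFTL", "NLVPMVATV", "GLCTLVAML"]
vocab, vectors = tfidf_kmer_vectors(epitopes)
print(len(vocab), [sum(1 for x in v if x > 0) for v in vectors])
```

The vector length equals the number of distinct 3-mers in the input set, so the vocabulary should be fitted on training data only and reused for validation data.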

TCR Embeddings

  • BLOSUM62: Applied to CDR3 alpha and beta chains, concatenated.
  • ESM-2: Applied to full-length TCR sequences (after Stitchr reconstruction).
  • TCRdist3: Prototype-based features using the TCR-specific distance metric.
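
The mean-pooled, concatenated encoding described for BLOSUM62 can be sketched as follows. This is an illustration, not the code in `blosum.py`. To stay dependency-free, the example uses a one-hot stand-in matrix, which is NOT BLOSUM62; in practice the real matrix would be passed in instead, for example loaded with Biopython's `Bio.Align.substitution_matrices.load("BLOSUM62")`. The function names and the example CDR3 strings are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mean_pool(seq, matrix):
    """Average the matrix rows of all residues in seq (unknown residues skipped)."""
    dim = len(next(iter(matrix.values())))
    rows = [matrix[res] for res in seq if res in matrix]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


def embed_paired_cdr3(cdr3_alpha, cdr3_beta, matrix):
    """Concatenate the mean-pooled alpha-chain and beta-chain vectors."""
    return mean_pool(cdr3_alpha, matrix) + mean_pool(cdr3_beta, matrix)


# Stand-in scoring matrix (one-hot rows), used here only so the example runs
# without external packages. Replace with real BLOSUM62 rows restricted to
# the 20 standard amino acids to reproduce the encoding described above.
standin_matrix = {
    a: [1.0 if a == b else 0.0 for b in AMINO_ACIDS] for a in AMINO_ACIDS
}

vec = embed_paired_cdr3("CAVRDSNYQLIW", "CASSIRSSYEQYF", standin_matrix)
print(len(vec))
```

With a 20-column matrix, each chain contributes 20 dimensions, so a paired TCR maps to a 40-dimensional vector regardless of CDR3 length.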

ML Models

  • Logistic Regression: Interpretable baseline with L2 regularization.
  • Random Forest: Ensemble method robust to feature noise.
  • Bayesian Neural Network: PyTorch + Pyro implementation with uncertainty quantification via posterior sampling.
  • NetTCR-2.2: External published model for comparison. The listed reference (Montemurro et al., 2021) describes its predecessor, NetTCR-2.0.
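
To make "uncertainty quantification via posterior sampling" concrete, the sketch below summarises a set of predicted binding probabilities for one TCR-epitope pair, each obtained from a different posterior draw of the network weights. It is not the code in `bnn.py`. The function names are hypothetical, the probability values are made up for illustration, and the entropy-based decomposition is one common choice that the pipeline may or may not use. With Pyro, such samples would typically come from something like `pyro.infer.Predictive`.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def summarize_posterior(prob_samples):
    """Summarise per-draw probabilities into a prediction plus uncertainty terms.

    total     = entropy of the mean prediction
    aleatoric = mean entropy of the individual draws
    epistemic = total - aleatoric (disagreement between draws)
    """
    mean_p = sum(prob_samples) / len(prob_samples)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in prob_samples) / len(prob_samples)
    return {
        "mean": mean_p,
        "total": total,
        "aleatoric": aleatoric,
        "epistemic": total - aleatoric,
    }


# Made-up posterior draws: the first set agrees, the second disagrees strongly.
confident = summarize_posterior([0.91, 0.93, 0.90, 0.92])
uncertain = summarize_posterior([0.10, 0.90, 0.20, 0.80])
print(confident["epistemic"] < uncertain["epistemic"])
```

A high epistemic term flags pairs where the posterior draws disagree, which is the case a point-estimate model such as logistic regression cannot distinguish from a genuinely borderline prediction.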

References

  1. Nature Methods (2025). "Assessment of computational methods in predicting TCR-epitope binding recognition." Benchmarked 50 models across 762 epitopes.
  2. Briefings in Bioinformatics (2025). "A comprehensive benchmarking for evaluating TCR embeddings." BLOSUM62 outperforms transformer approaches.
  3. Heather JM et al. (2022). "Stitchr: stitching coding TCR nucleotide sequences." NAR.
  4. Montemurro A et al. (2021). "NetTCR-2.0: paired TCRalpha and TCRbeta sequence data." Communications Biology.
  5. Mayer-Blackwell K et al. (2021). "TCR meta-clonotypes with tcrdist3." eLife.
  6. Genome Biology (2025). "Empowering bioinformatics with Nextflow and nf-core."

License

MIT License. See LICENSE.

Author

Kuiting Tao (kuiting.tao@gmail.com)
