A content-based image retrieval (CBIR) connectome that visualizes tumor-level similarity between 289 lung cancer patients from the National Lung Screening Trial (NLST). For each patient, the tumor region is defined by bounding boxes from the NLST-Sybil analysis results collection, and feature embeddings are extracted from these regions using 9 foundation models. The connectome shows which tumors look most (and least) alike according to each model.
This interactive map shows how similar lung tumors look to AI. Each dot is a real lung cancer patient from a large U.S. clinical study. Researchers took CT scans of 289 patients, identified the tumor in each scan, and fed those tumor regions into 9 different AI models. Each AI model converts the tumor image into a mathematical fingerprint (called an embedding) — a list of hundreds of numbers that capture the visual features of the tumor as the AI "sees" it. We then compared all fingerprints to find which tumors look most — and least — similar to each other.
For each patient, the AI compares their tumor fingerprint against those of the other 288 patients to find the 5 closest (most similar) and 5 farthest (least similar) matches:
- Most Similar: Each patient is connected by lines to the 5 patients whose tumors the AI considers most alike. The higher the similarity percentage on a line, the more alike the two tumors look to the AI.
- Least Similar: The reverse — the 5 patients whose tumors the AI sees as looking the most different, highlighting outlier relationships and unusual tumors.
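The selection of closest and farthest matches can be sketched as follows. This is a minimal illustration, not the project's BigQuery implementation: the function name `rank_matches`, the nested-dictionary layout, and the toy distances are all assumptions, and `k=2` is used only to keep the example small.

```python
def rank_matches(distances, patient, k=5):
    """Return (closest, farthest) lists of k patient ids for `patient`.

    `distances` is a hypothetical dict: patient id -> {other id -> cosine distance}.
    """
    # Sort the other patients by distance, smallest first.
    ranked = sorted(
        (dist, other)
        for other, dist in distances[patient].items()
        if other != patient
    )
    closest = [other for _, other in ranked[:k]]
    # Take the last k entries and reverse them so the farthest comes first.
    farthest = [other for _, other in ranked[-k:][::-1]]
    return closest, farthest


# Toy distances from one patient to four others (made-up values).
toy = {"P1": {"P2": 0.05, "P3": 0.40, "P4": 0.10, "P5": 0.75}}
closest, farthest = rank_matches(toy, "P1", k=2)
print(closest)   # ['P2', 'P4']
print(farthest)  # ['P5', 'P3']
```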
Similarity is measured using cosine distance, which compares the "direction" of two fingerprint vectors. Think of each fingerprint as an arrow pointing in some direction in a high-dimensional space. If two arrows point nearly the same way, the angle between them is small and the tumors are very similar (cosine distance close to 0). If they point in unrelated directions, the tumors are very different (cosine distance close to 1; values up to 2 are possible when the arrows point in opposite directions). The similarity % shown on edges is (1 − cosine distance) × 100, so 95% means nearly identical fingerprints, while 50% means quite different. Cosine distance measures the pattern of features rather than their overall strength, so two tumors can be "similar" even if one has a stronger signal, as long as the relative feature patterns match.
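As a rough sketch of this calculation, the snippet below computes cosine distance and the similarity percentage using only the Python standard library. The function names and the three-element toy vectors are illustrative assumptions; real embeddings have hundreds of dimensions and the project computes distances in BigQuery.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle) between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity_percent(a, b):
    """Similarity as shown on edges: (1 - cosine distance) * 100."""
    return (1.0 - cosine_distance(a, b)) * 100.0


tumor_a = [1.0, 2.0, 3.0]
tumor_b = [2.0, 4.0, 6.0]    # same pattern as tumor_a, twice the magnitude
tumor_c = [3.0, 0.0, -1.0]   # orthogonal to tumor_a (dot product is 0)

print(round(similarity_percent(tumor_a, tumor_b), 1))  # about 100: same direction
print(round(similarity_percent(tumor_a, tumor_c), 1))  # about 0: unrelated direction
```

The first pair illustrates the point about signal strength: scaling a vector does not change its direction, so the similarity stays at roughly 100%.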
- Dot size: Patients who appear as a top match for many others get bigger dots — they represent "typical" tumors that many other tumors resemble.
- Clustering: The graph uses a physics simulation that pulls connected patients closer together. Visible clusters mean the AI found groups of similar-looking tumors.
- Color: Dots can be colored by clinical features (cancer type, sex, age, etc.) to see whether AI-perceived similarity correlates with clinical characteristics.
- Different models, different views: Each of the 9 AI models was trained differently and focuses on different features. Switching models may reveal different groupings — some may cluster by tumor size, others by texture or shape.
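The dot-size rule described above amounts to counting how often each patient appears in other patients' match lists (the node's in-degree). The sketch below is one way to express that. The `in_degree` and `node_size` helpers and the `base` and `step` values are hypothetical, not the dashboard's actual sizing formula.

```python
from collections import Counter


def in_degree(top_matches):
    """Count how many times each patient appears in others' match lists.

    `top_matches` is a hypothetical dict: patient id -> list of matched ids.
    """
    counts = Counter(m for matches in top_matches.values() for m in matches)
    return {patient: counts.get(patient, 0) for patient in top_matches}


def node_size(degree, base=6.0, step=2.0):
    """Illustrative linear sizing: larger in-degree gives a larger dot."""
    return base + step * degree


# Toy match lists with 2 matches per patient (made-up data).
toy_matches = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P1"],
    "P4": ["P3", "P2"],
}
degrees = in_degree(toy_matches)
print(degrees)  # {'P1': 2, 'P2': 3, 'P3': 3, 'P4': 0}
print({p: node_size(d) for p, d in degrees.items()})
```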
The NLST-Sybil collection contains tumor annotations for a larger set of patients, but some patients have multiple annotated lesions. To simplify the analysis to one embedding per patient per model, only patients with exactly one annotated lesion are included, resulting in 289 patients. This avoids the ambiguity of how to compare patients with multiple tumors (e.g., closest lesion, average embedding, all pairwise combinations).
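A minimal sketch of this inclusion rule, assuming the annotations are available as `(patient_id, lesion_id)` pairs. That layout and the function name are assumptions made for illustration; the actual pipeline works on NLST-Sybil Structured Reports.

```python
from collections import Counter


def single_lesion_patients(annotations):
    """Return sorted ids of patients with exactly one annotated lesion.

    `annotations` is a hypothetical iterable of (patient_id, lesion_id) pairs.
    """
    # De-duplicate pairs first so a repeated row is not counted as a second lesion.
    counts = Counter(patient_id for patient_id, _ in set(annotations))
    return sorted(pid for pid, n in counts.items() if n == 1)


toy_annotations = [
    ("A", "a1"),
    ("B", "b1"),
    ("B", "b2"),  # patient B has two lesions and is excluded
    ("C", "c1"),
]
print(single_lesion_patients(toy_annotations))  # ['A', 'C']
```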
All imaging data, lesion annotations and clinical data are sourced from NCI Imaging Data Commons (IDC).
This project originated at NA-MIC Project Week 44 (Gran Canaria, 2026).
The visualization shows all 289 patients simultaneously as a force-directed graph. Each node is a patient; edges connect each patient to its 5 most similar (or most dissimilar) matches based on cosine distance between tumor region embeddings. Nodes cluster spatially by embedding similarity — patients whose tumors have similar feature representations appear closer together.
- Full Network Visualization: All patients rendered simultaneously in a force-directed layout — nodes sized by connectivity, edges weighted by cosine distance.
- Multi-Model Support: Compare results across 9 foundation models (CTClipVit, CTFM, FMCIB, Merlin, ModelsGen, PASTA, SUPREME, VISTA3D, Voco).
- Most Similar / Least Similar Toggle: Switch between viewing the 5 closest and 5 farthest matches per patient to explore both similarity clusters and outlier relationships.
- Clinical Coloring: Color nodes by 6 clinical facets — Sex, Age, Race, Smoking Status, Cancer Type, and Stage.
- IDC Viewer Integration: Click any node to open the CT scan and NLST-Sybil tumor annotation SR in the IDC OHIF Viewer.
- Self-Contained: All data embedded directly in the HTML (~700 KB) — no server or runtime data fetching needed.
- Interactive Detail Panel: Click any node to see clinical data, IDC viewer links, and outgoing/incoming match lists.
- Imaging Data: Low-dose CT scans from the NLST collection on IDC. All imaging data accessed via IDC is subject to the IDC Data Use Agreement.
- Tumor Annotations: Bounding boxes of suspicious lesions produced by Sybil (Mikhael et al., JCO 2023), converted to DICOM Structured Reports and hosted on IDC as the NLST-Sybil analysis results collection (Krishnaswamy, Clunie & Fedorov, 2025). These bounding boxes define the tumor regions from which foundation model embeddings are extracted.
- Foundation Model Embeddings: Each of the 9 foundation models (see below) is applied to the tumor region defined by the NLST-Sybil bounding box to produce a feature vector per patient. Pairwise cosine distances between all 289 patients are computed via BigQuery.
- Clinical Metadata: Age, sex, race, and smoking status queried from the NLST clinical tables via idc-index.
- Cancer Annotations: Cancer type (ICD-O-3 morphology) and AJCC stage from the NLST clinical data.
| Model | Reference |
|---|---|
| CT-CLIP (CTClipVit) | Hamamci et al., Nature Biomedical Engineering (2025). doi:10.1038/s41551-025-01599-y |
| CT-FM | Pai et al., "Vision Foundation Models for Computed Tomography" (2025). arXiv:2501.09001 |
| FMCIB | Pai et al., Nature Machine Intelligence 6, 354–367 (2024). doi:10.1038/s42256-024-00807-9 |
| Merlin | Blankemeier et al., "Merlin: A Vision Language Foundation Model for 3D CT" (2024). arXiv:2406.06512 |
| ModelsGenesis | Zhou et al., MICCAI 11767, 384–393 (2019). doi:10.1007/978-3-030-32251-9_42 |
| PASTA | Lei et al., "A Data-Efficient Pan-Tumor Foundation Model for Oncology CT" (2025). arXiv:2501.10785 |
| SuPreM | Li, Yuille & Zhou, "How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?" (2025). arXiv:2501.11253 |
| VISTA3D | He et al., "VISTA3D: Versatile Imaging SegmenTation and Annotation Model for 3D CT" (2024). arXiv:2406.05285 |
| VoCo | Wu, Zhuang & Chen, IEEE TPAMI (2025). doi:10.1109/TPAMI.2025.3639593 |
The data pipeline runs in Google Colab using BigQuery:
- Tumor Region Definition: NLST-Sybil bounding boxes define the tumor region of interest for each patient.
- Embedding Extraction: Each of the 9 foundation models extracts a feature vector from the tumor region.
- Distance Calculation: Pairwise cosine distances computed between all 289 patients for each model.
- SR Mapping: BigQuery queries link patients to their NLST-Sybil tumor annotation Structured Reports by navigating the DICOM ReferencedSeriesSequence hierarchy.
- URL Generation: Constructs IDC OHIF Viewer URLs that open both the CT series and the NLST-Sybil SR.
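The URL generation step might look roughly like the sketch below. The base URL and the `StudyInstanceUIDs` / `SeriesInstanceUIDs` query parameters are assumptions about the IDC OHIF viewer's URL scheme rather than something stated in this README, and the UIDs are placeholders; check the notebook for the format actually used.

```python
from urllib.parse import urlencode

# Assumed viewer endpoint; not taken from this README.
VIEWER_BASE = "https://viewer.imaging.datacommons.cancer.gov/v3/viewer/"


def viewer_url(study_uid, series_uids):
    """Build a viewer URL that opens several series of one study together."""
    query = urlencode(
        {
            "StudyInstanceUIDs": study_uid,
            # Comma-separated list, e.g. the CT series followed by the SR series.
            "SeriesInstanceUIDs": ",".join(series_uids),
        },
        safe=",",  # keep commas unescaped in the series list
    )
    return f"{VIEWER_BASE}?{query}"


# Placeholder UIDs for illustration only.
url = viewer_url("1.2.3", ["1.2.3.4", "1.2.3.5"])
print(url)
```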
├── LICENSE                                   Apache 2.0
├── README.md
├── docs/
│   └── index.html                            Dashboard (self-contained, ~790 KB)
├── embeddings/
│   └── *_features.pkl                        Embedding pickle files (9 foundation models)
└── notebooks/
    ├── create_connectome_data_table.ipynb    BigQuery pipeline
    └── create_cbir_demo.ipynb                GCS deployment
The dashboard is a fully self-contained single-file web application (docs/index.html) with all data embedded as JavaScript constants. No server, no CSV fetching, no CORS configuration needed.
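One way to produce such a single-file page is to serialize the data to JSON and substitute it into an HTML template as JavaScript constants. The sketch below illustrates the idea only: the `/*__DATA__*/` placeholder, the constant names, and the helper functions are hypothetical and not taken from the project's notebooks.

```python
import json


def js_constant(name, value):
    """Render a Python value as a JavaScript `const` declaration."""
    payload = json.dumps(value, separators=(",", ":"))
    # Escape "</" so embedded strings cannot close the surrounding <script> tag.
    payload = payload.replace("</", "<\\/")
    return f"const {name} = {payload};"


def build_page(template, constants):
    """Replace the hypothetical /*__DATA__*/ marker with generated constants."""
    script = "\n".join(js_constant(n, v) for n, v in constants.items())
    return template.replace("/*__DATA__*/", script)


template = "<html><body><script>/*__DATA__*/</script></body></html>"
page = build_page(
    template,
    {"NODES": [{"id": "P1"}], "EDGES": [["P1", "P2", 95.0]]},
)
print(page)
```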
Simply open docs/index.html in a modern web browser. The only external dependency is Apache ECharts (loaded from CDN). CT slice thumbnails are loaded on demand from the IDC DICOMweb proxy.
- Live Demo: NLST Sybil Tumor Connectome
- IDC Portal: Imaging Data Commons
- NLST on IDC: NLST Collection
See DEVELOPER.md for prerequisites, data pipeline details, and instructions for rebuilding the dashboard.
This project uses data from the NCI Imaging Data Commons (IDC), which is a cloud-based environment containing publicly available cancer imaging data co-located with analysis and exploration tools and resources. IDC is a node within the Cancer Research Data Commons (CRDC) infrastructure of the U.S. National Cancer Institute.
If you use this work, please cite:
Fedorov, A., Longabaugh, W.J.R., Pot, D. et al. NCI Imaging Data Commons. Cancer Res 81, 4188–4193 (2021). https://doi.org/10.1158/0008-5472.CAN-21-0950
Mikhael, P.G., Wohlwend, J., Golia Pernicka, J.S. et al. Sybil: A Validated Deep Learning Model to Predict Future Lung Cancer Risk From a Single Low-Dose Chest CT. J Clin Oncol 41, 2191–2200 (2023). https://doi.org/10.1200/JCO.22.01345
Krishnaswamy, D., Clunie, D. A., & Fedorov, A. (2025). NLST-Sybil: Expert annotations of tumor regions in the NLST CT images [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15643334
Deepa Krishnaswamy and Andrey Fedorov
Brigham and Women's Hospital
March 2026
