🔥🔥🔥 A survey on data-centric foundation models in computational healthcare
Last updated: 2024/10/08
📝 If you find this repository helpful, please cite our survey. Thanks!
@article{zhang2024data,
title={Data-Centric Foundation Models in Computational Healthcare: A Survey},
author={Zhang, Yunkun and Gao, Jin and Tan, Zheling and Zhou, Lingfeng and Ding, Kexin and Zhou, Mu and Zhang, Shaoting and Wang, Dequan},
journal={arXiv preprint arXiv:2401.02458},
year={2024}
}
In this repository, we provide an up-to-date list of healthcare-related foundation models and datasets, which are also mentioned in our survey paper.
📖 Contents
An asterisk (*) in the Pre-Training Data column indicates that the authors assembled the data from more than three sources.
| Model | Subfield | Paper | Code | Base | Pre-Training Data |
|---|---|---|---|---|---|
| nach0 | Molecules | nach0: Multimodal Natural and Chemical Languages Foundation Model | Github | T5 | * |
| MoleculeSTM | Drug | Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing | Github | CLIP | PubChem |
| AlphaMissense | Proteomics | Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense | Github | AlphaFold | PDB + UniRef |
| GET | Genomics | GET: A Foundation Model of Transcription across Human Cell Types | Huggingface | Transformer | * |
| GIT-Mol | Molecules | GIT-Mol: A Multi-Modal Large Language Model for Molecular Science with Graph, Image, and Text | Github | T5 + BLIP-2 | PubChem |
| ESM-2 | Proteomics | Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model | Github | Transformer | UniRef |
| AlphaFold 2 | Proteomics | Highly Accurate Protein Structure Prediction with AlphaFold | Github | - | PDB + Uniclust30 |
| Model | Subfield | Paper | Code | Base | Pre-Training Data |
|---|---|---|---|---|---|
| OmniNA | Nucleotide sequence | OmniNA: A Foundation Model for Nucleotide Sequences | - | LLaMA | NCBI |
| LaBraM | EEG | Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI | - | Transformer | * |
| Neuro-GPT | EEG | Neuro-GPT: Developing A Foundation Model for EEG | - | - | TUH EEG |
| Dataset (Paper) | Description | Link |
|---|---|---|
| MedBench (arXiv) | A Chinese medical LLM benchmark with 300,901 Chinese questions covering 43 clinical specialties, combined with an automatic evaluation system | Official site |
| MMedBench (arXiv) | A multilingual medical QA benchmark, where questions are categorized into 21 topics | Github |
| MMedC (arXiv) | A multilingual medical corpus containing over 25.5B tokens | Github |
| BiMed1.3M (arXiv) | An English and Arabic bilingual dataset of 1.3M samples of medical QA and chat | Github |
| GAP-Replay (arXiv) | 48.1B tokens from 4 medical corpora including guidelines, abstracts, papers, and replay | Github |
| Huatuo-26M (arXiv) | 26M Chinese medical QA pairs | Github |
| Medical Meadow (arXiv) | 16M medical QA pairs collected from 9 sources | Github |
| MultiMedQA (Nature) | 6 existing medical QA datasets and 1 collected online | Nature |
| BigBio (Nature) | 126+ biomedical NLP datasets covering 13 task categories and 10+ languages | Github |
| MedMCQA (MLR) | 194K multiple-choice questions covering 2.4K healthcare topics | Official site |
| MedQA-USMLE (MDPI) | 61,097 multiple-choice questions based on USMLE in three languages | Github |
| CBLUE (arXiv) | A Chinese biomedical language understanding evaluation benchmark with 18 datasets | Official site |
| BLURB (arXiv) | 13 biomedical NLP datasets in 6 tasks | Official site |
| PubMedQA (arXiv) | 1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated biomedical QA instances | Official site |
| BLUE (arXiv) | 5 language tasks with 10 biomedical and clinical text datasets | Github |
| webMedQA (BMC) | 63,284 real-world Chinese medical questions with over 300K answers | Github |
| MedMentions (arXiv) | 4,392 papers annotated by experts with mentions of UMLS entities | Github |
| MIMIC-III (Nature) | Critical care data for over 40,000 patients | Official site |
| ClinicalTrials.gov | An online database of clinical research studies, including clinical trials and observational studies | Official site |
| Dataset (Paper) | Description | Link |
|---|---|---|
| Mass-100K (arXiv) | 100M tissue patches from 100,426 diagnostic H&E WSIs across 20 major tissue types | - |
| RETFound (Nature) | Unannotated retinal images, containing 904,170 CFPs and 736,442 OCT scans | Nature |
| AbdomenAtlas-8K (arXiv) | 8,448 CT volumes with per-voxel annotations of eight abdominal organs | Github |
| Med-MNIST v2 (Nature) | 12 2D and 6 3D datasets for biomedical image classification | Official site |
| EchoNet-Dynamic (Nature) | 10,030 expert-annotated echocardiogram videos | Official site |
| CheXpert (arXiv) | 224,316 chest radiographs of 65,240 patients | Official site |
| Kather Colon Dataset (PMC) | 100K histological images of human colorectal cancer and healthy tissue | Zenodo |
| DeepLesion (PMC) | 32K CT scans with annotations and semantic labels from radiological reports | NIH |
| ChestXray-NIHCC (arXiv) | 100K radiographs with labels from more than 30,000 patients | NIH |
| ISIC | An archive containing 23K labeled skin lesion images | Official site |
| Dataset (Paper) | Description | Link |
|---|---|---|
| 1000 Genomes Project (Nature) | A comprehensive catalog of human genetic variations | Official site |
| ENCODE (Nature) | A genomics data platform and encyclopedia with integrative-level and ground-level annotations | NIH |
| dbSNP (NIH) | A collection of human single nucleotide variations, microsatellites, and small-scale insertions and deletions | NIH |
| Dataset (Paper) | Description | Link |
|---|---|---|
| DrugChat (arXiv) | 143,517 question-answer pairs covering 10,834 drug compounds, collected from PubChem and ChEMBL | Github |
| PubChem (NIH) | A collection of 900+ sources of chemical information data | NIH |
| DrugBank (NIH) | A web-enabled structured database of molecular information about drugs | Official site |
| ChEMBL (NIH) | 20M bioactivity measurements for 2.4M distinct compounds and 15K protein targets | Official site |
| Dataset (Paper) | Description | Link |
|---|---|---|
| RadGenome-Chest CT (arXiv) | A dataset of 3D chest CT, including 197 organ-level segmentation masks, 665K multi-granularity grounded reports, and 1.3M grounded VQA pairs | - |
| OmniMedVQA (arXiv) | 131,813 question-answering items with 120,530 images from 12 modalities and 26 human anatomical regions, collected from 75 medical datasets | - |
| SAT-DS (arXiv) | 11,462 scans with 142,254 segmentation annotations spanning 8 human body regions from 31 medical image segmentation datasets, together with domain knowledge from e-Anatomy and UMLS | Github |
| PathChatInstruct (arXiv) | 257,004 instructions of pathology-specific queries with image and text | - |
| Chi-Med-VL (arXiv) | 580,014 image-text pairs and 469,441 question-answer pairs for general healthcare in Chinese | Github |
| MedMD (arXiv) | 15.5M 2D scans and 180K 3D radiology scans with textual descriptions | Github |
| OpenPath (Nature) | 208,414 pathology images paired with natural language descriptions | Huggingface |
| Quilt-1M (arXiv) | 1M image-text pairs for histopathology | Github |
| Med-MMHL (arXiv) | Human- and LLM-generated misinformation detection dataset | Github |
| Mol-Instructions (arXiv) | 148K molecule-oriented, 505K protein-oriented, and biomolecular text instructions | Huggingface |
| PathInstruct (arXiv) | 180K samples of LLM-generated instruction-following data | Github |
| PMC-VQA (arXiv) | 227K VQA pairs on 149K images covering various modalities or diseases | Github |
| PMC-OA (arXiv) | 1.6M fine-grained biomedical image-text pairs | Github |
| PathCap (arXiv) | 142K pathology image-caption pairs from various sources | Github |
| SwissProtCLAP (arXiv) | 441K text-protein sequence pairs | Github |
| MIMIC-IV (Nature) | Clinical information for hospital stays of over 60,000 patients | Official site |
| MIMIC-CXR (Nature) | 227,835 chest imaging studies with free-text reports for 65,379 patients | PhysioNet |
| TCGA | A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types | Official site |