feat: clinical layer on GeoData + load NHANES/FHS through the library #203

Open
marcbal77 wants to merge 1 commit into bio-learn:master from marcbal77:feature/clinical-infrastructure

marcbal77 (Member) commented Mar 29, 2026

Summary

Simple foundation for clinical data in biolearn:

  • GeoData.clinical layer (samples-as-rows, biomarkers-as-columns — same orientation as metadata, matches industry convention for tabular clinical data)
  • load_nhanes_as_geodata(year) and load_fhs_as_geodata() so both data sources flow through the library and return a GeoData
  • A biomarker registry with unit conversions, validated end-to-end against real FHS Period 1 data (glucose mg/dL to mmol/L)
  • An example (examples/01_composite_biomarkers/plot_load_nhanes_through_library.py) showing the through-the-library pattern
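The orientation described above can be illustrated with a minimal pandas sketch. The sample IDs, column names, and values here are made up for illustration and are not taken from biolearn or from any real dataset; the only fixed fact is the glucose factor, which follows from a molar mass of about 180.16 g/mol.

```python
import pandas as pd

# Hypothetical toy data: one row per sample, one column per biomarker.
sample_ids = pd.Index(["sample_1", "sample_2"], name="id")
clinical = pd.DataFrame(
    {"glucose_mg_dl": [90.0, 126.0], "albumin_g_l": [42.0, 45.0]},
    index=sample_ids,
)

# Metadata uses the same orientation, so both tables align on the sample index.
metadata = pd.DataFrame({"age": [54, 61], "sex": ["F", "M"]}, index=sample_ids)

# Glucose: mmol/L = mg/dL / 18.016 (molar mass of roughly 180.16 g/mol).
MG_DL_PER_MMOL_L_GLUCOSE = 18.016
clinical["glucose_mmol_l"] = clinical["glucose_mg_dl"] / MG_DL_PER_MMOL_L_GLUCOSE

# Because rows are samples in both tables, a plain index join is enough.
combined = metadata.join(clinical)
```

With samples as rows in both tables, joining clinical values to metadata needs no transposition.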

What changed since the first round

Based on your review:

  • Orientation: Clinical data is now samples-as-rows (industry standard). Each row is an entity (a patient) and each column is an attribute of that entity (a biomarker). Matches metadata.
  • required_features(): Dropped entirely from this PR. The schema-check / missing-data consumer path is a separate PR.
  • UK Biobank preset: Removed. We have not validated the library against real UK Biobank data, so claiming support would be misleading. Swapped in an fhs preset that is validated end-to-end against the real Framingham Heart Study Period 1 data we already load.
  • Canonical units: Fixed albumin (g/L) and creatinine (umol/L) to match what load_nhanes actually returns from the SI columns.
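As a rough sketch of how a registry with canonical units might look, assuming nothing about the actual implementation in this PR: the `Biomarker` class, the `REGISTRY` dict, and the choice of source units are all hypothetical. The canonical units for albumin and creatinine come from the list above, and the conversion factors are the standard ones (creatinine: 1 mg/dL is about 88.4 umol/L).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Biomarker:
    """Hypothetical registry entry: a canonical unit plus known conversions."""

    canonical_unit: str
    converters: Dict[str, Callable[[float], float]] = field(default_factory=dict)

    def to_canonical(self, value: float, unit: str) -> float:
        # Values already in the canonical unit pass through unchanged.
        if unit == self.canonical_unit:
            return value
        if unit not in self.converters:
            raise ValueError(
                f"no conversion from {unit!r} to {self.canonical_unit!r}"
            )
        return self.converters[unit](value)


# Illustrative entries only; the source units are assumptions.
REGISTRY = {
    "glucose": Biomarker("mmol/L", {"mg/dL": lambda v: v / 18.016}),
    "albumin": Biomarker("g/L", {"g/dL": lambda v: v * 10.0}),
    "creatinine": Biomarker("umol/L", {"mg/dL": lambda v: v * 88.4}),
}
```

Raising on an unknown unit, rather than passing the value through, keeps a mislabeled column from silently entering downstream calculations.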

Test plan

  • make test: 198 passed, 5 skipped
  • make format: clean
  • FHS source preset validated end-to-end against the real frmgham2.csv (test_load_fhs_as_geodata_applies_fhs_unit_conversion)
  • load_fhs and load_fhs_as_geodata produce matching glucose values
  • Save/load roundtrip with the clinical layer
  • All 69 existing clocks still pass via the unchanged test_model.py
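The save/load roundtrip check in the list above could take roughly the following shape. This is a standard-library sketch, not the test in this PR: `save_clinical`, `load_clinical`, and `roundtrip` are hypothetical helpers, and the CSV layout (an `id` column followed by one column per biomarker) is an assumption.

```python
import csv
import tempfile
from pathlib import Path


def save_clinical(rows, path):
    """Write {sample_id: {biomarker: value}} as CSV, one row per sample."""
    columns = sorted({name for values in rows.values() for name in values})
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", *columns])
        for sample_id, values in rows.items():
            # Missing measurements are written as empty cells.
            writer.writerow([sample_id, *(values.get(c, "") for c in columns)])


def load_clinical(path):
    """Read the CSV back, skipping empty cells so missing values stay missing."""
    with open(path, newline="") as handle:
        return {
            row["id"]: {
                name: float(cell)
                for name, cell in row.items()
                if name != "id" and cell != ""
            }
            for row in csv.DictReader(handle)
        }


def roundtrip(rows):
    """Save to a temporary file and load again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "clinical.csv"
        save_clinical(rows, path)
        return load_clinical(path)
```

A roundtrip test then asserts that `roundtrip(rows) == rows`, including for samples with missing biomarkers.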

Comment thread biolearn/load.py Outdated
return df


def load_nhanes_as_geodata(year):
Member

Should add a data library entry for this.

Member Author

Done. NHANES and FHS are now registered as DataLibrary entries, loaded via the same pattern as the GEO sources:

from biolearn.data_library import DataLibrary
data = DataLibrary().get("NHANES_2010").load()

Added NhanesParser and FhsParser classes in data_library.py, three YAML entries (NHANES_2010, NHANES_2012, FHS), and removed the top-level load_nhanes_as_geodata and load_fhs_as_geodata convenience functions so there's a single path. The example uses DataLibrary too.
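The single-path pattern described here, where named entries select a parser class, can be sketched generically as follows. This is not biolearn's `DataLibrary` code: the classes, the `ENTRIES` mapping (standing in for the YAML entries), the `year` argument, and the returned dicts are all hypothetical placeholders.

```python
class NhanesLikeParser:
    """Hypothetical parser; a real one would download and parse survey files."""

    def __init__(self, year):
        self.year = year

    def parse(self):
        return {"source": "NHANES", "year": self.year}


class FhsLikeParser:
    """Hypothetical parser for a single-file cohort dataset."""

    def parse(self):
        return {"source": "FHS"}


# Parser names map to classes; entries name a parser and its arguments.
PARSERS = {"nhanes": NhanesLikeParser, "fhs": FhsLikeParser}

ENTRIES = {
    "NHANES_2010": {"parser": "nhanes", "args": {"year": 2010}},
    "NHANES_2012": {"parser": "nhanes", "args": {"year": 2012}},
    "FHS": {"parser": "fhs", "args": {}},
}


def load(entry_id):
    """Look up an entry, build its parser, and return the parsed data."""
    entry = ENTRIES[entry_id]
    parser = PARSERS[entry["parser"]](**entry["args"])
    return parser.parse()
```

Keeping the entry table separate from the parser classes means a new survey year needs only a new entry, not a new function.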

marcbal77 force-pushed the feature/clinical-infrastructure branch from a8247fc to e5b1674 on April 23, 2026 05:58
marcbal77 force-pushed the feature/clinical-infrastructure branch from e5b1674 to 2338b69 on May 26, 2026 07:56
marcbal77 changed the title from "feat: clinical infrastructure for blood biomarker clocks" to "feat: clinical layer on GeoData + load NHANES/FHS through the library" on May 26, 2026
…ibrary

- Add a clinical layer on GeoData (samples-as-rows, biomarkers-as-columns,
  same orientation as metadata)
- Add load_nhanes_as_geodata and load_fhs_as_geodata so both data sources
  flow through the library and return a GeoData
- Add a biomarker registry with unit conversions, validated end-to-end
  against real FHS Period 1 data (glucose mg/dL -> mmol/L)
- Drop the unverified UK Biobank source preset; we have not validated it
  against real UK Biobank data
- Add an example showing the NHANES through-the-library pattern

The required_features() interface and the consumer-facing missing-data
error path will be a separate PR.

Addresses bio-learn#194
marcbal77 force-pushed the feature/clinical-infrastructure branch from 2338b69 to 24f4644 on May 26, 2026 08:16