
AI-sandbox/hprc-pclai

Point Cloud Local Ancestry Inference (PCLAI) — HPRC Release 2

Quick links: Read the Manual: PCLAI Manual v.0.1 | Reference PCA space (PC1–PC2 + metadata): Reference PCA Metadata | Index files: GRCh38 - CHM13 - Assembly | Official PCLAI repo: PCLAI Code | Read the Preprint: biorxiv

PCLAI is a deep learning-based approach for inferring continuous population genetic structure along the genome. Instead of assigning each genomic window to a discrete ancestry label, PCLAI predicts a continuous coordinate (e.g., a point in PC1–PC2 space) for every window, together with a per-window confidence score.


What PCLAI provides

For each genomic window (1000 SNPs), PCLAI outputs:

  • Continuous coordinates per window
    A low-dimensional coordinate (e.g., (PC1, PC2)) representing where that window lies in a reference genetic space.

  • Confidence score per window
    A value in [0, 1000]; higher values indicate greater confidence. Very low-confidence predictions have already been filtered out of the distributed BED files.
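
If the built-in filtering is not strict enough for a downstream analysis, the score column can be thresholded further. The following is a minimal sketch, not part of PCLAI itself: the function name `filter_by_confidence` and the cutoff of 500 are illustrative assumptions, and the rows are assumed to be tab-separated BED lines with the score in the fifth column.

```python
def filter_by_confidence(bed_lines, min_score=500):
    """Yield BED rows (as lists of fields) whose score is >= min_score.

    min_score=500 is an arbitrary illustrative cutoff, not a recommended value.
    """
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and genome-browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if int(fields[4]) >= min_score:
            yield fields
```

Because the function accepts any iterable of lines, it can be applied directly to an open file handle, for example `list(filter_by_confidence(open(path)))` where `path` points to one of the BED files.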

PCLAI is naturally a regression method in a coordinate space. For HPRC Release 2, coordinates are reported in PCA space as a default surrogate for genetic distance.


How HPRC Release 2 results were generated (high level)

  1. Reference embedding: Construct a reference PCA embedding (from 1000 Genomes using the Reference PCA Metadata).
  2. Windows: Split each haplotype into fixed windows of 1000 SNPs.
  3. Inference: Predict a coordinate for each window in the reference PCA space.
  4. Confidence: Output a confidence score per window for QC / filtering.
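
The windowing in step 2 can be sketched as follows. This is an illustration under stated assumptions rather than the PCLAI implementation: the helper name `split_into_windows` is hypothetical, SNP positions are assumed to be sorted and 0-based, each interval is assumed to run from the first SNP of the window to one past the last, and a trailing window with fewer than 1000 SNPs is kept. How PCLAI actually treats window boundaries and trailing SNPs is not stated here.

```python
def split_into_windows(positions, window_size=1000):
    """Group sorted 0-based SNP positions into consecutive fixed-size windows.

    Returns a list of half-open (start, end) intervals, BED style.
    The last window may contain fewer than window_size SNPs (an assumption).
    """
    windows = []
    for i in range(0, len(positions), window_size):
        chunk = positions[i:i + window_size]
        windows.append((chunk[0], chunk[-1] + 1))
    return windows
```

Because windows are defined by SNP count rather than by base pairs, their physical length varies with local SNP density.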

Discrete ancestry labeling is optional: you can bin coordinates into categories after the fact, but the primary output is continuous. If you require PCLAI discretization for downstream tasks, consult our Manual.
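
One simple way to bin coordinates after the fact is to assign each window to the nearest of a set of labeled centroids. The sketch below only illustrates that idea; the Manual describes the discretization actually used. The function name `nearest_centroid` and whatever centroid labels and coordinates you pass in are assumptions.

```python
import math

def nearest_centroid(coord, centroids):
    """Return the label of the centroid closest (Euclidean) to coord.

    centroids maps a label to a coordinate tuple of the same length as coord.
    """
    return min(centroids, key=lambda label: math.dist(coord, centroids[label]))
```

For example, with two made-up centroids `{"A": (0.0, 0.0), "B": (1.0, -1.5)}`, the coordinate `(0.438, -1.398)` is assigned to `"B"`.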

If you need to inpaint missing windows for downstream tasks, see the recommendation in the Manual.


Output format (BED)

We provide local ancestry results as BED, which works well in genome browsers and supports interval coloring via itemRgb.

Field        Description
chrom        Chromosome
chromStart   Window start (0-based, inclusive)
chromEnd     Window end (0-based, exclusive)
name         {sample}/{hap}/{chrom}_wXXXX_(x,y), where (x,y) are the predicted coordinates (e.g., (PC1,PC2))
score        Confidence score in [0,1000] (higher = more confident)
strand       .
thickStart   Equals chromStart
thickEnd     Equals chromEnd
itemRgb      R,G,B color derived from the predicted coordinate (exported as RGB; generated from a perceptual mapping)
centroid     Discretized PCLAI annotation of the window, corresponding to the ancestry centroid

Example BED row (the nine standard BED fields):

chr1    14486   805864  HG00097/h1/chr1_w0001_(0.438,-1.398)    991 .   14486   805864  222,162,255
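
A row like the one above can be unpacked with a short parser. This is a sketch under assumptions: the names `Window`, `NAME_RE`, and `parse_bed_row` are illustrative, fields are assumed to be whitespace-separated with no spaces inside the name field, and only the first nine columns are read.

```python
import re
from typing import NamedTuple, Tuple

# Matches {sample}/{hap}/{chrom}_wXXXX_(x,y) as described in the field table.
NAME_RE = re.compile(
    r"^(?P<sample>[^/]+)/(?P<hap>[^/]+)/(?P<chrom>.+)"
    r"_w(?P<window>\d+)_\((?P<coords>[^)]*)\)$"
)

class Window(NamedTuple):
    chrom: str
    start: int
    end: int
    sample: str
    hap: str
    window: int
    coords: Tuple[float, ...]
    score: int
    rgb: Tuple[int, ...]

def parse_bed_row(line):
    """Parse one BED row into a Window record."""
    fields = line.split()
    match = NAME_RE.match(fields[3])
    if match is None:
        raise ValueError(f"unexpected name field: {fields[3]!r}")
    coords = tuple(float(v) for v in match["coords"].split(","))
    rgb = tuple(int(v) for v in fields[8].split(","))
    return Window(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        sample=match["sample"],
        hap=match["hap"],
        window=int(match["window"]),
        coords=coords,
        score=int(fields[4]),
        rgb=rgb,
    )
```

Applied to the example row, this yields sample `HG00097`, haplotype `h1`, window index 1, coordinates `(0.438, -1.398)`, and score 991.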

Visualization tip: itemRgb lets you color each window by position in the embedding (e.g., mapping a 2D coordinate into a perceptual color space → RGB), so continuous shifts along the genome are visually apparent.
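
To illustrate the idea of coloring by embedding position, the sketch below maps a 2D coordinate to RGB with a plain HSV wheel: the angle around the origin sets the hue and the distance from the origin sets the saturation. This is not the mapping used to produce the distributed itemRgb values, and HSV is not a perceptually uniform space; the function name `coord_to_rgb` and the `max_radius` default are assumptions chosen for the example.

```python
import colorsys
import math

def coord_to_rgb(x, y, max_radius=3.0):
    """Map a 2D coordinate to an (R, G, B) tuple of ints in 0..255.

    Hue follows the angle of (x, y); saturation grows with the distance
    from the origin and is capped at max_radius (an illustrative value).
    """
    hue = (math.atan2(y, x) / (2 * math.pi)) % 1.0
    saturation = min(math.hypot(x, y) / max_radius, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))
```

With this scheme, nearby coordinates get similar colors, so a gradual shift in predicted position along a chromosome appears as a gradual color change in a genome browser.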

Can I train my own PCLAI model?

Yes! If you want to train PCLAI on your own data, follow the steps in our official PCLAI repo.

Cite

When using the PCLAI method or PCLAI outputs, please cite the following paper:

@article{geleta_pclai_2026,
    author = {Geleta, Margarita and Mas Montserrat, Daniel and Ioannidis, Nilah M. and Ioannidis, Alexander G.},
    title = {{Point cloud local ancestry inference (PCLAI): continuous coordinate-based ancestry along the genome}},
    year = {2026},
    journal = {biorxiv},
    doi = {10.64898/2026.03.23.713813}
}
