perf: use jemalloc for make_examples in the runtime image #1087

Open: tfenne wants to merge 2 commits into google:r1.10 from tfenne:tf_jemalloc-image

tfenne commented Jun 22, 2026

What

Preload jemalloc as the allocator for the make_examples family of binaries in the published Docker image, instead of the default glibc malloc.

Why

make_examples is a CPU-bound stage, and a large share of its work — pileup construction and local realignment — is allocation-heavy (many short-lived objects across many worker shards). jemalloc handles that allocation pattern substantially better than glibc malloc.

The preload is scoped to the make_examples family only. call_variants is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact

~7.3% faster make_examples on a 30× WGS HG003 chr20 run with the production wgs config (7-channel model + --call_small_model_examples): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).
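For anyone re-checking the headline number, the percentage is just the wall-clock reduction relative to the glibc baseline. A minimal sketch of that arithmetic follows; the function name `speedup_percent` is made up for illustration and is not part of the PR.

```python
def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Wall-clock reduction expressed as a percentage of the baseline runtime."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Mean runtimes quoted above for the chr20 run (glibc malloc vs jemalloc).
glibc_s, jemalloc_s = 231.0, 214.2
print(f"{speedup_percent(glibc_s, jemalloc_s):.1f}% faster")  # prints "7.3% faster"
```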

Verified the wrapper actually engages it: the make_examples python process launched via the wrapped entrypoint maps libjemalloc.so.2 and has LD_PRELOAD=libjemalloc.so.2 in its environment; the unwrapped entrypoint does not.
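One way to repeat that check on Linux is to read the process's `/proc/<pid>/maps` and `/proc/<pid>/environ`. The sketch below is an assumption about how such a check could be scripted, not the exact commands used for this PR; the helper names (`maps_mention`, `environ_preloads`, `check_pid`) are hypothetical.

```python
from pathlib import Path


def maps_mention(maps_text: str, soname: str) -> bool:
    """True if any file-backed mapping in /proc/<pid>/maps text names `soname`."""
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and soname in fields[5]:
            return True
    return False


def environ_preloads(environ_bytes: bytes, soname: str) -> bool:
    """True if a NUL-separated /proc/<pid>/environ blob has `soname` in LD_PRELOAD."""
    for entry in environ_bytes.split(b"\0"):
        key, _, value = entry.partition(b"=")
        if key == b"LD_PRELOAD":
            return soname.encode() in value
    return False


def check_pid(pid: int, soname: str = "libjemalloc.so.2") -> dict:
    """Inspect a live process; needs permission to read its /proc entries."""
    proc = Path("/proc") / str(pid)
    return {
        "mapped": maps_mention((proc / "maps").read_text(), soname),
        "preload_env": environ_preloads((proc / "environ").read_bytes(), soname),
    }
```

For a make_examples process started through the wrapped entrypoint, both values would be expected to be `True`; for the unwrapped entrypoint, both `False`.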

Changes (1 file)

  • Dockerfile:
    • install the distro libjemalloc2 package in the runtime image;
    • prefix LD_PRELOAD=libjemalloc.so.2 on the make_examples,
      multisample_make_examples, and make_examples_somatic wrappers.

The bare soname (libjemalloc.so.2, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.
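The actual change lives in the Dockerfile wrappers, but the scoping idea can be sketched in Python for readers who launch these binaries from their own driver scripts. This is an illustration only: `env_for` and the constant names are hypothetical, and the binary names are the three listed above.

```python
import os

# Bare soname, left to the dynamic loader's default search path (no hardcoded path).
PRELOAD_SONAME = "libjemalloc.so.2"

# Only the make_examples family gets the preload; call_variants and others do not.
PRELOADED_BINARIES = frozenset(
    {"make_examples", "multisample_make_examples", "make_examples_somatic"}
)


def env_for(binary: str, base_env=None) -> dict:
    """Return a copy of the environment, adding LD_PRELOAD only for scoped binaries."""
    env = dict(os.environ if base_env is None else base_env)
    if binary in PRELOADED_BINARIES:
        # Mirrors a `LD_PRELOAD=libjemalloc.so.2 <cmd>` prefix: it overrides any
        # LD_PRELOAD already present in the copied environment.
        env["LD_PRELOAD"] = PRELOAD_SONAME
    return env


# Usage (not executed here): subprocess.run([binary, *args], env=env_for(binary))
```

Because the dictionary is a copy, the preload never leaks into the parent process or into sibling stages.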

Correctness

This is an allocator swap only — no algorithmic or output change. jemalloc is a drop-in malloc/free replacement; make_examples output is bit-identical. No source files are touched.

Notes

I also tested mimalloc and rpmalloc; the former was slower than glibc malloc, while rpmalloc was faster but not as fast as jemalloc.

make_examples spends a large share of time in allocation-heavy pileup and
local-realignment work. Preloading jemalloc (vs glibc malloc) measurably
reduces its wall-clock; it has no measurable effect on call_variants (TF
inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers
only. Installed via the distro libjemalloc2 package; the bare soname
(libjemalloc.so.2) keeps it architecture-portable.

tfenne force-pushed the tf_jemalloc-image branch from f5129f6 to da4f616 on June 23, 2026 04:26
… pangenome images

The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload.

This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.

pichuan (Collaborator) commented Jun 24, 2026

Hi @tfenne,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

@pichuan pichuan self-assigned this Jun 24, 2026

tfenne (Author) commented Jun 24, 2026

Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.

tfenne (Author) commented Jun 26, 2026

I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a c8a.4xlarge at AWS with 16 cores / 16 shards. In that setup:

baseline runtime: 113.1
pr runtime: 105.7
% change: ~6.5% improvement

@pichuan pichuan self-requested a review June 26, 2026 16:37

pgrosu commented Jun 26, 2026

Hi Tim (@tfenne) and Pi-Chuan (@pichuan),

Since make_examples is launched as multiple single-threaded processes (which, unlike the threads of a multi-threaded application, do not share a heap), you can probably optimize it further through the MALLOC_CONF environment variable by turning off or limiting unused resources. A more process-centric starting point might be the following configuration:

export MALLOC_CONF="narenas:1,background_thread:true,tcache:true,tcache_max:65536,dirty_decay_ms:10000,muzzy_decay_ms:5000"

and then tune from there by playing with the cache size and cleanup as it relates to the number of threads for different sample types. Fortunately, jemalloc provides many options to play with.
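If the tuning is driven from a script, it can help to keep the options in a dictionary and serialize them into the comma-separated `key:value` form that MALLOC_CONF expects. The sketch below assumes exactly the option set suggested above; the helper names are hypothetical, and option names can differ between jemalloc releases, so the man page for the installed version is worth checking.

```python
def malloc_conf(options: dict) -> str:
    """Serialize options into jemalloc's MALLOC_CONF syntax: key:value,key:value."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


def parse_malloc_conf(text: str) -> dict:
    """Inverse of malloc_conf, handy for tweaking one value in an existing string."""
    return dict(item.split(":", 1) for item in text.split(",") if item)


# Starting point taken from the suggestion above; values are kept as strings.
suggested = {
    "narenas": "1",
    "background_thread": "true",
    "tcache": "true",
    "tcache_max": "65536",
    "dirty_decay_ms": "10000",
    "muzzy_decay_ms": "5000",
}

# Example sweep: vary only the thread-cache ceiling between benchmark runs.
for tcache_max in ("32768", "65536", "131072"):
    trial = dict(suggested, tcache_max=tcache_max)
    print(f'MALLOC_CONF="{malloc_conf(trial)}"')
```

Each printed line can be exported before a benchmark run, which keeps a sweep over one parameter reproducible.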

It might be interesting to see how it compares to tcmalloc, as the thread lifespan under some runtime condition might favor that architecture. Again, lots of fun to tune through many options ;)

Hope it helps,
~p

pichuan (Collaborator) commented Jun 27, 2026

Hi @tfenne,

I ran benchmarks on the changes from this PR using our internal case studies pipeline (running on c3d-standard-16 instances, 5 trials per sample).

The testing methodology and baseline codebase are the same as described in #1086 (comment).

Output and Stage Observations:

  • md5sum: The output VCFs and gVCFs from gh1087 have exactly the same md5sum hashes as the head937500229 baseline.
  • Other stages: Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical between baseline and PR runs (differences are within standard ~0.5% – 2.0% run-to-run noise). This is expected since LD_PRELOAD is tightly scoped to just the make_examples wrappers.

Runtime summary:

The results show consistent runtime improvements across the board for the make_examples stage (ranging from 4.7% to 12.3% speedup), saving roughly 16 to 26 minutes of make_examples time on the large WGS, PacBio, and ONT runs.

Runtime Comparison: head937500229 (baseline) vs gh1087 (PR)

| uid | stage | head937500229 (mean) | gh1087 (mean) | speedup (sec) | speedup (%) |
|---|---|---|---|---|---|
| wgs | make_examples | 11180.41s (3h 6m 20s) | 9806.62s (2h 43m 26s) | 1373.79s | 12.29% |
| wgs | total | 17412.56s (4h 50m 12s) | 16153.49s (4h 29m 13s) | 1259.07s | 7.23% |
| ont-r104 | make_examples | 13062.19s (3h 37m 42s) | 11527.95s (3h 12m 7s) | 1534.24s | 11.75% |
| ont-r104 | total | 25910.83s (7h 11m 50s) | 24457.72s (6h 47m 37s) | 1453.11s | 5.61% |
| pacbio | make_examples | 8615.63s (2h 23m 35s) | 7671.91s (2h 7m 51s) | 943.72s | 10.95% |
| pacbio | total | 15373.77s (4h 16m 13s) | 14407.26s (4h 0m 7s) | 966.51s | 6.29% |
| hybrid-pacbio-illumina | make_examples | 15513.45s (4h 18m 33s) | 14417.19s (4h 0m 17s) | 1096.26s | 7.07% |
| hybrid-pacbio-illumina | total | 39323.79s (10h 55m 23s) | 38229.74s (10h 37m 9s) | 1094.05s | 2.78% |
| rnaseq | make_examples | 1506.26s (25m 6s) | 1418.62s (23m 38s) | 87.64s | 5.82% |
| rnaseq | total | 1805.56s (30m 5s) | 1718.49s (28m 38s) | 87.07s | 4.82% |
| exome | make_examples | 481.59s (8m 1s) | 458.96s (7m 38s) | 22.63s | 4.70% |
| exome | total | 659.81s (10m 59s) | 637.41s (10m 37s) | 22.40s | 3.39% |

Raw gh1087 Runtime Table

| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 458.96 | 7.741 | 5 | 7m 38s |
| exome | HG003 | call_variants | 151.24 | 1.731 | 5 | 2m 31s |
| exome | HG003 | postprocess_variants | 27.21 | 0.177 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.53 | 0.038 | 5 | 5s |
| exome | HG003 | total | 637.41 | 9.455 | 5 | 10m 37s |
| hybrid-pacbio-illumina | HG003 | make_examples | 14417.19 | 39.708 | 5 | 4h 0m 17s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23515.27 | 29.91 | 5 | 6h 31m 55s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 297.29 | 6.776 | 5 | 4m 57s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 245.6 | 2.75 | 5 | 4m 5s |
| hybrid-pacbio-illumina | HG003 | total | 38229.74 | 47.021 | 5 | 10h 37m 9s |
| ont-r104 | HG003 | make_examples | 11527.95 | 17.278 | 5 | 3h 12m 7s |
| ont-r104 | HG003 | call_variants | 11905.93 | 120.625 | 5 | 3h 18m 25s |
| ont-r104 | HG003 | postprocess_variants | 1023.84 | 8.715 | 5 | 17m 3s |
| ont-r104 | HG003 | vcf_stats | 359.27 | 1.067 | 5 | 5m 59s |
| ont-r104 | HG003 | total | 24457.72 | 127.981 | 5 | 6h 47m 37s |
| pacbio | HG003 | make_examples | 7671.91 | 22.629 | 5 | 2h 7m 51s |
| pacbio | HG003 | call_variants | 6204.24 | 7.886 | 5 | 1h 43m 24s |
| pacbio | HG003 | postprocess_variants | 531.11 | 8.841 | 5 | 8m 51s |
| pacbio | HG003 | vcf_stats | 282.95 | 1.985 | 5 | 4m 42s |
| pacbio | HG003 | total | 14407.26 | 23.523 | 5 | 4h 0m 7s |
| rnaseq | HG005 | make_examples | 1418.62 | 9.414 | 5 | 23m 38s |
| rnaseq | HG005 | call_variants | 94.95 | 0.119 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.92 | 0.65 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.04 | 5 | 4s |
| rnaseq | HG005 | total | 1718.49 | 9.864 | 5 | 28m 38s |
| wgs | HG003 | make_examples | 9806.62 | 30.525 | 5 | 2h 43m 26s |
| wgs | HG003 | call_variants | 5935.29 | 192.789 | 5 | 1h 38m 55s |
| wgs | HG003 | postprocess_variants | 411.57 | 8.592 | 5 | 6m 51s |
| wgs | HG003 | vcf_stats | 254.71 | 1.315 | 5 | 4m 14s |
| wgs | HG003 | total | 16153.49 | 201.882 | 5 | 4h 29m 13s |

Raw head937500229 Runtime Table

| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 481.59 | 8.884 | 5 | 8m 1s |
| exome | HG003 | call_variants | 150.94 | 0.808 | 5 | 2m 30s |
| exome | HG003 | postprocess_variants | 27.28 | 0.182 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.54 | 0.058 | 5 | 5s |
| exome | HG003 | total | 659.81 | 9.358 | 5 | 10m 59s |
| hybrid-pacbio-illumina | HG003 | make_examples | 15513.45 | 51.654 | 5 | 4h 18m 33s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23520.08 | 29.808 | 5 | 6h 32m 0s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 290.26 | 4.803 | 5 | 4m 50s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 243.47 | 2.018 | 5 | 4m 3s |
| hybrid-pacbio-illumina | HG003 | total | 39323.79 | 67.138 | 5 | 10h 55m 23s |
| ont-r104 | HG003 | make_examples | 13062.19 | 156.502 | 5 | 3h 37m 42s |
| ont-r104 | HG003 | call_variants | 11814.9 | 7.661 | 5 | 3h 16m 54s |
| ont-r104 | HG003 | postprocess_variants | 1033.73 | 8.6 | 5 | 17m 13s |
| ont-r104 | HG003 | vcf_stats | 361.09 | 3.496 | 5 | 6m 1s |
| ont-r104 | HG003 | total | 25910.83 | 158.26 | 5 | 7h 11m 50s |
| pacbio | HG003 | make_examples | 8615.63 | 155.868 | 5 | 2h 23m 35s |
| pacbio | HG003 | call_variants | 6237.02 | 32.756 | 5 | 1h 43m 57s |
| pacbio | HG003 | postprocess_variants | 521.13 | 8.723 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 280.33 | 1.387 | 5 | 4m 40s |
| pacbio | HG003 | total | 15373.77 | 182.451 | 5 | 4h 16m 13s |
| rnaseq | HG005 | make_examples | 1506.26 | 9.424 | 5 | 25m 6s |
| rnaseq | HG005 | call_variants | 94.85 | 0.066 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.45 | 1.424 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.057 | 5 | 4s |
| rnaseq | HG005 | total | 1805.56 | 10.267 | 5 | 30m 5s |
| wgs | HG003 | make_examples | 11180.41 | 111.767 | 5 | 3h 6m 20s |
| wgs | HG003 | call_variants | 5825.96 | 5.848 | 5 | 1h 37m 5s |
| wgs | HG003 | postprocess_variants | 406.19 | 1.954 | 5 | 6m 46s |
| wgs | HG003 | vcf_stats | 252.8 | 1.256 | 5 | 4m 12s |
| wgs | HG003 | total | 17412.56 | 110.714 | 5 | 4h 50m 12s |

I have created an internal changelist for the team to review. Since it's the weekend, we'll review and submit internally next week.
Out of curiosity, I'll also try to test this on our regular n2-standard-96 setup to see how the speedups hold up there!
