perf: use jemalloc for make_examples in the runtime image by tfenne · Pull Request #1087 · google/deepvariant

tfenne · 2026-06-22T22:56:52Z

What

Preload jemalloc as the allocator for the make_examples family of binaries in the published Docker image, instead of the default glibc malloc.

Why

make_examples is a CPU-bound stage, and a large share of its work — pileup construction and local realignment — is allocation-heavy (many short-lived objects across many worker shards). jemalloc handles that allocation pattern substantially better than glibc malloc.

The preload is scoped to the make_examples family only. call_variants is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact

~7.3% faster make_examples on a 30× WGS HG003 chr20 run with the production wgs config (7-channel model + --call_small_model_examples): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).
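For readers who want to reproduce the percentage, the arithmetic is sketched below. This is illustrative only and not part of the PR; the function name `speedup_percent` is made up for the example, and the two inputs are the wall-clock means quoted above.

```python
def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative wall-clock reduction of candidate vs. baseline, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# Figures quoted above: 231.0 s with glibc malloc, 214.2 s with jemalloc.
print(f"{speedup_percent(231.0, 214.2):.1f}%")  # prints 7.3%
```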

Verified the wrapper actually engages it: the make_examples python process launched via the wrapped entrypoint maps libjemalloc.so.2 and has LD_PRELOAD=libjemalloc.so.2 in its environment; the unwrapped entrypoint does not.
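One way such a check could be scripted on Linux is sketched below. It is an assumption about how to automate the verification, not the procedure actually used here: the helper names are hypothetical, and the code only relies on the standard `/proc/<pid>/maps` and `/proc/<pid>/environ` text formats.

```python
from pathlib import Path


def maps_mention(maps_text: str, needle: str = "libjemalloc") -> bool:
    """True if any mapped file path in /proc/<pid>/maps-style text contains needle."""
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            return True
    return False


def environ_has_preload(environ_bytes: bytes, soname: str) -> bool:
    """True if a NUL-separated /proc/<pid>/environ blob has soname in LD_PRELOAD."""
    for entry in environ_bytes.split(b"\0"):
        key, _, value = entry.partition(b"=")
        if key == b"LD_PRELOAD" and soname.encode() in value:
            return True
    return False


def check_pid(pid: int, soname: str = "libjemalloc.so.2") -> bool:
    """Linux only: inspect a live process (needs permission to read its /proc files)."""
    proc = Path("/proc") / str(pid)
    return (maps_mention((proc / "maps").read_text(), soname)
            and environ_has_preload((proc / "environ").read_bytes(), soname))
```

Calling `check_pid` on a make_examples process started through the wrapped entrypoint would be expected to return `True`, and `False` for the unwrapped one, if the observation above holds.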

Changes (1 file)

Dockerfile:
- install the distro libjemalloc2 package in the runtime image;
- prefix LD_PRELOAD=libjemalloc.so.2 on the make_examples,
  multisample_make_examples, and make_examples_somatic wrappers.

The bare soname (libjemalloc.so.2, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.
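To illustrate what scoping the preload to a single command means, here is a rough Python analogue. It is not the Dockerfile change itself (which prefixes the variable in the shell wrappers), and `with_preload` and `run_preloaded` are hypothetical names. The point it shows is that only the child's environment gets `LD_PRELOAD`, with the bare soname left for the dynamic linker to resolve.

```python
import os
import subprocess


def with_preload(env: dict, soname: str = "libjemalloc.so.2") -> dict:
    """Return a copy of env with soname first in LD_PRELOAD; the input is untouched."""
    out = dict(env)
    existing = out.get("LD_PRELOAD", "")
    out["LD_PRELOAD"] = f"{soname}:{existing}" if existing else soname
    return out


def run_preloaded(cmd: list) -> int:
    """Run one command with the preload; the parent and its other children are unaffected."""
    return subprocess.run(cmd, env=with_preload(os.environ)).returncode
```

Because the variable is set per command rather than exported globally, other stages launched from the same image keep the default allocator.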

Correctness

This is an allocator swap only — no algorithmic or output change. jemalloc is a drop-in malloc/free replacement; make_examples output is bit-identical. No source files are touched.

Notes

I also tested mimalloc and rpmalloc; the former was slower than glibc malloc, while rpmalloc was faster but not as fast as jemalloc.

make_examples spends a large share of time in allocation-heavy pileup and local-realignment work. Preloading jemalloc (vs glibc malloc) measurably reduces its wall-clock; it has no measurable effect on call_variants (TF inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers only. Installed via the distro libjemalloc2 package; the bare soname (libjemalloc.so.2) keeps it architecture-portable.

… pangenome images

The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload.

This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.

pichuan · 2026-06-24T04:32:22Z

Hi @tfenne,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

tfenne · 2026-06-24T04:49:35Z

Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.

tfenne · 2026-06-26T15:14:20Z

I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a c8a.4xlarge at AWS with 16 cores / 16 shards. In that setup:

baseline runtime: 113.1
pr runtime: 105.7
% change: ~6.5% improvement

pgrosu · 2026-06-26T20:56:16Z

Hi Tim (@tfenne) and Pi-Chuan (@pichuan),

Since make_examples is launched as multiple single-threaded processes (which, unlike the threads of a multi-threaded application, do not share a heap), you can probably optimize it further through the MALLOC_CONF environment variable by turning off or limiting unused resources. A more process-centric starting point might be the following configuration:

export MALLOC_CONF="narenas:1,background_thread:true,tcache:true,tcache_max:65536,dirty_decay_ms:10000,muzzy_decay_ms:5000"

and then tune from there by playing with the cache size and cleanup as it relates to the number of threads for different sample types. Fortunately, jemalloc provides many options to play with.
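When sweeping these settings it can help to build the string programmatically. The sketch below is only a convenience for that: `format_malloc_conf` is a hypothetical helper, the option names are copied verbatim from the line above, and nothing here checks whether a given jemalloc build accepts each option.

```python
def format_malloc_conf(options: dict) -> str:
    """Serialize options as a comma-separated key:value list for MALLOC_CONF."""
    def render(value):
        # jemalloc expects lowercase true/false, not Python's True/False.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)
    return ",".join(f"{key}:{render(value)}" for key, value in options.items())


# Starting point suggested above; every value is a candidate for tuning.
suggested = {
    "narenas": 1,
    "background_thread": True,
    "tcache": True,
    "tcache_max": 65536,
    "dirty_decay_ms": 10000,
    "muzzy_decay_ms": 5000,
}
print(format_malloc_conf(suggested))
```

The resulting string can be placed in the `MALLOC_CONF` entry of the child environment for each benchmark run, changing one option at a time.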

It might be interesting to see how it compares to tcmalloc, as the thread lifespan under some runtime conditions might favor that architecture. Again, lots of fun to tune through many options ;)

Hope it helps,
~p

pichuan · 2026-06-27T06:16:47Z

Hi @tfenne,

I ran benchmarks on the changes from this PR using our internal case studies pipeline (running on c3d-standard-16 instances, 5 trials per sample).

The testing methodology and baseline codebase are the same as described in #1086 (comment).

Output and Stage Observations:

- md5sum: The output VCFs and gVCFs from gh1087 have exactly the same md5sum hashes as the head937500229 baseline.
- Other stages: Runtimes for call_variants, postprocess_variants, and vcf_stats remain virtually identical between baseline and PR runs (differences are within standard ~0.5% – 2.0% run-to-run noise). This is expected since LD_PRELOAD is tightly scoped to just the make_examples wrappers.
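For anyone repeating the output check outside the internal pipeline, a byte-level comparison equivalent to matching `md5sum` output might look like the sketch below. The helper names and any file paths passed to them are placeholders, not the pipeline's actual tooling.

```python
import hashlib


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file's bytes, read in chunks so large VCFs need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(baseline_path, candidate_path) -> bool:
    """True when both files hash identically, which is what matching md5sum output shows."""
    return md5_of(baseline_path) == md5_of(candidate_path)
```

Note that this compares file bytes as written, so it is as strict as the md5sum comparison described above.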

Runtime summary:

The results show consistent runtime improvements across the board for the make_examples stage (ranging from 4.7% to 12.3% speedup), saving up to ~22–25 minutes on the large WGS, PacBio, and ONT runs.

Runtime Comparison: `head937500229` (baseline) vs `gh1087` (PR)

uid	stage	head937500229 (mean)	gh1087 (mean)	speedup (sec)	speedup (%)
wgs	make_examples	11180.41s (3h 6m 20s)	9806.62s (2h 43m 26s)	1373.79s	12.29%
	total	17412.56s (4h 50m 12s)	16153.49s (4h 29m 13s)	1259.07s	7.23%
ont-r104	make_examples	13062.19s (3h 37m 42s)	11527.95s (3h 12m 7s)	1534.24s	11.75%
	total	25910.83s (7h 11m 50s)	24457.72s (6h 47m 37s)	1453.11s	5.61%
pacbio	make_examples	8615.63s (2h 23m 35s)	7671.91s (2h 7m 51s)	943.72s	10.95%
	total	15373.77s (4h 16m 13s)	14407.26s (4h 0m 7s)	966.51s	6.29%
hybrid-pacbio-illumina	make_examples	15513.45s (4h 18m 33s)	14417.19s (4h 0m 17s)	1096.26s	7.07%
	total	39323.79s (10h 55m 23s)	38229.74s (10h 37m 9s)	1094.05s	2.78%
rnaseq	make_examples	1506.26s (25m 6s)	1418.62s (23m 38s)	87.64s	5.82%
	total	1805.56s (30m 5s)	1718.49s (28m 38s)	87.07s	4.82%
exome	make_examples	481.59s (8m 1s)	458.96s (7m 38s)	22.63s	4.70%
	total	659.81s (10m 59s)	637.41s (10m 37s)	22.40s	3.39%

`gh1087` Runtime Table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	458.96	7.741	5	7m 38s
exome	HG003	call_variants	151.24	1.731	5	2m 31s
exome	HG003	postprocess_variants	27.21	0.177	5	27s
exome	HG003	vcf_stats	5.53	0.038	5	5s
exome	HG003	total	637.41	9.455	5	10m 37s
hybrid-pacbio-illumina	HG003	make_examples	14417.19	39.708	5	4h 17s
hybrid-pacbio-illumina	HG003	call_variants	23515.27	29.91	5	6h 31m 55s
hybrid-pacbio-illumina	HG003	postprocess_variants	297.29	6.776	5	4m 57s
hybrid-pacbio-illumina	HG003	vcf_stats	245.6	2.75	5	4m 5s
hybrid-pacbio-illumina	HG003	total	38229.74	47.021	5	10h 37m 9s
ont-r104	HG003	make_examples	11527.95	17.278	5	3h 12m 7s
ont-r104	HG003	call_variants	11905.93	120.625	5	3h 18m 25s
ont-r104	HG003	postprocess_variants	1023.84	8.715	5	17m 3s
ont-r104	HG003	vcf_stats	359.27	1.067	5	5m 59s
ont-r104	HG003	total	24457.72	127.981	5	6h 47m 37s
pacbio	HG003	make_examples	7671.91	22.629	5	2h 7m 51s
pacbio	HG003	call_variants	6204.24	7.886	5	1h 43m 24s
pacbio	HG003	postprocess_variants	531.11	8.841	5	8m 51s
pacbio	HG003	vcf_stats	282.95	1.985	5	4m 42s
pacbio	HG003	total	14407.26	23.523	5	4h 7s
rnaseq	HG005	make_examples	1418.62	9.414	5	23m 38s
rnaseq	HG005	call_variants	94.95	0.119	5	1m 34s
rnaseq	HG005	postprocess_variants	204.92	0.65	5	3m 24s
rnaseq	HG005	vcf_stats	4.94	0.04	5	4s
rnaseq	HG005	total	1718.49	9.864	5	28m 38s
wgs	HG003	make_examples	9806.62	30.525	5	2h 43m 26s
wgs	HG003	call_variants	5935.29	192.789	5	1h 38m 55s
wgs	HG003	postprocess_variants	411.57	8.592	5	6m 51s
wgs	HG003	vcf_stats	254.71	1.315	5	4m 14s
wgs	HG003	total	16153.49	201.882	5	4h 29m 13s

`head937500229` Runtime Table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	481.59	8.884	5	8m 1s
exome	HG003	call_variants	150.94	0.808	5	2m 30s
exome	HG003	postprocess_variants	27.28	0.182	5	27s
exome	HG003	vcf_stats	5.54	0.058	5	5s
exome	HG003	total	659.81	9.358	5	10m 59s
hybrid-pacbio-illumina	HG003	make_examples	15513.45	51.654	5	4h 18m 33s
hybrid-pacbio-illumina	HG003	call_variants	23520.08	29.808	5	6h 32m 0s
hybrid-pacbio-illumina	HG003	postprocess_variants	290.26	4.803	5	4m 50s
hybrid-pacbio-illumina	HG003	vcf_stats	243.47	2.018	5	4m 3s
hybrid-pacbio-illumina	HG003	total	39323.79	67.138	5	10h 55m 23s
ont-r104	HG003	make_examples	13062.19	156.502	5	3h 37m 42s
ont-r104	HG003	call_variants	11814.9	7.661	5	3h 16m 54s
ont-r104	HG003	postprocess_variants	1033.73	8.6	5	17m 13s
ont-r104	HG003	vcf_stats	361.09	3.496	5	6m 1s
ont-r104	HG003	total	25910.83	158.26	5	7h 11m 50s
pacbio	HG003	make_examples	8615.63	155.868	5	2h 23m 35s
pacbio	HG003	call_variants	6237.02	32.756	5	1h 43m 57s
pacbio	HG003	postprocess_variants	521.13	8.723	5	8m 41s
pacbio	HG003	vcf_stats	280.33	1.387	5	4m 40s
pacbio	HG003	total	15373.77	182.451	5	4h 16m 13s
rnaseq	HG005	make_examples	1506.26	9.424	5	25m 6s
rnaseq	HG005	call_variants	94.85	0.066	5	1m 34s
rnaseq	HG005	postprocess_variants	204.45	1.424	5	3m 24s
rnaseq	HG005	vcf_stats	4.94	0.057	5	4s
rnaseq	HG005	total	1805.56	10.267	5	30m 5s
wgs	HG003	make_examples	11180.41	111.767	5	3h 6m 20s
wgs	HG003	call_variants	5825.96	5.848	5	1h 37m 5s
wgs	HG003	postprocess_variants	406.19	1.954	5	6m 46s
wgs	HG003	vcf_stats	252.8	1.256	5	4m 12s
wgs	HG003	total	17412.56	110.714	5	4h 50m 12s
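The comparison table can be recomputed from the two raw tables. The sketch below shows one way to do that, assuming the tables are available as tab-separated text; only a few rows are copied in, the columns are trimmed for brevity, and `load` and `compare` are illustrative names rather than the pipeline's own code.

```python
import csv
import io

# A few rows copied from the two raw tables above (columns trimmed for brevity).
BASELINE = """uid\tstage\tmean_runtime
wgs\tmake_examples\t11180.41
wgs\ttotal\t17412.56
exome\tmake_examples\t481.59
exome\ttotal\t659.81
"""
CANDIDATE = """uid\tstage\tmean_runtime
wgs\tmake_examples\t9806.62
wgs\ttotal\t16153.49
exome\tmake_examples\t458.96
exome\ttotal\t637.41
"""


def load(tsv_text):
    """Map (uid, stage) -> mean runtime in seconds."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["uid"], row["stage"]): float(row["mean_runtime"]) for row in reader}


def compare(baseline, candidate):
    """Yield (uid, stage, seconds saved, percent saved) for keys present in both."""
    for key in baseline:
        if key in candidate:
            saved = baseline[key] - candidate[key]
            yield (*key, round(saved, 2), round(100.0 * saved / baseline[key], 2))


for row in compare(load(BASELINE), load(CANDIDATE)):
    print(*row, sep="\t")
```

The percentage is taken relative to the baseline runtime, which matches how the speedup column in the comparison table works out.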

I have created an internal changelist for the team to review. Since it's the weekend, we'll review and submit internally next week.
Out of curiosity, I'll also try to test this on our regular n2-standard-96 setup to see how the speedups hold up there!

tfenne force-pushed the tf_jemalloc-image branch from f5129f6 to da4f616 · June 23, 2026 04:26

pichuan self-assigned this Jun 24, 2026

pichuan self-requested a review June 26, 2026 16:37

tfenne wants to merge 2 commits into google:r1.10 from tfenne:tf_jemalloc-image

3 participants
