perf: use jemalloc for make_examples in the runtime image #1087
make_examples spends a large share of its time in allocation-heavy pileup and local-realignment work. Preloading jemalloc (instead of glibc malloc) measurably reduces its wall-clock time; it has no measurable effect on call_variants (TF inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers only. jemalloc is installed via the distro libjemalloc2 package; the bare soname (libjemalloc.so.2) keeps it architecture-portable.
… pangenome images
The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload.
This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.
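As a rough illustration of the pattern described above, the sketch below generates a wrapper script that sets `LD_PRELOAD` only for the command it launches. The function name `write_preload_wrapper` and its arguments are hypothetical, and the actual Dockerfiles may construct their wrappers differently; the only details taken from this PR are the `libjemalloc2` package and the `libjemalloc.so.2` soname.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): write a wrapper at $1 that runs
# the command $2 with LD_PRELOAD set, leaving other processes untouched.
write_preload_wrapper() {
  wrapper_path=$1
  target_cmd=$2
  # Bare soname, as in the PR: the dynamic linker resolves it from its
  # default search path, so no architecture-specific path is needed.
  preload_lib=${3:-libjemalloc.so.2}

  cat > "$wrapper_path" <<EOF
#!/bin/sh
# env sets LD_PRELOAD for the target process only.
exec env LD_PRELOAD=$preload_lib $target_cmd "\$@"
EOF
  chmod +x "$wrapper_path"
}
```

In an image build this would run after the `libjemalloc2` package is installed, and only for the make_examples-family wrappers; a wrapper for call_variants would simply omit the `env LD_PRELOAD=...` prefix.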
Hi @tfenne,
Thanks for the PR! Since I believe you're already familiar with our process, I'll go ahead and start the review.
As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description. Please let me know if you have any concerns with this approach.
-pichuan
Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.
I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a baseline runtime: 113.1
Hi Tim (@tfenne) and Pi-Chuan (@pichuan), Since
and then tune from there by playing with the cache size and cleanup as it relates to the number of threads for different sample types. Fortunately, It might be interesting to see how it compares to tcmalloc, as the thread lifespan under some runtime condition might favor that architecture. Again, lots of fun to tune through many options ;)
Hope it helps,
Hi @tfenne, I ran benchmarks on the changes from this PR using our internal case studies pipeline (running on The testing methodology and baseline codebase are the same as described in #1086 (comment). Output and Stage Observations:
Runtime summary: The results show consistent runtime improvements across the board for the Runtime Comparison:
| uid | stage | head937500229 (mean) | gh1087 (mean) | speedup (sec) | speedup (%) |
|---|---|---|---|---|---|
| wgs | make_examples | 11180.41s (3h 6m 20s) | 9806.62s (2h 43m 26s) | 1373.79s | 12.29% |
| wgs | total | 17412.56s (4h 50m 12s) | 16153.49s (4h 29m 13s) | 1259.07s | 7.23% |
| ont-r104 | make_examples | 13062.19s (3h 37m 42s) | 11527.95s (3h 12m 7s) | 1534.24s | 11.75% |
| ont-r104 | total | 25910.83s (7h 11m 50s) | 24457.72s (6h 47m 37s) | 1453.11s | 5.61% |
| pacbio | make_examples | 8615.63s (2h 23m 35s) | 7671.91s (2h 7m 51s) | 943.72s | 10.95% |
| pacbio | total | 15373.77s (4h 16m 13s) | 14407.26s (4h 0m 7s) | 966.51s | 6.29% |
| hybrid-pacbio-illumina | make_examples | 15513.45s (4h 18m 33s) | 14417.19s (4h 0m 17s) | 1096.26s | 7.07% |
| hybrid-pacbio-illumina | total | 39323.79s (10h 55m 23s) | 38229.74s (10h 37m 9s) | 1094.05s | 2.78% |
| rnaseq | make_examples | 1506.26s (25m 6s) | 1418.62s (23m 38s) | 87.64s | 5.82% |
| rnaseq | total | 1805.56s (30m 5s) | 1718.49s (28m 38s) | 87.07s | 4.82% |
| exome | make_examples | 481.59s (8m 1s) | 458.96s (7m 38s) | 22.63s | 4.70% |
| exome | total | 659.81s (10m 59s) | 637.41s (10m 37s) | 22.40s | 3.39% |
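For reference, the two speedup columns follow directly from the mean runtimes: the absolute speedup is the baseline mean minus the candidate mean, and the percentage is that difference divided by the baseline. A small sketch of the arithmetic (the function name `speedup` is illustrative and not part of the benchmarking pipeline):

```shell
# Print absolute (seconds) and relative (%) speedup from two mean runtimes.
# $1 = baseline mean in seconds, $2 = candidate mean in seconds.
speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    diff = base - cand
    printf "%.2fs %.2f%%\n", diff, 100 * diff / base
  }'
}

speedup 11180.41 9806.62   # wgs make_examples row: prints "1373.79s 12.29%"
```

The same calculation applied to the chr20 numbers in the PR description (231.0s and 214.2s) gives 16.80s and 7.27%, consistent with the quoted ~7.3%.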
gh1087 Runtime Table
| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 458.96 | 7.741 | 5 | 7m 38s |
| exome | HG003 | call_variants | 151.24 | 1.731 | 5 | 2m 31s |
| exome | HG003 | postprocess_variants | 27.21 | 0.177 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.53 | 0.038 | 5 | 5s |
| exome | HG003 | total | 637.41 | 9.455 | 5 | 10m 37s |
| hybrid-pacbio-illumina | HG003 | make_examples | 14417.19 | 39.708 | 5 | 4h 0m 17s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23515.27 | 29.91 | 5 | 6h 31m 55s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 297.29 | 6.776 | 5 | 4m 57s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 245.6 | 2.75 | 5 | 4m 5s |
| hybrid-pacbio-illumina | HG003 | total | 38229.74 | 47.021 | 5 | 10h 37m 9s |
| ont-r104 | HG003 | make_examples | 11527.95 | 17.278 | 5 | 3h 12m 7s |
| ont-r104 | HG003 | call_variants | 11905.93 | 120.625 | 5 | 3h 18m 25s |
| ont-r104 | HG003 | postprocess_variants | 1023.84 | 8.715 | 5 | 17m 3s |
| ont-r104 | HG003 | vcf_stats | 359.27 | 1.067 | 5 | 5m 59s |
| ont-r104 | HG003 | total | 24457.72 | 127.981 | 5 | 6h 47m 37s |
| pacbio | HG003 | make_examples | 7671.91 | 22.629 | 5 | 2h 7m 51s |
| pacbio | HG003 | call_variants | 6204.24 | 7.886 | 5 | 1h 43m 24s |
| pacbio | HG003 | postprocess_variants | 531.11 | 8.841 | 5 | 8m 51s |
| pacbio | HG003 | vcf_stats | 282.95 | 1.985 | 5 | 4m 42s |
| pacbio | HG003 | total | 14407.26 | 23.523 | 5 | 4h 0m 7s |
| rnaseq | HG005 | make_examples | 1418.62 | 9.414 | 5 | 23m 38s |
| rnaseq | HG005 | call_variants | 94.95 | 0.119 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.92 | 0.65 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.04 | 5 | 4s |
| rnaseq | HG005 | total | 1718.49 | 9.864 | 5 | 28m 38s |
| wgs | HG003 | make_examples | 9806.62 | 30.525 | 5 | 2h 43m 26s |
| wgs | HG003 | call_variants | 5935.29 | 192.789 | 5 | 1h 38m 55s |
| wgs | HG003 | postprocess_variants | 411.57 | 8.592 | 5 | 6m 51s |
| wgs | HG003 | vcf_stats | 254.71 | 1.315 | 5 | 4m 14s |
| wgs | HG003 | total | 16153.49 | 201.882 | 5 | 4h 29m 13s |
head937500229 Runtime Table
| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 481.59 | 8.884 | 5 | 8m 1s |
| exome | HG003 | call_variants | 150.94 | 0.808 | 5 | 2m 30s |
| exome | HG003 | postprocess_variants | 27.28 | 0.182 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.54 | 0.058 | 5 | 5s |
| exome | HG003 | total | 659.81 | 9.358 | 5 | 10m 59s |
| hybrid-pacbio-illumina | HG003 | make_examples | 15513.45 | 51.654 | 5 | 4h 18m 33s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23520.08 | 29.808 | 5 | 6h 32m 0s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 290.26 | 4.803 | 5 | 4m 50s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 243.47 | 2.018 | 5 | 4m 3s |
| hybrid-pacbio-illumina | HG003 | total | 39323.79 | 67.138 | 5 | 10h 55m 23s |
| ont-r104 | HG003 | make_examples | 13062.19 | 156.502 | 5 | 3h 37m 42s |
| ont-r104 | HG003 | call_variants | 11814.9 | 7.661 | 5 | 3h 16m 54s |
| ont-r104 | HG003 | postprocess_variants | 1033.73 | 8.6 | 5 | 17m 13s |
| ont-r104 | HG003 | vcf_stats | 361.09 | 3.496 | 5 | 6m 1s |
| ont-r104 | HG003 | total | 25910.83 | 158.26 | 5 | 7h 11m 50s |
| pacbio | HG003 | make_examples | 8615.63 | 155.868 | 5 | 2h 23m 35s |
| pacbio | HG003 | call_variants | 6237.02 | 32.756 | 5 | 1h 43m 57s |
| pacbio | HG003 | postprocess_variants | 521.13 | 8.723 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 280.33 | 1.387 | 5 | 4m 40s |
| pacbio | HG003 | total | 15373.77 | 182.451 | 5 | 4h 16m 13s |
| rnaseq | HG005 | make_examples | 1506.26 | 9.424 | 5 | 25m 6s |
| rnaseq | HG005 | call_variants | 94.85 | 0.066 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.45 | 1.424 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.057 | 5 | 4s |
| rnaseq | HG005 | total | 1805.56 | 10.267 | 5 | 30m 5s |
| wgs | HG003 | make_examples | 11180.41 | 111.767 | 5 | 3h 6m 20s |
| wgs | HG003 | call_variants | 5825.96 | 5.848 | 5 | 1h 37m 5s |
| wgs | HG003 | postprocess_variants | 406.19 | 1.954 | 5 | 6m 46s |
| wgs | HG003 | vcf_stats | 252.8 | 1.256 | 5 | 4m 12s |
| wgs | HG003 | total | 17412.56 | 110.714 | 5 | 4h 50m 12s |
I have created an internal changelist for the team to review. Since it's the weekend, we'll review and submit internally next week.
Out of curiosity, I'll also try to test this on our regular n2-standard-96 setup to see how the speedups hold up there!
What
Preload jemalloc as the allocator for the `make_examples` family of binaries in the published Docker image, instead of the default glibc `malloc`.

Why
`make_examples` is a CPU-bound stage, and a large share of its work, namely pileup construction and local realignment, is allocation-heavy (many short-lived objects across many worker shards). jemalloc handles that allocation pattern substantially better than glibc `malloc`.
The preload is scoped to the make_examples family only. `call_variants` is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact
~7.3% faster `make_examples` on a 30× WGS HG003 chr20 run with the production `wgs` config (7-channel model + `--call_small_model_examples`): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).
Verified the wrapper actually engages it: the `make_examples` python process launched via the wrapped entrypoint maps `libjemalloc.so.2` and has `LD_PRELOAD=libjemalloc.so.2` in its environment; the unwrapped entrypoint does not.

Changes (1 file)
`Dockerfile`: `libjemalloc2` package in the runtime image; `LD_PRELOAD=libjemalloc.so.2` on the `make_examples`, `multisample_make_examples`, and `make_examples_somatic` wrappers.
The bare soname (`libjemalloc.so.2`, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.

Correctness
This is an allocator swap only: no algorithmic or output change. jemalloc is a drop-in `malloc`/`free` replacement; `make_examples` output is bit-identical. No source files are touched.

Notes
I also tested mimalloc and rpmalloc; the former was slower than glibc malloc, while rpmalloc was faster but not as fast as jemalloc.
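The verification mentioned under Impact (the wrapped process has `LD_PRELOAD` in its environment and maps `libjemalloc.so.2`) could be reproduced along the following lines. This is a minimal, Linux-only sketch that assumes `/proc` is available; the function name `check_preload` is hypothetical and this is not necessarily how the check was done for the PR.

```shell
# Report whether process $1 was started with the preload in its environment
# and whether the library is actually mapped into its address space.
# Linux-only: reads /proc/<pid>/environ and /proc/<pid>/maps.
check_preload() {
  pid=$1
  lib=${2:-libjemalloc.so.2}

  # environ is NUL-separated; convert it to one VAR=value entry per line.
  if tr '\0' '\n' < "/proc/$pid/environ" 2>/dev/null \
      | grep '^LD_PRELOAD=' | grep -qF "$lib"; then
    echo "env: yes"
  else
    echo "env: no"
  fi

  # maps lists every file-backed mapping, including preloaded libraries.
  if grep -qF "$lib" "/proc/$pid/maps" 2>/dev/null; then
    echo "mapped: yes"
  else
    echo "mapped: no"
  fi
}
```

Called with the PID of a `make_examples` process started through the wrapper, both lines would be expected to report `yes`; for a process started through an unwrapped entrypoint, both would report `no`.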